hub Canonical reference

GIT: A Generative Image-to-text Transformer for Vision and Language

Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan · 2022 · cs.CV · arXiv 2205.14100

Canonical reference. 88% of citing Pith papers cite this work as background.

28 Pith papers citing it

Background 88% of classified citations

open full Pith review browse 28 citing papers arXiv PDF

abstract

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state of the arts on 12 challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks. Codes are released at \url{https://github.com/microsoft/GenerativeImage2Text}.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 baseline 1

citation-polarity summary

background 7 baseline 1

representative citing papers

Is Dimensionality a Barrier for Retrieval Models?

cs.LG · 2026-05-22 · unverdicted · novelty 8.0

Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse query matrices.

WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

cs.CV · 2026-03-10 · unverdicted · novelty 7.0

WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.

HyperCap: Hyperspectral Land Cover Captioning Dataset for Vision Language Models

cs.CV · 2025-05-18 · unverdicted · novelty 7.0

HyperCap is the first large-scale hyperspectral captioning dataset built from four benchmark HSI datasets using hybrid automated-manual annotations, with evaluations showing classification gains for vision-language models.

Stitch-a-Demo: Video Demonstrations from Multistep Descriptions

cs.CV · 2025-03-18 · unverdicted · novelty 7.0

Stitch-a-Demo is a retrieval-based method that assembles visually coherent video demonstrations from multistep textual descriptions by training on weakly supervised procedural data with hard negatives.

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

cs.CV · 2023-10-23 · unverdicted · novelty 7.0

HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.

Visual Instruction Tuning

cs.CV · 2023-04-17 · unverdicted · novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

cs.CV · 2023-03-28 · conditional · novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

Language Is Not All You Need: Aligning Perception with Language Models

cs.CL · 2023-02-27 · conditional · novelty 7.0

Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.

PaLI: A Jointly-Scaled Multilingual Language-Image Model

cs.CV · 2022-09-14 · conditional · novelty 7.0

PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.

VISOR: A Vision-Language Model-based Test Oracle for Testing Robots

cs.SE · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

VISOR is a VLM-based automated test oracle that evaluates robot task correctness and quality from videos while reporting its own uncertainty, tested on GPT and Gemini across four tasks and over 1000 videos with Gemini showing higher recall and GPT higher precision but low uncertainty-correctness tie

Embedding Arithmetic: A Lightweight, Tuning-Free Framework for Post-hoc Bias Mitigation in Text-to-Image Models

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

Embedding Arithmetic performs vector operations in the embedding space of T2I models to mitigate bias at inference time, outperforming baselines on diversity while preserving coherence via a new Concept Coherence Score.

ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs

cs.CV · 2026-04-04 · unverdicted · novelty 6.0

ITIScore evaluates MLLM image captions via image-to-text-to-image reconstruction consistency and aligns with human judgments on a new 40K-caption benchmark.

From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation

cs.LG · 2026-03-12 · unverdicted · novelty 6.0

EG-GRPO improves autoregressive text-to-image models by reallocating RL updates according to token entropy, excluding low-entropy tokens from reward signals while adding entropy bonuses to high-entropy ones, yielding state-of-the-art results on standard benchmarks.

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

cs.CV · 2025-07-02 · unverdicted · novelty 6.0

Multimodal foundation models achieve respectable but sub-specialist performance on semantic vision tasks and weaker results on geometric tasks when evaluated through prompt chaining on established benchmarks.

SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation

cs.CV · 2024-05-29 · unverdicted · novelty 6.0

SketchDeco performs training-free sketch colourisation via diffusion inversion to insert user colors followed by custom self-attention blending for local fidelity and global harmony.

CogVLM: Visual Expert for Pretrained Language Models

cs.CV · 2023-11-06 · conditional · novelty 6.0

CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

cs.CV · 2023-06-26 · accept · novelty 6.0

A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

cs.CV · 2023-06-23 · unverdicted · novelty 6.0

MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

cs.CV · 2023-05-13 · accept · novelty 6.0

OCRBench provides the largest evaluation suite yet for OCR capabilities in large multimodal models, revealing gaps in multilingual, handwritten, and mathematical text handling.

Sigmoid Loss for Language Image Pre-Training

cs.CV · 2023-03-27 · conditional · novelty 6.0

SigLIP replaces softmax-based contrastive loss with a simple pairwise sigmoid loss for vision-language pre-training, decoupling batch size from normalization and reaching strong zero-shot performance with limited compute.

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

cs.CV · 2023-03-20 · unverdicted · novelty 6.0

MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

Text-Guided Multi-Scale Frequency Representation Adaptation

cs.CV · 2026-05-05 · unverdicted · novelty 5.0

FreqAdapter adapts multimodal models by text-guided multi-scale fine-tuning in the frequency domain, claiming better performance and efficiency than signal-space PEFT methods.

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

cs.CV · 2024-08-09 · unverdicted · novelty 5.0

mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.

Automated Description Generation of Cytologic Findings for Lung Cytological Images Using a Pretrained Vision Model and Dual Text Decoders: Preliminary Study

eess.IV · 2024-03-26 · unverdicted · novelty 5.0

A CNN classifies lung cytology patches as benign or malignant at 100% sensitivity and 96.4% specificity, then routes to one of two Transformer decoders to generate findings text achieving BLEU-4 of 0.828 on 801 images.

citing papers explorer

Showing 28 of 28 citing papers.

Is Dimensionality a Barrier for Retrieval Models? cs.LG · 2026-05-22 · unverdicted · none · ref 113 · internal anchor
Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse query matrices.
WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition cs.CV · 2026-03-10 · unverdicted · none · ref 40 · internal anchor
WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.
HyperCap: Hyperspectral Land Cover Captioning Dataset for Vision Language Models cs.CV · 2025-05-18 · unverdicted · none · ref 49 · internal anchor
HyperCap is the first large-scale hyperspectral captioning dataset built from four benchmark HSI datasets using hybrid automated-manual annotations, with evaluations showing classification gains for vision-language models.
Stitch-a-Demo: Video Demonstrations from Multistep Descriptions cs.CV · 2025-03-18 · unverdicted · none · ref 63 · internal anchor
Stitch-a-Demo is a retrieval-based method that assembles visually coherent video demonstrations from multistep textual descriptions by training on weakly supervised procedural data with hard negatives.
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models cs.CV · 2023-10-23 · unverdicted · none · ref 44 · internal anchor
HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.
Visual Instruction Tuning cs.CV · 2023-04-17 · unverdicted · none · ref 50 · internal anchor
LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention cs.CV · 2023-03-28 · conditional · none · ref 141 · internal anchor
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
Language Is Not All You Need: Aligning Perception with Language Models cs.CL · 2023-02-27 · conditional · none · ref 31 · internal anchor
Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.
PaLI: A Jointly-Scaled Multilingual Language-Image Model cs.CV · 2022-09-14 · conditional · none · ref 80 · internal anchor
PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.
VISOR: A Vision-Language Model-based Test Oracle for Testing Robots cs.SE · 2026-05-11 · unverdicted · none · ref 56 · 2 links · internal anchor
VISOR is a VLM-based automated test oracle that evaluates robot task correctness and quality from videos while reporting its own uncertainty, tested on GPT and Gemini across four tasks and over 1000 videos with Gemini showing higher recall and GPT higher precision but low uncertainty-correctness tie
Embedding Arithmetic: A Lightweight, Tuning-Free Framework for Post-hoc Bias Mitigation in Text-to-Image Models cs.CV · 2026-04-20 · unverdicted · none · ref 41 · internal anchor
Embedding Arithmetic performs vector operations in the embedding space of T2I models to mitigate bias at inference time, outperforming baselines on diversity while preserving coherence via a new Concept Coherence Score.
ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs cs.CV · 2026-04-04 · unverdicted · none · ref 53 · internal anchor
ITIScore evaluates MLLM image captions via image-to-text-to-image reconstruction consistency and aligns with human judgments on a new 40K-caption benchmark.
From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation cs.LG · 2026-03-12 · unverdicted · none · ref 27 · internal anchor
EG-GRPO improves autoregressive text-to-image models by reallocating RL updates according to token entropy, excluding low-entropy tokens from reward signals while adding entropy bonuses to high-entropy ones, yielding state-of-the-art results on standard benchmarks.
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks cs.CV · 2025-07-02 · unverdicted · none · ref 61 · internal anchor
Multimodal foundation models achieve respectable but sub-specialist performance on semantic vision tasks and weaker results on geometric tasks when evaluated through prompt chaining on established benchmarks.
SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation cs.CV · 2024-05-29 · unverdicted · none · ref 71 · internal anchor
SketchDeco performs training-free sketch colourisation via diffusion inversion to insert user colors followed by custom self-attention blending for local fidelity and global harmony.
CogVLM: Visual Expert for Pretrained Language Models cs.CV · 2023-11-06 · conditional · none · ref 22 · internal anchor
CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning cs.CV · 2023-06-26 · accept · none · ref 20 · internal anchor
A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models cs.CV · 2023-06-23 · unverdicted · none · ref 45 · internal anchor
MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models cs.CV · 2023-05-13 · accept · none · ref 14 · internal anchor
OCRBench provides the largest evaluation suite yet for OCR capabilities in large multimodal models, revealing gaps in multilingual, handwritten, and mathematical text handling.
Sigmoid Loss for Language Image Pre-Training cs.CV · 2023-03-27 · conditional · none · ref 49 · internal anchor
SigLIP replaces softmax-based contrastive loss with a simple pairwise sigmoid loss for vision-language pre-training, decoupling batch size from normalization and reaching strong zero-shot performance with limited compute.
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action cs.CV · 2023-03-20 · unverdicted · none · ref 29 · internal anchor
MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
Text-Guided Multi-Scale Frequency Representation Adaptation cs.CV · 2026-05-05 · unverdicted · none · ref 25 · internal anchor
FreqAdapter adapts multimodal models by text-guided multi-scale fine-tuning in the frequency domain, claiming better performance and efficiency than signal-space PEFT methods.
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models cs.CV · 2024-08-09 · unverdicted · none · ref 31 · internal anchor
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
Automated Description Generation of Cytologic Findings for Lung Cytological Images Using a Pretrained Vision Model and Dual Text Decoders: Preliminary Study eess.IV · 2024-03-26 · unverdicted · none · ref 24 · internal anchor
A CNN classifies lung cytology patches as benign or malignant at 100% sensitivity and 96.4% specificity, then routes to one of two Transformer decoders to generate findings text achieving BLEU-4 of 0.828 on 801 images.
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration cs.CL · 2023-11-07 · unverdicted · none · ref 61 · internal anchor
mPLUG-Owl2 presents a modular MLLM architecture that enables modality collaboration via shared functional modules and modality-adaptive components, achieving SOTA on both text and multi-modal tasks with one generic model.
Character-Centered Dialogue Generation from Scene-Level Prompts cs.CV · 2025-05-22 · unverdicted · none · ref 54 · internal anchor
A training-free framework generates expressive, character-grounded dialogue and speech from scene prompts using vision-language encoders, LLMs, and a recursive narrative memory bank for cross-scene consistency.
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models cs.CV · 2024-02-27 · unverdicted · none · ref 82 · internal anchor
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
Let ViT Speak: Generative Language-Image Pre-training cs.CV · 2026-05-01 · unreviewed · ref 69 · internal anchor

GIT: A Generative Image-to-text Transformer for Vision and Language

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer