Coarse-to-fine vision-language pre-training with fusion in the backbone

Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng, Jianfeng Gao, Lijuan Wang · 2022 · arXiv 2206.07643

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

read on arXiv browse 2 citing papers

representative citing papers

SAM 3: Segment Anything with Concepts

cs.CV · 2025-11-20 · unverdicted · novelty 7.0

SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.

Kosmos-2: Grounding Multimodal Large Language Models to the World

cs.CL · 2023-06-26 · unverdicted · novelty 6.0

Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.

citing papers explorer

Showing 2 of 2 citing papers.

SAM 3: Segment Anything with Concepts cs.CV · 2025-11-20 · unverdicted · none · ref 30
SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
Kosmos-2: Grounding Multimodal Large Language Models to the World cs.CL · 2023-06-26 · unverdicted · none · ref 4
Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.

Coarse-to-fine vision-language pre-training with fusion in the backbone

fields

years

verdicts

representative citing papers

citing papers explorer