FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model

Bin Wang; Chunyu Xie; Dawei Leng; Dawei Liang; Fanjing Kong; Ji Ao; Jincheng Li; Yuhui Yin

arxiv: 2510.10921 · v3 · pith:276DMUK5new · submitted 2025-10-13 · 💻 cs.CV · cs.AI · cs.LG

keywords: fine-grained, alignment, bilingual, chinese, fg-clip, vision-language, model, benchmark

Abstract

Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, including a newly released 12M Chinese region-text dataset, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained vision-language alignment.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Evaluating Remote Sensing Image Captions Beyond Metric Biases
cs.CV 2026-04 unverdicted novelty 7.0

Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA pe...

PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset
cs.CV 2026-05 unverdicted novelty 5.0

PixVerve introduces a 95K ultra-high-resolution image-text dataset and training strategies that enable native 100-megapixel text-to-image generation together with a new evaluation benchmark.