Fg-clip: Fine-grained visual and textual alignment.arXiv preprint arXiv:2505.05071, 2025

Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, Yuhui Yin · 2025 · arXiv 2505.05071

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

read on arXiv browse 9 citing papers

citation-role summary

background 3 baseline 1

citation-polarity summary

background 3 baseline 1

representative citing papers

LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment

cs.CV · 2026-05-04 · unverdicted · novelty 7.0

LAGO achieves state-of-the-art zero-shot performance with fewer image regions by using class-agnostic object discovery followed by confidence-controlled language-guided refinement and dual-channel aggregation.

POMA-3D: The Point Map Way to 3D Scene Understanding

cs.CV · 2025-11-20 · unverdicted · novelty 7.0

POMA-3D learns self-supervised 3D scene representations from point maps and improves performance on geometric 3D tasks including navigation and scene retrieval.

L2P: Unlocking Latent Potential for Pixel Generation

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

Mitigating Multimodal Hallucination via Phase-wise Self-reward

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.

Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding

cs.CV · 2026-04-02 · unverdicted · novelty 6.0

UniScene3D learns unified 3D scene representations from colored pointmaps using contrastive CLIP pretraining plus cross-view geometric and grounded view alignments, achieving state-of-the-art results on viewpoint grounding, scene retrieval, classification, and 3D VQA.

Emu3.5: Native Multimodal Models are World Learners

cs.CV · 2025-10-30 · unverdicted · novelty 6.0

Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.

IdentiFace: Multi-Modal Iterative Diffusion Framework for Identifiable Suspect Face Generation in Crime Investigations

cs.CV · 2026-05-01 · unverdicted · novelty 5.0

IdentiFace is a multi-modal iterative diffusion framework that generates identifiable suspect faces with improved identity retrieval for law enforcement applications.

AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce

cs.CL · 2026-04-22 · unverdicted · novelty 5.0

AFMRL uses MLLM-generated attributes in attribute-guided contrastive learning and retrieval-aware reinforcement to achieve SOTA fine-grained multimodal retrieval on e-commerce datasets.

Attention Grounded Enhancement for Visual Document Retrieval

cs.IR · 2025-11-17 · unverdicted · novelty 5.0

AGREE boosts visual document retrieval by adding local relevance signals from MLLM attention maps to global document labels during retriever training.

citing papers explorer

Showing 9 of 9 citing papers.

LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment cs.CV · 2026-05-04 · unverdicted · none · ref 9
LAGO achieves state-of-the-art zero-shot performance with fewer image regions by using class-agnostic object discovery followed by confidence-controlled language-guided refinement and dual-channel aggregation.
POMA-3D: The Point Map Way to 3D Scene Understanding cs.CV · 2025-11-20 · unverdicted · none · ref 50
POMA-3D learns self-supervised 3D scene representations from point maps and improves performance on geometric 3D tasks including navigation and scene retrieval.
L2P: Unlocking Latent Potential for Pixel Generation cs.CV · 2026-05-12 · unverdicted · none · ref 23
L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
Mitigating Multimodal Hallucination via Phase-wise Self-reward cs.CV · 2026-04-20 · unverdicted · none · ref 53
PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.
Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding cs.CV · 2026-04-02 · unverdicted · none · ref 52
UniScene3D learns unified 3D scene representations from colored pointmaps using contrastive CLIP pretraining plus cross-view geometric and grounded view alignments, achieving state-of-the-art results on viewpoint grounding, scene retrieval, classification, and 3D VQA.
Emu3.5: Native Multimodal Models are World Learners cs.CV · 2025-10-30 · unverdicted · none · ref 112
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.
IdentiFace: Multi-Modal Iterative Diffusion Framework for Identifiable Suspect Face Generation in Crime Investigations cs.CV · 2026-05-01 · unverdicted · none · ref 44
IdentiFace is a multi-modal iterative diffusion framework that generates identifiable suspect faces with improved identity retrieval for law enforcement applications.
AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce cs.CL · 2026-04-22 · unverdicted · none · ref 8
AFMRL uses MLLM-generated attributes in attribute-guided contrastive learning and retrieval-aware reinforcement to achieve SOTA fine-grained multimodal retrieval on e-commerce datasets.
Attention Grounded Enhancement for Visual Document Retrieval cs.IR · 2025-11-17 · unverdicted · none · ref 56
AGREE boosts visual document retrieval by adding local relevance signals from MLLM attention maps to global document labels during retriever training.

Fg-clip: Fine-grained visual and textual alignment.arXiv preprint arXiv:2505.05071, 2025

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer