LAGO achieves state-of-the-art zero-shot performance with fewer image regions by using class-agnostic object discovery followed by confidence-controlled language-guided refinement and dual-channel aggregation.
Fg-clip: Fine-grained visual and textual alignment.arXiv preprint arXiv:2505.05071, 2025
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 9representative citing papers
POMA-3D learns self-supervised 3D scene representations from point maps and improves performance on geometric 3D tasks including navigation and scene retrieval.
L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.
UniScene3D learns unified 3D scene representations from colored pointmaps using contrastive CLIP pretraining plus cross-view geometric and grounded view alignments, achieving state-of-the-art results on viewpoint grounding, scene retrieval, classification, and 3D VQA.
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.
IdentiFace is a multi-modal iterative diffusion framework that generates identifiable suspect faces with improved identity retrieval for law enforcement applications.
AFMRL uses MLLM-generated attributes in attribute-guided contrastive learning and retrieval-aware reinforcement to achieve SOTA fine-grained multimodal retrieval on e-commerce datasets.
AGREE boosts visual document retrieval by adding local relevance signals from MLLM attention maps to global document labels during retriever training.
citing papers explorer
-
LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment
LAGO achieves state-of-the-art zero-shot performance with fewer image regions by using class-agnostic object discovery followed by confidence-controlled language-guided refinement and dual-channel aggregation.
-
POMA-3D: The Point Map Way to 3D Scene Understanding
POMA-3D learns self-supervised 3D scene representations from point maps and improves performance on geometric 3D tasks including navigation and scene retrieval.
-
L2P: Unlocking Latent Potential for Pixel Generation
L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
-
Mitigating Multimodal Hallucination via Phase-wise Self-reward
PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.
-
Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding
UniScene3D learns unified 3D scene representations from colored pointmaps using contrastive CLIP pretraining plus cross-view geometric and grounded view alignments, achieving state-of-the-art results on viewpoint grounding, scene retrieval, classification, and 3D VQA.
-
Emu3.5: Native Multimodal Models are World Learners
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.
-
IdentiFace: Multi-Modal Iterative Diffusion Framework for Identifiable Suspect Face Generation in Crime Investigations
IdentiFace is a multi-modal iterative diffusion framework that generates identifiable suspect faces with improved identity retrieval for law enforcement applications.
-
AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce
AFMRL uses MLLM-generated attributes in attribute-guided contrastive learning and retrieval-aware reinforcement to achieve SOTA fine-grained multimodal retrieval on e-commerce datasets.
-
Attention Grounded Enhancement for Visual Document Retrieval
AGREE boosts visual document retrieval by adding local relevance signals from MLLM attention maps to global document labels during retriever training.