FILIP: Fine-grained Interactive Language-Image Pre-Training
22 Pith papers cite this work. Polarity classification is still indexing.
roles: background 4
citing papers explorer
- Neutral-Reference Prompting for Vision-Language Models
  NeRP corrects asymmetric class confusion in VLMs for unseen classes by combining neutral-prompt priors with sample likelihood to flip predictions on confusable pairs, improving new-class accuracy while preserving base-class performance.
- Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection
  Thermal-Det is the first LLM-supervised open-vocabulary thermal object detector, created via synthetic data conversion from GroundingCap-1M and RGB-to-thermal distillation, yielding 2-4% AP gains on benchmarks.
- WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
  WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.
- Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping
  A contrastive multimodal framework augments satellite-audio datasets with vision-language model sound descriptions to learn shared soundscape concepts for zero-shot retrieval and synthesis.
- VideoChat: Chat-Centric Video Understanding
  VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
- Flamingo: a Visual Language Model for Few-Shot Learning
  Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
- See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
  SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
- Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
  CPT creates cluster-invariant spaces from pre-trained VLM semantics and applies neural collapse losses to boost long-tail performance and unseen-class generalization in prompt tuning.
- Zero-Shot Chinese Character Recognition via Global-Local Dual-Branch Alignment and Hierarchical Inference
  GL-HPN combines global vector matching for fast recall with local patch-token alignment and structure filtering to improve zero-shot Chinese character recognition while cutting large-scale inference cost.
- Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning
  IPL alternates discrete semantic token selection using approximate submodular optimization with continuous prompt optimization to boost both interpretability and task performance in vision-language model adaptation.
- G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval
  G-MIXER achieves state-of-the-art zero-shot composed image retrieval by using geodesic mixup to build diverse implicit candidates and MLLM-derived explicit semantics for re-ranking.
- MApLe: Multi-instance Alignment of Diagnostic Reports and Large Medical Images
  MApLe disentangles anatomy and pathology to align free-text diagnostic sentences with specific patches in large medical images via multi-instance learning.
- On the Provable Importance of Gradients for Language-Assisted Image Clustering
  GradNorm selects positive nouns via gradient magnitudes from cross-entropy loss, with an error bound proving it subsumes prior CLIP methods and delivers SOTA clustering results.
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
  InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.
- CoCa: Contrastive Captioners are Image-Text Foundation Models
  CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
- Florence: A New Foundation Model for Computer Vision
  Florence is a new vision foundation model that learns universal visual-language representations from web-scale data and reports state-of-the-art results on 44 benchmarks including 83.74% zero-shot ImageNet top-1 accuracy.
- Look Beyond Saliency: Low-Attention Guided Dual Encoding for Video Semantic Search
  Inverse attention embeddings combined with standard visual features improve recall in video semantic search for crowded scenes without additional training.
- Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
  Dual-encoder VLMs gain robust compositional generalization by learning localized alignments from frozen patch and token embeddings instead of using global similarity.
- Attention Grounded Enhancement for Visual Document Retrieval
  AGREE boosts visual document retrieval by adding local relevance signals from MLLM attention maps to global document labels during retriever training.
- LPT: Less-overfitting Prompt Tuning for Vision-Language Model
  LPT reduces overfitting during prompt tuning of VLMs by CLIP-based foreground filtering, a structural preservation constraint aligning features to frozen CLIP, and a hierarchical logit constraint at the output, improving generalization on base-to-novel, cross-dataset, and domain-generalization tasks.
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning
  InternVideo combines masked video modeling and video-language contrastive learning into a single foundation model that reaches state-of-the-art results on 39 video datasets including 91.1% top-1 on Kinetics-400.
- DetailCLIP: Injecting Image Details into CLIP's Feature Space
  A patch-based fusion method extends CLIP to high-resolution images by retaining multi-scale details for improved class-prompted retrieval.