FILIP: Fine-grained Interactive Language-Image Pre-Training

Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, Chunjing Xu · 2021 · arXiv 2111.07783

22 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 4

citation-polarity summary

background 3 unclear 1

representative citing papers

Neutral-Reference Prompting for Vision-Language Models

cs.CV · 2026-05-15 · unverdicted · novelty 7.0

NeRP corrects asymmetric class confusion in VLMs for unseen classes by combining neutral-prompt priors with sample likelihood to flip predictions on confusable pairs, improving new-class accuracy while preserving base-class performance.

Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Thermal-Det is the first LLM-supervised open-vocabulary thermal object detector, created via synthetic data conversion from GroundingCap-1M and RGB-to-thermal distillation, yielding 2-4% AP gains on benchmarks.

WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

cs.CV · 2026-03-10 · unverdicted · novelty 7.0

WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.

Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping

cs.CV · 2025-05-19 · unverdicted · novelty 7.0

A contrastive multimodal framework augments satellite-audio datasets with vision-language model sound descriptions to learn shared soundscape concepts for zero-shot retrieval and synthesis.

VideoChat: Chat-Centric Video Understanding

cs.CV · 2023-05-10 · conditional · novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.

Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

CPT creates cluster-invariant spaces from pre-trained VLM semantics and applies neural collapse losses to boost long-tail performance and unseen-class generalization in prompt tuning.

Zero-Shot Chinese Character Recognition via Global-Local Dual-Branch Alignment and Hierarchical Inference

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

GL-HPN combines global vector matching for fast recall with local patch-token alignment and structure filtering to improve zero-shot Chinese character recognition while cutting large-scale inference cost.

Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

IPL alternates discrete semantic token selection using approximate submodular optimization with continuous prompt optimization to boost both interpretability and task performance in vision-language model adaptation.

G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval

cs.CV · 2026-04-16 · unverdicted · novelty 6.0

G-MIXER achieves state-of-the-art zero-shot composed image retrieval by using geodesic mixup to build diverse implicit candidates and MLLM-derived explicit semantics for re-ranking.

MApLe: Multi-instance Alignment of Diagnostic Reports and Large Medical Images

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

MApLe disentangles anatomy and pathology to align free-text diagnostic sentences with specific patches in large medical images via multi-instance learning.

On the Provable Importance of Gradients for Language-Assisted Image Clustering

cs.CV · 2025-10-18 · unverdicted · novelty 6.0

GradNorm selects positive nouns via gradient magnitudes from cross-entropy loss, with an error bound proving it subsumes prior CLIP methods and delivers SOTA clustering results.

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

cs.CV · 2023-07-13 · unverdicted · novelty 6.0

InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.

CoCa: Contrastive Captioners are Image-Text Foundation Models

cs.CV · 2022-05-04 · accept · novelty 6.0

CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.

Florence: A New Foundation Model for Computer Vision

cs.CV · 2021-11-22 · unverdicted · novelty 6.0

Florence is a new vision foundation model that learns universal visual-language representations from web-scale data and reports state-of-the-art results on 44 benchmarks including 83.74% zero-shot ImageNet top-1 accuracy.

Look Beyond Saliency: Low-Attention Guided Dual Encoding for Video Semantic Search

cs.CV · 2026-05-07 · unverdicted · novelty 5.0

Inverse attention embeddings combined with standard visual features improve recall in video semantic search for crowded scenes without additional training.

Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference

cs.CV · 2026-04-13 · unverdicted · novelty 5.0

Dual-encoder VLMs gain robust compositional generalization by learning localized alignments from frozen patch and token embeddings instead of using global similarity.

Attention Grounded Enhancement for Visual Document Retrieval

cs.IR · 2025-11-17 · unverdicted · novelty 5.0

AGREE boosts visual document retrieval by adding local relevance signals from MLLM attention maps to global document labels during retriever training.

LPT: Less-overfitting Prompt Tuning for Vision-Language Model

cs.CV · 2024-10-14 · unverdicted · novelty 5.0

LPT reduces overfitting during prompt tuning of VLMs by CLIP-based foreground filtering, a structural preservation constraint aligning features to frozen CLIP, and a hierarchical logit constraint at the output, improving generalization on base-to-novel, cross-dataset, and domain-generalization tasks.

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

cs.CV · 2022-12-06 · unverdicted · novelty 5.0

InternVideo combines masked video modeling and video-language contrastive learning into a single foundation model that reaches state-of-the-art results on 39 video datasets including 91.1% top-1 on Kinetics-400.

DetailCLIP: Injecting Image Details into CLIP's Feature Space

cs.CV · 2022-08-31 · unverdicted · novelty 5.0

A patch-based fusion method extends CLIP to high-resolution images by retaining multi-scale details for improved class-prompted retrieval.

citing papers explorer

Showing 22 of 22 citing papers.

Neutral-Reference Prompting for Vision-Language Models · cs.CV · 2026-05-15 · unverdicted · none · ref 15
Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection · cs.CV · 2026-05-11 · unverdicted · none · ref 42
WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition · cs.CV · 2026-03-10 · unverdicted · none · ref 44
Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping · cs.CV · 2025-05-19 · unverdicted · none · ref 44
VideoChat: Chat-Centric Video Understanding · cs.CV · 2023-05-10 · conditional · none · ref 51
Flamingo: a Visual Language Model for Few-Shot Learning · cs.CV · 2022-04-29 · unverdicted · none · ref 139
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding · cs.CV · 2026-05-18 · unverdicted · none · ref 83
Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models · cs.CV · 2026-05-12 · unverdicted · none · ref 50
Zero-Shot Chinese Character Recognition via Global-Local Dual-Branch Alignment and Hierarchical Inference · cs.CV · 2026-05-09 · unverdicted · none · ref 16
Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning · cs.CV · 2026-05-06 · unverdicted · none · ref 17
G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval · cs.CV · 2026-04-16 · unverdicted · none · ref 38
MApLe: Multi-instance Alignment of Diagnostic Reports and Large Medical Images · cs.CV · 2026-04-15 · unverdicted · none · ref 5
On the Provable Importance of Gradients for Language-Assisted Image Clustering · cs.CV · 2025-10-18 · unverdicted · none · ref 14
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation · cs.CV · 2023-07-13 · unverdicted · none · ref 29
CoCa: Contrastive Captioners are Image-Text Foundation Models · cs.CV · 2022-05-04 · accept · none · ref 61
Florence: A New Foundation Model for Computer Vision · cs.CV · 2021-11-22 · unverdicted · none · ref 24
Look Beyond Saliency: Low-Attention Guided Dual Encoding for Video Semantic Search · cs.CV · 2026-05-07 · unverdicted · none · ref 13
Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference · cs.CV · 2026-04-13 · unverdicted · none · ref 10
Attention Grounded Enhancement for Visual Document Retrieval · cs.IR · 2025-11-17 · unverdicted · none · ref 58
LPT: Less-overfitting Prompt Tuning for Vision-Language Model · cs.CV · 2024-10-14 · unverdicted · none · ref 40
InternVideo: General Video Foundation Models via Generative and Discriminative Learning · cs.CV · 2022-12-06 · unverdicted · none · ref 51
DetailCLIP: Injecting Image Details into CLIP's Feature Space · cs.CV · 2022-08-31 · unverdicted · none · ref 29
