FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.
hub
Meta clip 2: A worldwide scaling recipe
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.
PowerCLIP improves CLIP-style models by exhaustively aligning powersets of image regions to textual parse trees via efficient non-linear aggregators that approximate the full combinatorial loss.
CRAFT adapts diffusion models to medical images via clinical reward alignment from LLMs and VLMs, improving alignment scores and cutting low-quality generations by 20.4% on average across modalities.
SearchAD is a large-scale semantic image retrieval benchmark for rare driving scenarios that supports text-to-image and image-to-image tasks and shows text-based methods outperform image-based ones while overall performance stays limited.
Frozen features from vision foundation models enable a linear probe to outperform specialized AIGI detectors by over 30% on in-the-wild data due to emergent forgery knowledge from pre-training.
GRAPE applies GRPO to an LLM query rewriter with a corpus-relative ranking reward to improve frozen CLIP retrieval by an average 4.9% Recall@10 on shifted benchmarks without retraining or re-embedding.
MetaEmbed trains fixed learnable Meta Tokens to produce granularity-organized multi-vector embeddings that support test-time scaling in multimodal retrieval.
Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and albedo estimation.
Using lexical concreteness to guide contrastive negative mining and a new margin-based Cement loss, the Slipform framework reaches state-of-the-art on compositional benchmarks for vision-language models.
LOGER ensembles heterogeneous global vision models with selective local patch aggregation via multiple instance learning to achieve robust deepfake detection across varied manipulations and degradations.
LoRA-based pairwise training with distortion and size simulations boosts robust AIGI detection under severe distortions, placing third in the NTIRE challenge.
HEDGE is a heterogeneous ensemble using progressive DINOv3 training, multi-scale features, and MetaCLIP2 diversity with dual-gating fusion to achieve robust AI-generated image detection and 4th place in the NTIRE 2026 challenge.
citing papers explorer
-
FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries
FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.
-
When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.
-
PowerCLIP: Powerset Alignment for Contrastive Pre-Training
PowerCLIP improves CLIP-style models by exhaustively aligning powersets of image regions to textual parse trees via efficient non-linear aggregators that approximate the full combinatorial loss.
-
CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis
CRAFT adapts diffusion models to medical images via clinical reward alignment from LLMs and VLMs, improving alignment scores and cutting low-quality generations by 20.4% on average across modalities.
-
SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving
SearchAD is a large-scale semantic image retrieval benchmark for rare driving scenarios that supports text-to-image and image-to-image tasks and shows text-based methods outperform image-based ones while overall performance stays limited.
-
Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models
Frozen features from vision foundation models enable a linear probe to outperform specialized AIGI detectors by over 30% on in-the-wild data due to emergent forgery knowledge from pre-training.
-
GRAPE: Let GRPO Supervise Query Rewriting by Ranking for Retrieval
GRAPE applies GRPO to an LLM query rewriter with a corpus-relative ranking reward to improve frozen CLIP retrieval by an average 4.9% Recall@10 on shifted benchmarks without retraining or re-embedding.
-
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
MetaEmbed trains fixed learnable Meta Tokens to produce granularity-organized multi-vector embeddings that support test-time scaling in multimodal retrieval.
-
Sapiens2
Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and albedo estimation.
-
Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding
Using lexical concreteness to guide contrastive negative mining and a new margin-based Cement loss, the Slipform framework reaches state-of-the-art on compositional benchmarks for vision-language models.
-
LOGER: Local--Global Ensemble for Robust Deepfake Detection in the Wild
LOGER ensembles heterogeneous global vision models with selective local patch aggregation via multiple instance learning to achieve robust deepfake detection across varied manipulations and degradations.
-
Boosting Robust AIGI Detection with LoRA-based Pairwise Training
LoRA-based pairwise training with distortion and size simulations boosts robust AIGI detection under severe distortions, placing third in the NTIRE challenge.
-
HEDGE: Heterogeneous Ensemble for Detection of AI-GEnerated Images in the Wild
HEDGE is a heterogeneous ensemble using progressive DINOv3 training, multi-scale features, and MetaCLIP2 diversity with dual-gating fusion to achieve robust AI-generated image detection and 4th place in the NTIRE 2026 challenge.