Meta CLIP 2: A Worldwide Scaling Recipe

Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, et al. · 2025 · arXiv 2507.22062

13 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background 3 baseline 1

citation-polarity summary

background 3 baseline 1

representative citing papers

FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries

cs.MM · 2026-05-11 · unverdicted · novelty 7.0

FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.

When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models

cs.CV · 2026-03-29 · unverdicted · novelty 7.0

A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.

PowerCLIP: Powerset Alignment for Contrastive Pre-Training

cs.CV · 2025-11-28 · conditional · novelty 7.0

PowerCLIP improves CLIP-style models by exhaustively aligning powersets of image regions to textual parse trees via efficient non-linear aggregators that approximate the full combinatorial loss.

CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

CRAFT adapts diffusion models to medical images via clinical reward alignment from LLMs and VLMs, improving alignment scores and cutting low-quality generations by 20.4% on average across modalities.

SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

SearchAD is a large-scale semantic image retrieval benchmark for rare driving scenarios that supports text-to-image and image-to-image tasks and shows text-based methods outperform image-based ones while overall performance stays limited.

Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models

cs.CV · 2026-02-02 · conditional · novelty 6.0

Frozen features from vision foundation models enable a linear probe to outperform specialized AIGI detectors by over 30% on in-the-wild data due to emergent forgery knowledge from pre-training.

GRAPE: Let GRPO Supervise Query Rewriting by Ranking for Retrieval

cs.CV · 2025-09-27 · conditional · novelty 6.0

GRAPE applies GRPO to an LLM query rewriter with a corpus-relative ranking reward to improve frozen CLIP retrieval by an average 4.9% Recall@10 on shifted benchmarks without retraining or re-embedding.

MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction

cs.IR · 2025-09-22 · unverdicted · novelty 6.0

MetaEmbed trains fixed learnable Meta Tokens to produce granularity-organized multi-vector embeddings that support test-time scaling in multimodal retrieval.

Sapiens2

cs.CV · 2026-04-23 · unverdicted · novelty 5.0

Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and albedo estimation.

Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding

cs.LG · 2026-04-14 · unverdicted · novelty 5.0

Using lexical concreteness to guide contrastive negative mining and a new margin-based Cement loss, the Slipform framework reaches state-of-the-art on compositional benchmarks for vision-language models.

LOGER: Local–Global Ensemble for Robust Deepfake Detection in the Wild

cs.CV · 2026-04-04 · unverdicted · novelty 5.0

LOGER ensembles heterogeneous global vision models with selective local patch aggregation via multiple instance learning to achieve robust deepfake detection across varied manipulations and degradations.

Boosting Robust AIGI Detection with LoRA-based Pairwise Training

cs.CV · 2026-04-14 · unverdicted · novelty 4.0

LoRA-based pairwise training with distortion and size simulations boosts robust AIGI detection under severe distortions, placing third in the NTIRE challenge.

HEDGE: Heterogeneous Ensemble for Detection of AI-GEnerated Images in the Wild

cs.CV · 2026-04-04 · unverdicted · novelty 4.0

HEDGE is a heterogeneous ensemble using progressive DINOv3 training, multi-scale features, and MetaCLIP2 diversity with dual-gating fusion to achieve robust AI-generated image detection and 4th place in the NTIRE 2026 challenge.

citing papers explorer

Showing 13 of 13 citing papers.

FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries · cs.MM · 2026-05-11 · unverdicted · none · ref 8
When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models · cs.CV · 2026-03-29 · unverdicted · none · ref 12
PowerCLIP: Powerset Alignment for Contrastive Pre-Training · cs.CV · 2025-11-28 · conditional · none · ref 9
CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis · cs.CV · 2026-05-12 · unverdicted · none · ref 5
SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving · cs.CV · 2026-04-09 · unverdicted · none · ref 9
Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models · cs.CV · 2026-02-02 · conditional · none · ref 8
GRAPE: Let GRPO Supervise Query Rewriting by Ranking for Retrieval · cs.CV · 2025-09-27 · conditional · none · ref 2
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction · cs.IR · 2025-09-22 · unverdicted · none · ref 9
Sapiens2 · cs.CV · 2026-04-23 · unverdicted · none · ref 10
Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding · cs.LG · 2026-04-14 · unverdicted · none · ref 7
LOGER: Local–Global Ensemble for Robust Deepfake Detection in the Wild · cs.CV · 2026-04-04 · unverdicted · none · ref 8
Boosting Robust AIGI Detection with LoRA-based Pairwise Training · cs.CV · 2026-04-14 · unverdicted · none · ref 3
HEDGE: Heterogeneous Ensemble for Detection of AI-GEnerated Images in the Wild · cs.CV · 2026-04-04 · unverdicted · none · ref 5