Mining fine-grained image-text alignment for zero-shot captioning via text-only training. ArXiv, abs/2401.02347, 2024
2 Pith papers cite this work.
fields: cs.CV (2)
verdicts: UNVERDICTED (2)
citing papers
- WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
  WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.
- NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation
  NoisyGRPO is an RL framework that perturbs visual inputs with Gaussian noise for exploration and computes trajectory advantages via Bayesian posterior fusion of noise prior and reward likelihood to improve multimodal CoT generalization.