pith. sign in

International Conference on Machine Learning , pages=

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

fields

cs.CV 4 cs.CL 1

years

2026 3 2023 2

representative citing papers

Demystifying CLIP Data

cs.CV · 2023-09-28 · accept · novelty 6.0

MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

Vision Transformers Need Registers

cs.CV · 2023-09-28 · unverdicted · novelty 6.0

Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

citing papers explorer

Showing 5 of 5 citing papers.

  • SHED: Style-Homogenized Embedding Alignment for Domain Generalization cs.CV · 2026-05-16 · conditional · none · ref 30

    SHED improves domain generalization in CLIP by aligning style-homogenized embeddings instead of raw ones, achieving state-of-the-art results on five benchmarks including a 4% gain on DomainNet.

  • Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation cs.CL · 2026-05-13 · unverdicted · none · ref 17

    Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a training-free surrogate framework that outperforms baselines.

  • Demystifying CLIP Data cs.CV · 2023-09-28 · accept · none · ref 71

    MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

  • Vision Transformers Need Registers cs.CV · 2023-09-28 · unverdicted · none · ref 272

    Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

  • ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting cs.CV · 2026-04-20 · unverdicted · none · ref 29

    ESsEN is a parameter-efficient two-tower vision-language transformer that matches larger models on discriminative tasks after training end-to-end with limited data and resources.