An image is worth 16x16 words: Transformers for image recognition at scale
4 Pith papers cite this work.
fields
cs.CV 4
representative citing papers
VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.
PartCo improves generalized category discovery by incorporating part-level correspondence priors that capture finer semantic structures and integrate with existing GCD methods.
citing papers explorer
- Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
  Franca introduces nested Matryoshka clustering and positional disentanglement in a transparent SSL pipeline to deliver open-source vision models competitive with closed proprietary systems.
- MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
  VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
  InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.
- PartCo: Part-Level Correspondence Priors Enhance Category Discovery
  PartCo improves generalized category discovery by incorporating part-level correspondence priors that capture finer semantic structures and integrate with existing GCD methods.