Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
3 Pith papers cite this work.
fields: cs.CV (3)
citing papers explorer
- PaLI: A Jointly-Scaled Multilingual Language-Image Model
  PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.
- LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
  LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.
- Online Hand Gesture Recognition Using 3D Convolutional Neural Networks
  Proposes an online hand gesture recognition system using 3D CNNs achieving 98%+ detector accuracy and 90%+ classifier accuracy on Jester, with 37.5% Levenshtein accuracy on a homemade dataset.