Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling
2 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: baseline 1
Citation polarities: baseline 1
Representative citing papers
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
- SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning