OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts: UNVERDICTED 3
roles: background 1
polarities: background 1
representative citing papers
Survey of Music AVQA finds specialized input processing, dedicated spatial-temporal designs, and music-specific modeling are critical for strong performance.
EXAONE 4.5 is a new open-weight multimodal model that matches general benchmarks and outperforms similar-scale models on document understanding and Korean contextual reasoning.
citing papers explorer
- Let ViT Speak: Generative Language-Image Pre-training
  GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.
- Music Audio-Visual Question Answering Requires Specialized Multimodal Designs
  Survey of Music AVQA finds specialized input processing, dedicated spatial-temporal designs, and music-specific modeling are critical for strong performance.
- EXAONE 4.5 Technical Report
  EXAONE 4.5 is a new open-weight multimodal model that matches general benchmarks and outperforms similar-scale models on document understanding and Korean contextual reasoning.