Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
4 Pith papers cite this work.
citation-role summary
roles: method 1
citation-polarity summary
polarities: use method 1
representative citing papers
Introduces the BAH dataset with 1,427 annotated videos for multimodal recognition of ambivalence/hesitancy in digital behavior change contexts.
Evidence for cross-modal representational convergence weakens substantially at scale and in realistic many-to-many settings, indicating models learn rich but distinct representations.
A new Triton kernel for dispatch-aware ragged attention delivers 1.88-2.51× end-to-end throughput gains over standard padded attention and 9-12% over FlashAttention-2 varlen in pruned ViTs by lowering dispatch floor to ~24μs.
citing papers explorer
- Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
  Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
- BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Digital Behavioural Change
  Introduces the BAH dataset with 1,427 annotated videos for multimodal recognition of ambivalence/hesitancy in digital behavior change contexts.
- Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale
  Evidence for cross-modal representational convergence weakens substantially at scale and in realistic many-to-many settings, indicating models learn rich but distinct representations.
- Dispatch-Aware Ragged Attention for Pruned Vision Transformers
  A new Triton kernel for dispatch-aware ragged attention delivers 1.88-2.51× end-to-end throughput gains over standard padded attention and 9-12% over FlashAttention-2 varlen in pruned ViTs by lowering dispatch floor to ~24μs.