Smir: Efficient synthetic data pipeline to improve multi-image reasoning
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary
fields: cs.CV 2
roles: background 1
polarities: background 1
representative citing papers
Visual-TableQA is a new open-domain benchmark of rendered table images and complex QA pairs created via multi-LLM collaborative generation, with fine-tuned models showing robust generalization to external tests.
citing papers explorer
- Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
  Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
- Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images
  Visual-TableQA is a new open-domain benchmark of rendered table images and complex QA pairs created via multi-LLM collaborative generation, with fine-tuned models showing robust generalization to external tests.