Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models. arXiv e-prints, pages arXiv–2409
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CV (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- $\Delta$ynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos
A vision-language framework generates text-based rigid-body scene configurations from videos using motion reasoning and optical flow, reporting 0.30 IoU on CLEVRER (7x over baselines) and transfer to 235 real videos.
- ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
ViSurf unifies SFT and RLVR for LVLMs in one training stage by injecting ground-truth labels into rollouts and applying novel reward controls, outperforming standalone and two-stage baselines on diverse benchmarks.