In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20406–20417
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: unverdicted (2)
Citing papers
- $\Delta$ynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos
  A vision-language framework generates text-based rigid-body scene configurations from videos using motion reasoning and optical flow, reporting 0.30 IoU on CLEVRER (7x over baselines) and transfer to 235 real videos.
- Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
  LLMs given symbolic image descriptions reach mid-90s accuracy on abstract visual reasoning tasks where end-to-end VLMs stay near chance, showing representation as the primary bottleneck.