Distilling view-consistent future views and action-outcome supervision from a generative world model into a VLM via two-stage post-training improves dynamic spatial reasoning on SAT-Real, VSI-Bench and similar benchmarks while avoiding test-time world-model cost.
Spatialladder: Progressive training for spatial reasoning in vision-language models
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CV 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning
Distilling view-consistent future views and action-outcome supervision from a generative world model into a VLM via two-stage post-training improves dynamic spatial reasoning on SAT-Real, VSI-Bench and similar benchmarks while avoiding test-time world-model cost.