Vila: On pre-training for vi- sual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, Song Han · 2024

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

browse 2 citing papers

representative citing papers

SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

cs.CV · 2026-03-28 · unverdicted · novelty 6.0

SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.

GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking

cs.CV · 2026-02-19 · unverdicted · novelty 6.0

GraphThinker reduces temporal hallucinations in video reasoning by constructing event-based scene graphs and applying visual attention rewards in reinforcement finetuning.

citing papers explorer

Showing 2 of 2 citing papers.

SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning cs.CV · 2026-03-28 · unverdicted · none · ref 29
SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.
GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking cs.CV · 2026-02-19 · unverdicted · none · ref 41
GraphThinker reduces temporal hallucinations in video reasoning by constructing event-based scene graphs and applying visual attention rewards in reinforcement finetuning.

Vila: On pre-training for vi- sual language models

fields

years

verdicts

representative citing papers

citing papers explorer