ScanNet: Richly-annotated 3D reconstructions of indoor scenes
Baseline reference. 80% of citing Pith papers use this work as a benchmark or comparison.
years: 2026 (11)
verdicts: UNVERDICTED (11)
citing papers explorer
- VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis
  VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.
- TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention
  TurboVGGT uses adaptive sparse global attention with varying sparsity levels across frames and layers plus frame attention to enable faster multi-view 3D reconstruction while keeping competitive quality versus prior state-of-the-art methods.
- RelFlexformer: Efficient Attention 3D-Transformers for Integrable Relative Positional Encodings
  RelFlexformers enable flexible integrable 3D RPE in attention via NU-FFT, generalizing prior methods to heterogeneous token positions with O(L log L) complexity.
- ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
  ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
- Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction
  A closed-form scalar frame-level gate α_t derived from internal feature changes extends effective memory in recurrent 3D reconstruction and improves accuracy on long sequences up to 4541 frames.
- Invaria: Learning Scale and Density Invariance in Point Clouds via Next-Resolution Prediction
  Invaria trains point cloud encoders with next-resolution prediction to learn scale and density invariant features, yielding higher mIoU on ScanNet under lower resolution and scaled objects while using a smaller model.
- SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
  SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
- ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
  ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
- OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention
  OpenGaFF adds a geometry-conditioned Gaussian Feature Field and codebook-guided attention to 3D Gaussian Splatting for spatially consistent open-vocabulary 3D semantic understanding.
- StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression
  StreamCacheVGGT improves streaming 3D geometry reconstruction accuracy and stability under fixed memory by using cross-layer token importance scoring and hybrid cache compression instead of pure eviction.
- HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
  HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claiming open-source SOTA performance.