The paper defines the 4DVLT task for worldline-centered 4D scene understanding, releases Instruct-4D with 129.4K QA pairs, and presents 4DTrack achieving 62.68 TGA_Top1, outperforming adapted baselines by 19.62 points.
MVGGT: Multimodal visual geometry grounded transformer for multiview 3D referring expression segmentation
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
fields
cs.CV 3years
2026 3roles
background 1polarities
background 1representative citing papers
AFFORDMEM improves AP50 by 3.23-3.7 points on SceneFun3D splits by using a reusable cross-scene affordance memory bank and in-scene spatial memory to guide VLMs toward actionable 3D regions.
TrackRef3D proposes a fully automatic multi-view consistent track-then-label method for open-world referring segmentation in 3D Gaussian Splatting using TSCM, visibility-aware descriptions, and hybrid contrastive training.
citing papers explorer
-
Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances
AFFORDMEM improves AP50 by 3.23-3.7 points on SceneFun3D splits by using a reusable cross-scene affordance memory bank and in-scene spatial memory to guide VLMs toward actionable 3D regions.