PAGE-4D: Disentangled Pose and Geometry Estimation for VGGT-4D Perception

2025 · cs.CV · arXiv 2510.17568

4 Pith papers cite this work. Polarity classification is still being indexed.

abstract

Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, because they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feed-forward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, and point cloud reconstruction, all without post-processing. A central challenge in multitask 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics-aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask, which suppresses motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction. Code and additional demos are available at https://page4d.github.io/. Keywords: VGGT-4D, 4D Perception, Dynamic Scene Reconstruction.
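
The mask idea in the abstract can be illustrated with a small sketch. This is not the PAGE-4D implementation: the function names, the additive-bias form, and the `strength` parameter are all assumptions made for illustration. The sketch only shows one way a single per-token dynamics mask could push attention away from moving regions for a pose branch and toward them for a geometry branch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(logits, dynamic_mask, strength, branch):
    """Bias attention logits with a dynamics mask (hypothetical helper).

    logits       -- raw attention scores, one per token
    dynamic_mask -- values in [0, 1]; 1 means "likely moving"
    strength     -- non-negative scale of the bias (assumed hyperparameter)
    branch       -- "pose" suppresses dynamic tokens, "geometry" amplifies them
    """
    if branch not in ("pose", "geometry"):
        raise ValueError(f"unknown branch: {branch!r}")
    # Same mask, opposite sign: this is the assumed form of the disentanglement.
    sign = -1.0 if branch == "pose" else 1.0
    biased = [x + sign * strength * m for x, m in zip(logits, dynamic_mask)]
    return softmax(biased)


# Four tokens with equal raw scores; the last two lie on a moving object.
logits = [0.0, 0.0, 0.0, 0.0]
dynamic_mask = [0.0, 0.0, 1.0, 1.0]

pose_weights = masked_attention(logits, dynamic_mask, 2.0, "pose")
geometry_weights = masked_attention(logits, dynamic_mask, 2.0, "geometry")
```

With equal raw scores, the pose branch assigns less weight to the two dynamic tokens than to the static ones, and the geometry branch does the opposite. In a learned model the mask would be predicted rather than hand-set as it is here.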

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Foreground-Complete 4D Reconstruction

cs.CV · 2026-01-26 · unverdicted · novelty 7.0

FreeOrbit4D recovers a foreground-complete 4D proxy via decoupled background and object-centric reconstruction to provide geometric guidance for large-angle camera redirection in monocular videos using conditional video diffusion.

4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

A training-free progressive decoupling framework improves dynamic depth estimation in 4D reconstruction via mask-guided pose decoupling, topological subspace surgery, and Bayesian fusion, yielding better point-cloud metrics on benchmarks.

GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

GEM-4D is a video world model that injects 4D correspondence supervision to improve geometric consistency and robot manipulation success from 61% to 81%.

GeoWorld-VLM: Geometry from World Models for Vision-Language Models

cs.CV · 2026-05-15 · unverdicted · novelty 5.0

GeoWorld-VLM distills geometric structure from camera-conditioned world models into VLMs by aligning visual features, improving spatial reasoning by about 4% on What'sUp and VSR benchmarks across two architectures while preserving language capabilities.

citing papers explorer

Showing 4 of 4 citing papers.

FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Foreground-Complete 4D Reconstruction · cs.CV · 2026-01-26 · unverdicted · none · ref 54 · internal anchor
4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation · cs.CV · 2026-05-12 · unverdicted · none · ref 30 · internal anchor
GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation · cs.CV · 2026-05-20 · unverdicted · none · ref 72 · internal anchor
GeoWorld-VLM: Geometry from World Models for Vision-Language Models · cs.CV · 2026-05-15 · unverdicted · none · ref 56 · internal anchor
