Joint Optimization for 4D Human-Scene Reconstruction in the Wild

Bolei Zhou; Joe Lin; Wayne Wu; Zhizheng Liu

arxiv: 2501.02158 · v3 · pith:KOALJZDOnew · submitted 2025-01-04 · 💻 cs.CV

Zhizheng Liu, Joe Lin, Wayne Wu, Bolei Zhou

keywords: human, scene, human-scene, motion, josh, reconstruction, videos, dense

Reconstructing human motion and its surrounding environment is crucial for understanding human-scene interaction and predicting human movements in the scene. While much progress has been made in capturing human-scene interaction in constrained environments, prior methods can hardly reconstruct natural and diverse human motion and scene context from web videos. In this work, we propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos. JOSH uses techniques from both dense scene reconstruction and human mesh recovery as initialization, then leverages human-scene contact constraints to jointly optimize the scene, the camera poses, and the human motion. Experimental results show that JOSH achieves better results on both global human motion estimation and dense scene reconstruction through joint optimization of scene geometry and human motion. We further design a more efficient model, JOSH3R, and train it directly with pseudo-labels from web videos. JOSH3R outperforms other optimization-free methods while training only with labels predicted by JOSH, further demonstrating its accuracy and generalization ability.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos
cs.CV 2026-06 unverdicted novelty 7.0

TROPHIES introduces a unified framework for human-scene-camera reconstruction from multi-view videos, achieving globally aligned and physically plausible 4D outputs on EgoHuman and EgoExo4D.

GRAFT: Geometric Refinement and Fitting Transformer for Human Scene Reconstruction
cs.CV 2026-04 unverdicted novelty 6.0

GRAFT amortizes human-scene fitting into a recurrent transformer that predicts interaction gradients via body-anchored geometric probes, delivering optimization-level interaction quality at 50x lower runtime.

RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos
cs.CV 2026-05 unverdicted novelty 5.0

RHINO recovers 3D human, novel manipulated object, and static scene from monocular video by stabilizing SfM with foundation models, separating motions, and refining with compositional neural SDFs plus contact priors.

Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos
cs.CV 2026-05 unverdicted novelty 4.0

HTD-Refine uses a temporal transformer (PVA-Net) to predict high-order dynamics and refines HMR outputs via optimization for more natural motion.