Grounding Image Matching in 3D with MASt3R
21 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Geo-Align: Video Generation Alignment via Metric Geometry Reward
  Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
- GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction
  GenRecon lifts object-level generative priors to scene-scale reconstruction by chunking scenes and using projection-based conditioning on multi-view features, claiming 16% better results than prior methods.
- No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos
  NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.
- EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control
  EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.
- CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation
  CRePE supplies depth-aware positional distributions along curved rays for stable unified-camera control in frozen video DiT models.
- WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images
  WildSplatter jointly learns 3D Gaussians and appearance embeddings from unconstrained photo collections to enable fast feed-forward reconstruction and flexible lighting control in 3D Gaussian Splatting.
- LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation
  A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.
- Affostruction: 3D Affordance Grounding with Generative Reconstruction
  Affostruction reconstructs full 3D object geometry from partial RGBD views and grounds text-based affordances on both visible and unobserved surfaces, reporting large gains over prior methods.
- A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features
  FastForward represents scenes as collections of 3D-anchored image features and performs camera pose estimation via feed-forward correspondence prediction, achieving competitive accuracy with minimal mapping time.
- SADGE: Structure and Appearance Domain Gap Estimation of Synthetic and Real Data
  SADGE is a new fused similarity metric combining DINOv3 appearance and MASt3R geometry via constrained bilinear interaction that correlates with downstream synthetic-to-real performance at Pearson r=0.88 across multiple benchmarks.
- SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
  SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
- FluSplat: Sparse-View 3D Editing without Test-Time Optimization
  FluSplat trains a model with geometric alignment constraints on multi-view edits to produce consistent 3D scene edits from sparse views in a single forward pass without test-time optimization.
- VGGT-HPE: Reframing Head Pose Estimation as Relative Pose Prediction
  Reframing head pose estimation as relative pose prediction between image pairs enables a synthetic-only trained model to outperform absolute regression methods on real benchmarks.
- Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models
  A feed-forward video latent transformer that predicts time-varying 3D Gaussian primitives from one image to produce controllable 4D scenes with appearance, geometry, and motion.
- Streaming 4D Visual Geometry Transformer
  A causal transformer with key-value caching and distillation from a bidirectional VGGT model enables efficient online 4D geometry reconstruction from videos.
- RoDyGS: Robust Dynamic Gaussian Splatting for Casual Videos
  RoDyGS separates static and dynamic elements in monocular videos using Gaussian splatting with regularization and introduces the Kubric-MRig benchmark for pose-free dynamic novel view synthesis.
- Bundle Adjustment in the Eager Mode
  Introduces an eager-mode PyTorch BA library with GPU-accelerated sparse ops claiming 18.5-23x speedups over GTSAM, g2o, and Ceres.
- ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
  ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.
- PatchPoison: Poisoning Multi-View Datasets to Degrade 3D Reconstruction
  PatchPoison injects 12x12 pixel checkerboard patches into multi-view images to disrupt SfM feature matching, causing 3DGS reconstructions to diverge with 6.8x higher LPIPS error on NeRF-Synthetic while remaining unobtrusive.
- UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler
  UniDepthV2 predicts metric 3D points directly from single images using a self-promptable camera module, pseudo-spherical representation, and new losses for improved cross-domain generalization.
- Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs
  Splatt3R is a feed-forward network that predicts 3D Gaussian splats directly from uncalibrated stereo image pairs by extending MASt3R with appearance attributes and a two-stage training procedure.