SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency
Canonical reference. 80% of citing Pith papers cite this work as background.
UNVERDICTED 23
representative citing papers
A feed-forward video latent transformer that predicts time-varying 3D Gaussian primitives from one image to produce controllable 4D scenes with appearance, geometry, and motion.
A framework disentangles local joint motion from global movement, trains a 2D local generator on text-2D pairs, then fine-tunes on 3D data to output view-consistent 3D motions.
A geometry-aware 4D video generation model trained with cross-view pointmap alignment to produce spatio-temporally consistent future videos from novel viewpoints for robot manipulation.
citing papers explorer
-
DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis
DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4D, iPhone, and DL3DV datasets.
-
Probing into Camera Control of Video Models
A training-free method reformulates camera control as geometric displacement fields applied via differentiable latent resampling, enabling control and bias probing in video diffusion models.
-
DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
DreamStereo uses GAPW, PBDP, and SASI to enable real-time stereo video inpainting at 25 FPS for HD videos by reducing over 70% redundant computation while maintaining quality.
-
Action Images: End-to-End Policy Learning via Multiview Video Generation
Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
-
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
-
SimWorlds: A Multi-Agent System for Dynamic 3D Scene Creation
SimWorlds presents a multi-agent system with planner-coder-reviewer workflow, layered scene protocol, and runtime inspection tools to create dynamic 4D scenes from text, plus the 4DBuildBench benchmark showing outperformance over baselines.
-
HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration
HAT-4D presents an agentic VLM-plus-human-in-the-loop pipeline for monocular 4D multi-object interaction reconstruction and releases the MVOIK-4D benchmark.
-
TriMotion: Modality-Agnostic Camera Control for Video Generation
TriMotion is a modality-agnostic framework that maps video, pose, and text descriptions of the same camera trajectory into a shared motion embedding space, trained with a new triplet dataset and latent consistency objective, to produce videos that follow the target trajectory.
-
Streaming Video Generation with Streaming Force Control
StreamForce presents a unified causal model for force-controllable streaming video generation using a new force representation and distillation pipeline, claiming SOTA force adherence and 16.6 FPS performance.
-
E³C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control
E³C is a video diffusion model that disentangles persistent 3D scene structure via point-cloud memory from human dynamics via ego-exo pose controls for improved egocentric video generation on the Nymeria dataset.
-
Fast 4D Mesh Generation by Spatio-Temporal Attention Chains
A training-free Spatio-Temporal Attention Chain framework accelerates 4D mesh generation 13x, improves quality, scales to 16x longer videos, and supports downstream tracking and camera estimation.
-
Velox: Learning Representations of 4D Geometry and Appearance
Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth simulation.
-
Embody4D: A Generalist Data Engine for Embodied 4D World Modeling
Embody4D generates novel-view videos from monocular robot videos via a 3D-aware synthesis pipeline, confidence-aware expert modulation, and interaction-aware attention for embodied 4D world modeling.
-
Vista4D: Video Reshooting with 4D Point Clouds
Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.
-
PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation
PAIWorld adds explicit geometric cross-view mechanisms and 3D distillation to DiT world models to achieve multi-view 3D consistency in robotic manipulation benchmarks.
-
Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction
A multi-view video diffusion model conditioned on relative camera poses via extended RoPE generates dense synchronized views from sparse inputs for 4D Gaussian splatting reconstruction, claiming SOTA results on human datasets and generalization to animals.
-
CP4D: Compositional Physics-aware 4D Scene Generation
CP4D generates physically consistent 4D scenes via compositional integration of pre-trained 3D models, hybrid simulator-diffusion motion synthesis, and automated scene composition.
-
SkelMo: Universal Skeletal Motion Generation for 3D Rigged Shapes
SkelMo introduces a category-agnostic diffusion framework for skeletal motion generation from 2D videos, trained on a new dataset of ~20,000 rigged 3D animations with a structural-semantic injection mechanism.
-
Efficient 3D Content Reconstruction and Generation
Presents Instant3D for rapid text/image-to-3D generation via multi-view diffusion plus feed-forward reconstruction, and FastMap for 10x faster structure-from-motion with comparable accuracy.
-
QuadLink: Autoregressive Quad-Dominant Mesh Generation via Point-Relation Learning
QuadLink generates anisotropic quad-dominant meshes from point clouds via autoregressive anchor prediction and centroid-conditioned linking, with a Tri-to-Quad data converter and quad-first assembly.