TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving state-of-the-art performance with lower compute.
arXiv preprint arXiv:2601.03782 (2026)
Canonical reference. 89% of citing Pith papers cite this work as background.
years
2026: 29
representative citing papers
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
LangTail uses entity-level semantic priors from language models, aligned via contrastive learning in a hierarchical clustering setup, to resolve long-tail ambiguity, yielding mIoU gains of +13.5, +12.9, and +8.9 on ScanNet-v2, S3DIS, and nuScenes, respectively.
RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.
RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.
SCSP is a cascaded optimization framework using a surrogate contact model and discrete-continuous search to enable simultaneous contact selection and planning for robust contact-rich manipulation.
DexSIM is a bi-directional video diffusion model with hand trajectory embedding and spatial memory cache for real-time dexterous hand-object simulation at 15 FPS.
GEM-4D improves video world models for robot manipulation by distilling 4D geometric correspondences into training and adding an inverse dynamics module, achieving SOTA geometric consistency and 81% real-world success.
GaussianDream is a feed-forward 3D Gaussian world model plug-in that conditions VLA policies on learned 3D spatial and future evolution representations for improved robotic manipulation performance.
GuidedVLA improves VLA generalization by supervising individual attention heads with manually defined auxiliary signals for three task-relevant factors.
RigidFormer learns mesh-free rigid dynamics from point clouds using object-centric anchors, Anchor-Vertex Pooling, Anchor-based RoPE, and differentiable Kabsch alignment to enforce rigidity.
Embody4D generates novel-view videos from monocular robot videos via a 3D-aware synthesis pipeline, confidence-aware expert modulation, and interaction-aware attention for embodied 4D world modeling.
X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.
LAMP extracts continuous 3D inter-object transformations from image editing to serve as geometry-aware priors for zero-shot open-world robotic manipulation.
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.
PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
3DPWM completes partial point clouds then learns dynamics on the completed 3D scenes to produce reliable long-horizon rollouts for model-based robotic planning.
The survey organizes world models for robotic manipulation into representation families, a functional taxonomy, and infrastructure roles across pretraining, post-training, and inference, and reviews 34 datasets and evaluation protocols.
PhyWorld improves temporal consistency and physical plausibility in video world models via flow matching fine-tuning followed by DPO on physics preference pairs, with reported gains on VBench and a custom physical-faithfulness benchmark.
Nano World Models supplies a unified minimalist codebase and evaluation framework for studying diffusion forcing in video prediction across control, games, and robot domains.
ChronoFlow-Policy uses a unified ChronoFlow representation of past-current-future dynamics learned jointly with actions in a diffusion policy, outperforming baselines on 14 simulated and 5 real manipulation tasks.
The paper proposes a world model-empowered SCD integration framework for complex unmanned systems using AoI-driven sensing, a predictive hybrid latent world model, and a multi-granularity knowledge graph.
The paper delivers a multi-axis taxonomy for world models that maps architectures, training families, reasoning strategies, and domains from early cognitive foundations through systems such as Dreamer, MuZero, and Sora while noting evaluation gaps.
citing papers explorer
- From World Models to World Action Models: A Concise Tutorial for Robotics
- DVG-WM: Disentangled Video Generation Enables Efficient Embodied World Model for Robotic Manipulation
- Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
- From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data