hub Canonical reference

arXiv preprint arXiv:2601.03782 (2026)

Canonical reference. 89% of citing Pith papers cite this work as background.

35 Pith papers citing it

Background 89% of classified citations

citation-role summary

background: 8 · method: 1

citation-polarity summary

background: 8 · use method: 1

representative citing papers

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

cs.CV · 2026-05-12 · unverdicted · novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

Resolving Long-Tail Ambiguity in Unsupervised 3D Point Cloud Segmentation with Language Priors

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

LangTail uses entity-level semantic priors from language models aligned via contrastive learning in a hierarchical clustering setup to resolve long-tail ambiguity, yielding +13.5, +12.9, and +8.9 mIoU gains on ScanNet-v2, S3DIS, and nuScenes.

RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation

cs.RO · 2026-05-17 · unverdicted · novelty 7.0

RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.

Learning Visual Feature-Based World Models via Residual Latent Action

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

cs.CV · 2026-06-17 · unverdicted · novelty 6.0

Introduces a new task of goal-conditioned 3D point motion forecasting along with a 1.16M-video dataset, a 111-category benchmark, and a model that outperforms baselines while transferring to robotics and video generation.

Geometric Action Model for Robot Policy Learning

cs.RO · 2026-06-15 · unverdicted · novelty 6.0

GAM splits a geometric foundation model to enable language-conditioned future geometry prediction and action decoding for robot policies, claiming superior performance on manipulation benchmarks.

Unified Motion-Action Modeling for Heterogeneous Robot Learning

cs.RO · 2026-06-15 · unverdicted · novelty 6.0

UMA treats object motion and robot actions as co-evolving variables under a masked generative objective with hindsight relabeling and contrastive disentanglement to support multi-task pretraining and deployment across heterogeneous robot data.

NEXUS: Neural Energy Fields for Physically Consistent Contact-Rich 3D Object Dynamics

cs.CV · 2026-06-12 · unverdicted · novelty 6.0

NEXUS introduces a graph-based neural energy-field model that derives forces from scalar energy and dissipation terms to achieve physically consistent contact-rich 3D dynamics.

DynaTok: Token-Based 4D Reconstruction from Partial Point Clouds

cs.CV · 2026-06-10 · unverdicted · novelty 6.0

DynaTok introduces a token-based framework for correspondence-free 4D reconstruction from partial point cloud sequences via latent encoding, transformer aggregation, residual decoupling, and flow-matching decoding.

Simultaneous Contact Selection and Planning for Contact-Rich Manipulation with Cascaded Optimization

cs.RO · 2026-05-27 · unverdicted · novelty 6.0

SCSP is a cascaded optimization framework using a surrogate contact model and discrete-continuous search to enable simultaneous contact selection and planning for robust contact-rich manipulation.

DexSIM: Real-time Dexterous Simulation with Unified Causal Video Diffusion

cs.CV · 2026-05-23 · unverdicted · novelty 6.0

DexSIM is a bi-directional video diffusion model with hand trajectory embedding and spatial memory cache for real-time dexterous hand-object simulation at 15 FPS.

GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

cs.CV · 2026-05-20 · unverdicted · novelty 6.0 · 2 refs

GEM-4D improves video world models for robot manipulation by distilling 4D geometric correspondences into training and adding an inverse dynamics module, achieving SOTA geometric consistency and 81% real-world success.

GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation

cs.RO · 2026-05-20 · unverdicted · novelty 6.0

GaussianDream is a feed-forward 3D Gaussian world model plug-in that conditions VLA policies on learned 3D spatial and future evolution representations for improved robotic manipulation performance.

GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

cs.RO · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

GuidedVLA improves VLA generalization by supervising individual attention heads with manually defined auxiliary signals for three task-relevant factors.

RigidFormer: Learning Rigid Dynamics using Transformers

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

RigidFormer learns mesh-free rigid dynamics from point clouds using object-centric anchors, Anchor-Vertex Pooling, Anchor-based RoPE, and differentiable Kabsch alignment to enforce rigidity.

Embody4D: A Generalist Data Engine for Embodied 4D World Modeling

cs.CV · 2026-05-03 · unverdicted · novelty 6.0 · 2 refs

Embody4D generates novel-view videos from monocular robot videos via a 3D-aware synthesis pipeline, confidence-aware expert modulation, and interaction-aware attention for embodied 4D world modeling.

Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

cs.RO · 2026-04-29 · unverdicted · novelty 6.0 · 2 refs

X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

Human Cognition in Machines: A Unified Perspective of World Models

cs.RO · 2026-04-17 · unverdicted · novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.

LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

LAMP extracts continuous 3D inter-object transformations from image editing to serve as geometry-aware priors for zero-shot open-world robotic manipulation.

World Action Models are Zero-shot Policies

cs.RO · 2026-02-17 · unverdicted · novelty 6.0

DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.

PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

cs.RO · 2026-01-11 · unverdicted · novelty 6.0

PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

3D Point World Models: Point Completion Enables More Accurate Dynamics Learning

cs.RO · 2026-06-30 · unverdicted · novelty 5.0

3DPWM completes partial point clouds then learns dynamics on the completed 3D scenes to produce reliable long-horizon rollouts for model-based robotic planning.

DVG-WM: Disentangled Video Generation Enables Efficient Embodied World Model for Robotic Manipulation

cs.RO · 2026-06-30 · unverdicted · novelty 5.0 · 2 refs

DVG-WM disentangles dynamics learning from visual synthesis via flow matching and latent degradation to deliver faster, higher-quality video predictions for robotic manipulation.
