{"total":17,"items":[{"citing_arxiv_id":"2605.12587","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"MapAnything: Universal feed-forward metric 3D reconstruction.arXiv preprint arXiv:2509.13414, 2025. [38] Jisoo Kim, Jungbin Cho, Sanghyeok Chu, Ananya Bal, Jinhyung Kim, Gunhee Lee, Sihaeng Lee, Se- ung Hwan Kim, Bohyung Han, Hyunmin Lee, et al. Pri4R: Learning world dynamics for vision-language- action models with privileged 4D representation.arXiv preprint arXiv:2603.01549, 2026. [39] Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B. Tenenbaum. Learning to act from actionless videos through dense correspondences.arXiv preprint arXiv:2310.08576, 2023. [40] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. 
HunyuanVideo: A systematic framework for large video generative models."},{"citing_arxiv_id":"2605.12090","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"World Action Models: The Next Frontier in Embodied AI","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"World-Gymnast [53], RWM-U [54], World4RL [55], VIPER [ 56] PhysWorld [57], Diffusion Reward [58], GenReward [59] Evaluation Ctrl-World [38], Veo Robotics [60], Interactive World Simulator [61] WorldEval [62], WorldGym [63], dWorldEval [64] Architecture Cascaded W AM Explicit UniPi [6], VLP [ 7], RoboEnvision [9], ThisThat [ 65], TesserAct [66], MVISTA-4D [67] Say ,Dream,and Act [10], Gen2Act [68], A VDC [8], Im2Flow2Act [69], 3DFlowAction [70] NovaFlow [71], Dream2Flow [72], Dreamitate [ 73], 4DGen [ 74], RIGVid [75], L VP [76] Vidar [77], Veo-Act [78], pi0.7 [ 79], V AG [80] Implicit VPP [11], VILP [ 81], Video Policy [13], ARDuP [ 82], mimic-video [ 12], LAP A [15], villa-X [ 83], S-V AM [14], OmniVTA [84], MWM [85] Joint W AM"},{"citing_arxiv_id":"2605.07474","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations","primary_cat":"cs.CV","submitted_at":"2026-05-08T09:20:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using 
contrastive planning plus adaptive aggregation to avoid feature collapse.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"[31] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024. 11 [32] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025. [33] Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B. Tenenbaum. Learning to act from actionless videos through dense correspondences.arXiv preprint arXiv:2310.08576, 2023. [34] Ilya Kostrikov, Denis Yarats, and Rob Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. InInternational Conference on Learning Representations (ICLR),"},{"citing_arxiv_id":"2605.03637","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing","primary_cat":"cs.RO","submitted_at":"2026-05-05T11:09:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from single human demonstrations without paired data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.28185","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World 
Modeling","primary_cat":"cs.CV","submitted_at":"2026-04-30T17:59:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemphasizing perceptual quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15938","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-17T10:56:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robotic policies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Diffusion Policy [3] was the first to demonstrate that the iterative denoising mechanism of diffusion models outperforms traditional Gaussian policies in high- dimensional continuous control tasks, enabling smoother, more stable, and more diverse action distributions. Subsequent studies have extended this framework to various domains of robotic manipulation, including trajectory generation [9] [34], grasp planning [12] [20], 4D spatiotemporal awareness [17]and visual data aug- mentation for vision-based manipulation [32], providing new pathways for com- plex task decomposition, generalizable control, and multimodal perception. Diffusion Models. 
Diffusion models are generative models that learn data distributions through a two-stage noising-denoising Markov process."},{"citing_arxiv_id":"2604.11386","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation","primary_cat":"cs.RO","submitted_at":"2026-04-13T12:25:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"4 World Simulator for Robotic Manipulation Scalable robot learning [2,7,9,29,63] depends on abundant, realistic data, yet collecting real-world trajectories via human demonstrations is slow and labor-intensive, limiting broad access. Generative video models [1,50] offer a cost-effective way to synthesize policy training data. 
UniPi [14] and AVDC [21] cast robot planning as text-to-video generation (AVDC further estimates inverse dynamics with a pretrained flow network); UniSim [53] learns a unified real-world simulator across text and control inputs; RoboDreamer [61] targets compositional generalization via text parsing; and IRASim [62] performs trajectory-conditioned video generation but focuses on arm motion only."},{"citing_arxiv_id":"2604.06168","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Action Images: End-to-End Policy Learning via Multiview Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-07T17:59:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2601.16163 (2026) [28] Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024) [29] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013) [30] Ko, P.C., Mao, J., Du, Y., Sun, S.H., Tenenbaum, J.B.: Learning to act from actionless videos through dense correspondences. arXiv preprint arXiv:2310.08576 (2023) [31] Lee, J., Duan, J., Fang, H., Deng, Y., Liu, S., Li, B., Fang, B., Zhang, J., Wang, Y.R., Lee, S., et al.: Molmoact: Action reasoning models that can reason in space. 
arXiv preprint arXiv:2508."},{"citing_arxiv_id":"2604.04974","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data","primary_cat":"cs.RO","submitted_at":"2026-04-04T15:37:11+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03181","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model","primary_cat":"cs.RO","submitted_at":"2026-04-03T16:57:06+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and action generation separately, and those that train them jointly. Separate training.Many methods adopt a two-stage pipeline, first learning to predict future visual observations and then mapping visual representations to actions [ 24-27]. Variants differ in the intermediate representations used, including RGB videos [24, 25], human demonstration videos [28], optical flow [29], and 2D point trajectories [30]. Joint training.More recent work jointly predicts future videos and actions within a unified frame- work [9, 31-37], enabling tighter coupling between perception and control. 
Our method also follows this paradigm, but differs by implicitly incorporating 3D structural priors, aligning the fine-tuning with the video pretraining and leveraging internet-scale pretrained video foundation models."},{"citing_arxiv_id":"2512.15840","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Video Planner Enables Generalizable Robot Control","primary_cat":"cs.RO","submitted_at":"2025-12-17T18:35:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.01773","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"IGen: Scalable Data Generation for Robot Learning from Open-World Images","primary_cat":"cs.RO","submitted_at":"2025-12-01T15:15:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"IGen generates realistic visuomotor training data including actions and temporally coherent visuals from unstructured open-world images via 3D reconstruction and VLM reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.00990","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations","primary_cat":"cs.RO","submitted_at":"2025-07-01T17:39:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RIGVid shows that filtered AI-generated videos can serve as effective supervision for complex robotic manipulation tasks without any real 
demonstrations.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"ter than the generation of a more sparse and high-level rep- resentation. Next, given a generated video, one may ask whether 6D object-level tracking is necessary, given its up- front requirement of an object mesh. To address this ques- tion, we compare against a broad range of alternative track- ing approaches - sparse point tracking [15], dense optical flow [60], 3D feature fields [58], and generated goal super- vision [14] - and show consistently higher success rates. In summary, our key contributions are: (1) We propose a framework that enables robots to perform open-world ma- nipulation tasks without any real-world demonstrations - only generated videos. (2) We show high-quality generated videos perform on par with real videos for robotic imitation,"},{"citing_arxiv_id":"2503.00200","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unified Video Action Model","primary_cat":"cs.RO","submitted_at":"2025-02-28T21:38:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without performance loss versus task-specific methods.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"For the action diffusion decoder, all latent tokens in Zt+1 are aggregated using a convolutional layer, followed by an MLP layer, to produce an action latent. This latent encodes both visual and action-related information for the current step and serves as the condition for the action diffusion model to generate the action chunk At. We use the diffusion head (base size) from [27] for both action and video prediction. 
During training, the decoders learn to predict the noise added to noisy action chunks or video patches. The action diffusion loss [21, 37, 39] is defined as: Laction(Z, A) = Eϵ,k [︂ ∥ϵ − ϵθ(A(k)|k, Z)∥2 ]︂ , where A(k) represents the noisy actions, ϵ is the added noise, k is the diffusion timestep, Z is the joint video-action latent"},{"citing_arxiv_id":"2411.04983","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning","primary_cat":"cs.RO","submitted_at":"2024-11-07T18:54:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DINO-WM builds world models on pre-trained DINOv2 features to enable zero-shot planning from offline data without rewards or demonstrations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.03568","ref_index":130,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agent AI: Surveying the Horizons of Multimodal Interaction","primary_cat":"cs.AI","submitted_at":"2024-01-07T19:11:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.00025","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Any-point Trajectory Modeling for Policy Learning","primary_cat":"cs.RO","submitted_at":"2023-12-28T23:34:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ATM pre-trains models to predict 
trajectories of any points in videos, then uses those predictions to learn strong visuomotor policies from minimal action labels, beating baselines by 80% on 130+ tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}