{"total":21,"items":[{"citing_arxiv_id":"2605.19319","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution","primary_cat":"cs.CV","submitted_at":"2026-05-19T03:54:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SWEET is a one-shot sparse visual planning framework that progressively generates manipulation keyframes via image editing conditioned on language and spatial guidance, then converts them to actions with a diffusion predictor, showing better fidelity and lower cost than video models on DROID and Rob","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15735","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UAM: A Dual-Stream Perspective on Forgetting in VLA Training","primary_cat":"cs.CV","submitted_at":"2026-05-15T08:45:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation tasks including OOD cases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12167","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T14:15:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12090","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"World Action Models: The Next Frontier in Embodied AI","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"UniPi [6], VLP [ 7], RoboEnvision [9], ThisThat [ 65], TesserAct [66], MVISTA-4D [67] Say ,Dream,and Act [10], Gen2Act [68], A VDC [8], Im2Flow2Act [69], 3DFlowAction [70] NovaFlow [71], Dream2Flow [72], Dreamitate [ 73], 4DGen [ 74], RIGVid [75], L VP [76] Vidar [77], Veo-Act [78], pi0.7 [ 79], V AG [80] Implicit VPP [11], VILP [ 81], Video Policy [13], ARDuP [ 82], mimic-video [ 12], LAP A [15], villa-X [ 83], S-V AM [14], OmniVTA [84], MWM [85] Joint W AM Autoregression GR1 [86], grmg [ 87], GR2 [88], Co TVLA [89], WorldVLA [90], rynnvla2 [91] VLA-JEP A [92], F1-VLA [93] Diffusion-based P AD [21], VideoVLA [94], UWM [20], DreamZero [ 17], CosmosPolicy [16], FLARE [95], UV A [96] FRAPPE [97], CoV AR [98], LDA1B [99], W A V [100], DUST [101], LingBotV A [18], AIM [ 102]"},{"citing_arxiv_id":"2605.07514","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-08T09:44:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"The overhead of computing selection weights and consistency scores is negligible, taking approximately0.7ms forN= 8. 7 5.1 Results on RoboCasa Table 1:Results on RoboCasa.\" ∗\" denotes our reim- plementation; all other results are taken from [22].TTS indicates test-time scaling. Method TTS Average SR (%) UV A [26]✗50.0 DP-VLA [16]✗57.3 UWM [45]✗60.8 π0 [6]✗62.5 GR00T-N1.5 [5]✗64.1 Video-Policy [27]✗66.0 FLARE [44]✗66.4 Cosmos-Policy∗ [22]✗66.6 + Value-Prediction∗ ✓67.4 + Consistency-Consensus (ours)✓67.3 + Consistency-Exploring (ours)✓68.0 We follow the evaluation protocol of Cosmos-Policy [ 22] on RoboCasa [ 29], which contains 24 kitchen manipulation tasks performed by a single Franka Emika Panda robot arm. For each task, we evaluate 50 trials and report the average"},{"citing_arxiv_id":"2605.06222","ref_index":13,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"When to Trust Imagination: Adaptive Action Execution for World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-07T13:18:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(ICRA), pages 4845-4852. IEEE, 2025. [12] Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026. [13] Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model.arXiv preprint arXiv:2503.00200, 2025. [14] Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl V ondrick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025. [15] Yuanchang Liang, Xiaobo Wang, Kai Wang, Shuo Wang, Xiaojiang Peng, Haoyu Chen, David Kim Huat Chua, and Prahlad Vadakkepat. Adaptive action chunking at inference-time for vision-language-action"},{"citing_arxiv_id":"2605.06192","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields","primary_cat":"cs.CV","submitted_at":"2026-05-07T13:06:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Multi-view video diffusion policy: A 3d spatio-temporal-aware video action model.arXiv preprint arXiv:2604.03181, 2026. [9] Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model.arXiv preprint arXiv:2503.00200, 2025. [10] Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, and Yi Xu. Worldeval: World model as real-world robot policies evaluator.arXiv preprint arXiv:2505.19017, 2025. [11] Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl V ondrick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025. [12] OpenAI. Video generation models as world simulators. https://openai.com/index/ video-generation-models-as-world-simulators/, 2024. Technical report/blog post. [13] Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis"},{"citing_arxiv_id":"2605.00080","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"World Model for Robot Learning: A Comprehensive Survey","primary_cat":"cs.RO","submitted_at":"2026-04-30T14:35:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datasets, and benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00078","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Being-H0.7: A Latent World-Action Model from Egocentric Videos","primary_cat":"cs.RO","submitted_at":"2026-04-30T14:16:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[56] Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model.arXiv preprint arXiv:2503.00200, 2025. [57] Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets.arXiv preprint arXiv:2504.02792, 2025. [58] Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025. [59] Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model."},{"citing_arxiv_id":"2604.15483","ref_index":101,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"${\\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities","primary_cat":"cs.LG","submitted_at":"2026-04-16T19:18:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[99] Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl V on- drick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025. 3 [100] Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026. [101] Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text- guided video generation.Advances in Neural Informa- tion Processing Systems (NeurIPS), 2023. 3 [102] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Max- imilian Nickel, and Matt Le. Flow matching for"},{"citing_arxiv_id":"2604.13645","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies","primary_cat":"cs.RO","submitted_at":"2026-04-15T09:14:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"s∗(xt, t) =−ϵ ∗(xt, t)/σt = 1 σ2 t [αtE[x0|xt]−x t].(11) And we have E[x0|xt] = PN i=1 p(xt|x0 =x i)·x i PN i=j p(xt|x0 =x j) (12) wherep(x t|x0 i ) =N(x t;α txi, σ2 t Id). Then suppose we have source and target datasets as DT ={x i}N i=1 and DS ={x j}M j=1, co-train a diffusion model with mixing ratiow, this gives us the training objective as: Lw(t) :=w· L DT + (1−w)· L DS (13) Similarly, we can get the analytical optimal score function as: s∗ w(xt, t) = ˆwt ·s ∗ t (xt, t) + ˆws ·s ∗ s(xt, t)(14) where ˆwt := wpt(xt) wpt(xt) + (1−w)p s(xt) (15) Proof. To findf(B) that minimizes L(f) =wE t[∥A−f(B)∥ 2]+(1−w)E s[∥A−f(B)∥ 2], we define a mixture probability densityp w(a, b) =wp t(a, b) + (1−w)p s(a, b). By expressing the expectations as integrals, we have:"},{"citing_arxiv_id":"2604.12908","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \\rightarrow G$): Vision-Geometry Backbones over Language and Video Models","primary_cat":"cs.RO","submitted_at":"2026-04-14T15:57:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11135","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps","primary_cat":"cs.RO","submitted_at":"2026-04-13T07:48:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08168","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ViVa: A Video-Generative Value Model for Robot Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2026-04-09T12:28:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"action model via instruction-driven routing & sparsification.arXiv preprint arXiv:2508.21046, 2025. 1 [27] Xinghang Li, Peiyan Li, Long Qian, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Xinlong Wang, Di Guo, et al. What matters in building vision-language-action models for generalist robots. Nature Machine Intelligence, pages 1-15, 2026. 1 [28] Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025. 5 15 ViVa: A Video-Generative Value Model for Robot Reinforcement Learning [29] Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang"},{"citing_arxiv_id":"2604.06168","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Action Images: End-to-End Policy Learning via Multiview Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-07T17:59:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":", Touati, A., et al.: Bfm-zero: A promptable behavioral foundation model for humanoid control using unsupervised reinforcement learning. arXiv preprint arXiv:2511.04131 (2025) [36] Li, Z., Zhang, M., Wu, T., Tan, J., Wang, J., Lin, D.: Ss4d: Native 4d generative model via structured spacetime latents. ACM Transactions on Graphics (TOG)44(6), 1-12 (2025) [37] Liang, J., Tokmakov, P., Liu, R., Sudhakar, S., Shah, P., Ambrus, R., Vondrick, C.: Video generators are robot policies. arXiv preprint arXiv:2508.00795 (2025) [38] Lightricks:Ltxstudio.Online(2024),https://app.ltx.studio/,accessed: 2026-02 [39] Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210."},{"citing_arxiv_id":"2604.04502","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?","primary_cat":"cs.RO","submitted_at":"2026-04-06T07:57:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"low-level policy for dexterous interaction; otherwise it continues consuming the planned action queue. The system can switch back and resume the remaining planned actions to complete the task. reserve the wrist camera for low-level policy execution after switching. To enable large-scale data collection and realistic evaluation, we build a high-fidelity IsaacLab simulation[31, 33, 34] that mirrors the physical setup. 2) Dataset:To train the multi-head IDM, we collect 300k frame-pair samples in simulation. Each dataset contains tra- jectories of 100 to 200 steps, where the robot performs random motions interleaved with grasp and release actions. At each step, in addition to recording the global-view camera image, we record the corresponding 21-dimensional single-"},{"citing_arxiv_id":"2603.16666","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Fast-WAM: Do World Action Models Need Test-time Future Imagination?","primary_cat":"cs.CV","submitted_at":"2026-03-17T15:33:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"An important direction for future work is to study the effect of larger-scale pretraining data and model scaling on this design. 9 References [1] Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint 2512.15692, 2025. [2] Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl V ondrick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025. [3] Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, and Yinghao Xu. Causal world modeling for robot control."},{"citing_arxiv_id":"2603.09030","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PlayWorld: Learning Robot World Models from Autonomous Play","primary_cat":"cs.RO","submitted_at":"2026-03-09T23:58:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy performance via model-based RL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.00110","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-02-18T14:58:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.15922","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"World Action Models are Zero-shot Policies","primary_cat":"cs.RO","submitted_at":"2026-02-17T15:04:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Dreamitate: Real-world visuomotor policy learning via video generation. In8th Annual Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id=InT87E5sr4. 5 [63] Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025. 2, 5 [64] Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635, 2025. 3, 5, 7 [65] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for"},{"citing_arxiv_id":"2512.15692","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs","primary_cat":"cs.RO","submitted_at":"2025-12-17T18:47:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}