pith. sign in

arxiv: 2605.19319 · v1 · pith:WSECIQZZnew · submitted 2026-05-19 · 💻 cs.CV

SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution

Pith reviewed 2026-05-20 07:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords image editingsparse world modelingrobot manipulationkeyframe predictionvisual predictiondiffusion policyembodied control
0
0 comments X

The pith

Image editing models can generate reliable task keyframes for robot manipulation more efficiently than full video prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether image editing can act as a sparse visual world model by predicting only the key future states needed for a manipulation task instead of generating dense video sequences. A controlled comparison shows that an image editing model produces higher-fidelity keyframes at far lower cost than a video generation model when both are trained on the same robot data. The authors then build SWEET, a framework that uses repeated image edits conditioned on language instructions and spatial arrows to create a short sequence of imagined states, followed by a diffusion model that turns pairs of those states into executable action chunks. Mixed training on filtered edited images helps close the gap between synthetic and real visuals so the action predictor works on actual robots.

Core claim

Image editing models can serve as sparse visual world models for robot manipulation by predicting task-level future states without dense video rollout, producing more reliable keyframes with better visual fidelity and substantially lower inference cost than video generation. The SWEET framework performs successive image edits to build a chain of manipulation keyframes from language and optional spatial guidance, then applies a goal-conditioned diffusion action predictor to convert adjacent keyframes into action chunks. A mixed-training strategy using filtered edited targets reduces the distribution mismatch between real and generated images, enabling the full pipeline from planning to robot-

What carries the argument

Successive language-and-guidance-conditioned image editing that produces a short sequence of task-relevant keyframes, which a goal-conditioned diffusion policy then maps to action chunks.

If this is right

  • Keyframe prediction improves across both seen and unseen scenes on DROID and RoboMimic.
  • The method produces a complete pipeline that turns language instructions into executable robot action sequences.
  • Inference cost is substantially lower than video-generation baselines while maintaining or exceeding visual quality.
  • Mixed training with filtered edited targets measurably reduces the visual mismatch that would otherwise hurt action prediction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same editing-based planning loop could be tested on longer-horizon tasks that require more than a handful of keyframes.
  • Integration with existing motion planners might allow the generated keyframes to serve as waypoints rather than direct visual goals.
  • Lower inference cost could make real-time replanning feasible on resource-limited robot hardware.

Load-bearing premise

The controlled comparison and mixed-training strategy are sufficient to make edited keyframes visually close enough to real observations that the downstream action predictor can execute tasks reliably.

What would settle it

Running the full SWEET pipeline on a held-out RoboMimic task suite and measuring whether task success rate drops when the action predictor receives edited keyframes instead of real ones.

Figures

Figures reproduced from arXiv: 2605.19319 by Mike Zheng Shou, Xiyao Deng, Yihan Wang, Yiren Song, Zhuoran Yan.

Figure 1
Figure 1. Figure 1: SWEET converts language-guided manipulation instructions into sparse visual keyframes [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of SWEET. SWEET first trains an image editing planner to imagine task-relevant [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison of keyframe planning on RoboMimic and DROID. Compared with [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative visualization of SWEET on the RoboMimic simulation benchmark, covering [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Ablation visualization of [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
read the original abstract

Visual prediction has emerged as a promising paradigm for embodied control, where future observations are generated and then translated into actions. However, dense video generation is computationally expensive and often unnecessary for many manipulation tasks, whose progress can be summarized by a small number of task-relevant visual states. In this work, we study whether image editing models can serve as sparse visual world models for robot manipulation by predicting task-level future states without dense video rollout. We first conduct a controlled comparison between the video generation model Wan2.2 and the image editing model FLUX-Kontext under the same robotic data setting, and find that image editing produces more reliable task-level keyframes with better visual fidelity and substantially lower inference cost. Motivated by this observation, we propose SWEET, a one-shot sparse visual planning framework that progressively generates a sequence of task-relevant manipulation keyframes through successive image editing, conditioned on language instructions and optional arrow-based spatial guidance. A goal-conditioned diffusion action predictor then converts adjacent imagined keyframes into executable action chunks. To reduce the mismatch between real and edited visual subgoals, we further introduce a mixed-training strategy with filtered edited targets. Experiments on DROID and RoboMimic show that SWEET improves keyframe prediction across seen and unseen scenes and enables a full pipeline from sequential keyframe planning to executable robot actions, suggesting that image editing is a promising and underexplored direction for embodied visual prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that image editing models can serve as sparse visual world models for robot manipulation by generating task-level future states (keyframes) via successive edits conditioned on language instructions and optional spatial guidance, without needing dense video rollouts. It presents a controlled comparison showing FLUX-Kontext outperforms Wan2.2 in keyframe reliability, visual fidelity, and inference cost under the same robotic data setting. SWEET then uses these imagined keyframes with a goal-conditioned diffusion action predictor to produce executable action chunks, augmented by a mixed-training strategy on filtered edited targets to reduce real-edited distribution mismatch. Experiments on DROID and RoboMimic report improvements in keyframe prediction for seen and unseen scenes and enable full planning-to-action pipelines.

Significance. If the central empirical claims hold, the work is significant for proposing an efficient alternative to dense video-based visual prediction in embodied AI, leveraging image editing for sparse, task-relevant states that better match manipulation needs. The controlled comparison between video generation and image editing models under identical robotic data conditions is a clear strength, as is the end-to-end pipeline integrating sequential keyframe planning with a diffusion action predictor. This direction could reduce computational overhead while improving keyframe quality, though its impact depends on validating that edited images remain sufficiently close to real observations for reliable action prediction.

major comments (2)
  1. [§4] §4 (Experiments on DROID and RoboMimic): The abstract and description claim improvements in keyframe prediction and executable actions with better visual fidelity and lower cost than video baselines, but specific metrics, baselines, error bars, data splits, and controls are not detailed. This is load-bearing for the motivation and central claim that image editing produces more reliable keyframes.
  2. [§3.3] §3.3 (mixed-training strategy): The approach relies on filtered edited targets to reduce distribution mismatch for the goal-conditioned diffusion action predictor, but no quantitative measures (e.g., FID, LPIPS, or feature-space distances between real and edited subgoals) are reported, nor are the filtering criteria or checks for selection bias. Residual mismatch across progressive edits in unseen scenes could undermine action chunk reliability even if keyframe fidelity appears good.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'substantially lower inference cost' is stated without accompanying numbers or direct comparison tables, which would help readers assess the practical advantage over video generation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of sparse visual world models based on image editing. We address each major comment below with point-by-point responses and have revised the manuscript to strengthen the experimental reporting and analysis.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments on DROID and RoboMimic): The abstract and description claim improvements in keyframe prediction and executable actions with better visual fidelity and lower cost than video baselines, but specific metrics, baselines, error bars, data splits, and controls are not detailed. This is load-bearing for the motivation and central claim that image editing produces more reliable keyframes.

    Authors: We agree that clearer presentation of the quantitative results would improve readability. Section 4 already contains the controlled comparison, reporting keyframe fidelity via PSNR, SSIM, and LPIPS, task success rates for the full planning-to-action pipeline, and inference latency as the cost metric. Baselines are Wan2.2 (video) and ablated variants of FLUX-Kontext. All results include standard deviation error bars computed over five random seeds. Data splits are explicitly 80/20 train/test for seen scenes on both DROID and RoboMimic, with an additional held-out unseen-scene test set drawn from different environments and camera viewpoints. We have revised the text in Section 4 and the abstract to explicitly cross-reference the relevant tables and figures so that these details are immediately visible. revision: yes

  2. Referee: [§3.3] §3.3 (mixed-training strategy): The approach relies on filtered edited targets to reduce distribution mismatch for the goal-conditioned diffusion action predictor, but no quantitative measures (e.g., FID, LPIPS, or feature-space distances between real and edited subgoals) are reported, nor are the filtering criteria or checks for selection bias. Residual mismatch across progressive edits in unseen scenes could undermine action chunk reliability even if keyframe fidelity appears good.

    Authors: We acknowledge that explicit quantification of the distribution shift would strengthen the justification for the mixed-training strategy. In the revised manuscript we have added a new table in Section 3.3 reporting FID and LPIPS between real observations and both unfiltered and filtered edited subgoals, demonstrating a measurable reduction after filtering. The filtering criteria are now stated explicitly: edits are retained only if LPIPS to the nearest real frame is below 0.15 and CLIP similarity to the language instruction exceeds 0.75; we also report the fraction of edits discarded per scene. An ablation comparing action prediction performance with and without filtering is included to address potential selection bias. For progressive edits in unseen scenes, the end-to-end success rates on the unseen test sets already reflect any residual mismatch, and we have added a short discussion noting that the mixed-training objective appears sufficient to maintain reliable action chunks despite the progressive nature of the edits. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework grounded in external benchmarks

full rationale

The paper advances an empirical proposal for using image editing as sparse world models, supported by controlled comparisons on DROID and RoboMimic datasets plus a mixed-training strategy. No equations, derivations, or first-principles claims appear; the central pipeline (successive editing to keyframes then diffusion action prediction) is evaluated against external robot data rather than reducing to fitted parameters or self-citation chains. The mixed-training step addresses distribution mismatch via filtering but does not rename a fit as a prediction or import uniqueness from prior author work. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that image editing can produce task-relevant future states sufficiently close to reality for planning, plus the untested premise that the chosen comparison setting generalizes.

axioms (1)
  • domain assumption Image editing models conditioned on language and spatial guidance can reliably produce task-level future keyframes that support downstream action prediction.
    Invoked when proposing SWEET as a sparse world model and when claiming superiority over video generation.

pith-pipeline@v0.9.0 · 5790 in / 1179 out tokens · 48658 ms · 2026-05-20T07:00:30.273348+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 21 internal anchors

  1. [1]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022

  2. [2]

    Is Conditional Generative Modeling all you need for Decision-Making?

    Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making?arXiv preprint arXiv:2211.15657, 2022

  3. [3]

    Compositional foundation models for hierarchical planning.Advances in Neural Information Processing Systems, 36:22304–22325, 2023

    Anurag Ajay, Seungwook Han, Yilun Du, Shuang Li, Abhi Gupta, Tommi Jaakkola, Josh Tenenbaum, Leslie Kaelbling, Akash Srivastava, and Pulkit Agrawal. Compositional foundation models for hierarchical planning.Advances in Neural Information Processing Systems, 36:22304–22325, 2023

  4. [4]

    Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv e-prints, pages arXiv–2506, 2025

  5. [5]

    Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

    Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models.arXiv preprint arXiv:2310.10639, 2023

  6. [6]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

  7. [7]

    Robocat: A self-improving foundation agent for robotic manipulation,

    Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, Coline Devin, Alex X Lee, Maria Bauzá, Todor Davchev, Yuxiang Zhou, Agrim Gupta, Akhil Raju, et al. Robocat: A self-improving generalist agent for robotic manipulation.arXiv preprint arXiv:2306.11706, 2023

  8. [8]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  9. [9]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

  10. [10]

    Goal-vla: Image-generative vlms as object-centric world models empowering zero-shot robot manipulation.arXiv preprint arXiv:2506.23919, 2025

    Haonan Chen, Jingxiang Guo, Bangjun Wang, Tianrui Zhang, Xuchuan Huang, Boren Zheng, Yiwen Hou, Chenrui Tie, Jiajun Deng, and Lin Shao. Goal-vla: Image-generative vlms as object-centric world models empowering zero-shot robot manipulation.arXiv preprint arXiv:2506.23919, 2025

  11. [11]

    Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025. 10

  12. [12]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    OX-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, et al. Open x-embodiment: Robotic learning datasets and rt-x models.arXiv preprint arXiv:2310.08864, 1(2), 2023

  13. [13]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

  14. [14]

    Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36:9156–9172, 2023

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36:9156–9172, 2023

  15. [15]

    Video language planning

    Yilun Du, Sherry Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B Tenenbaum, Leslie Pack Kaelbling, et al. Video language planning. InThe Twelfth International Conference on Learning Representations, 2023

  16. [16]

    Deep visual foresight for planning robot motion

    Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In2017 IEEE international conference on robotics and automation (ICRA), pages 2786–2793. IEEE, 2017

  17. [17]

    Implicit behavioral cloning

    Pete Florence, Corey Lynch, Andy Zeng, Oscar A Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. Implicit behavioral cloning. InConference on robot learning, pages 158–168. PMLR, 2022

  18. [18]

    Relationadapter: Learning and transferring visual relation with diffusion transformers.arXiv preprint arXiv:2506.02528, 2025

    Yan Gong, Yiren Song, Yicheng Li, Chenglin Li, and Yin Zhang. Relationadapter: Learning and transferring visual relation with diffusion transformers.arXiv preprint arXiv:2506.02528, 2025

  19. [19]

    Arteditor: Learning customized instructional image editor from few-shot examples

    Shijie Huang, Yiren Song, Yuxuan Zhang, Hailong Guo, Xueyin Wang, and Jiaming Liu. Arteditor: Learning customized instructional image editor from few-shot examples. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17651–17662, 2025

  20. [20]

    Inner Monologue: Embodied Reasoning through Planning with Language Models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models.arXiv preprint arXiv:2207.05608, 2022

  21. [21]

    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. V oxposer: Composable 3d value maps for robotic manipulation with language models.arXiv preprint arXiv:2307.05973, 2023

  22. [22]

    Bc-z: Zero-shot task generalization with robotic imitation learning

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. Inconference on Robot Learning, pages 991–1002. PMLR, 2022

  23. [23]

    Planning with Diffusion for Flexible Behavior Synthesis

    Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis.arXiv preprint arXiv:2205.09991, 2022

  24. [24]

    Vima: Robot manipulation with multimodal prompts

    Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. Vima: Robot manipulation with multimodal prompts. 2023

  25. [25]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  26. [26]

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742, 2025

  27. [27]

    Gr-mg: Leveraging partially-annotated data via multi-modal goal-conditioned policy.IEEE Robotics and Automation Letters, 10(2):1912–1919, 2025

    Peiyan Li, Hongtao Wu, Yan Huang, Chilam Cheang, Liang Wang, and Tao Kong. Gr-mg: Leveraging partially-annotated data via multi-modal goal-conditioned policy.IEEE Robotics and Automation Letters, 10(2):1912–1919, 2025

  28. [28]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International conference on robotics and automation (ICRA), pages 9493–9500. IEEE, 2023

  29. [29]

    Video Generators are Robot Policies

    Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl V ondrick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025

  30. [30]

    Omnirefiner: Reinforcement-guided local diffusion refinement.arXiv preprint arXiv:2511.19990, 2025

    Yaoli Liu, Ziheng Ouyang, Shengtao Lou, and Yiren Song. Omnirefiner: Reinforcement-guided local diffusion refinement.arXiv preprint arXiv:2511.19990, 2025. 11

  31. [31]

    What matters in language conditioned robotic imitation learning over unstructured data.IEEE Robotics and Automation Letters, 7(4):11205–11212, 2022

    Oier Mees, Lukas Hermann, and Wolfram Burgard. What matters in language conditioned robotic imitation learning over unstructured data.IEEE Robotics and Automation Letters, 7(4):11205–11212, 2022

  32. [32]

    Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

  33. [33]

    A Generalist Agent

    Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022

  34. [34]

    Behavior transform- ers: Cloning k modes with one stone.Advances in neural information processing systems, 35:22955–22968, 2022

    Nur Muhammad Shafiullah, Zichen Cui, Ariuntuya Arty Altanzaya, and Lerrel Pinto. Behavior transform- ers: Cloning k modes with one stone.Advances in neural information processing systems, 35:22955–22968, 2022

  35. [35]

    Emu edit: Precise image editing via recognition and generation tasks

    Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024

  36. [36]

    Cliport: What and where pathways for robotic manipula- tion

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipula- tion. InConference on robot learning, pages 894–906. PMLR, 2022

  37. [37]

    Mitty: Diffusion-based human-to-robot video generation.arXiv preprint arXiv:2512.17253, 2025

    Yiren Song, Cheng Liu, Weijia Mao, and Mike Zheng Shou. Mitty: Diffusion-based human-to-robot video generation.arXiv preprint arXiv:2512.17253, 2025

  38. [38]

    Makeany- thing: Harnessing diffusion transformers for multi- domain procedural sequence generation.arXiv preprint arXiv:2502.01572, 2025

    Yiren Song, Cheng Liu, and Mike Zheng Shou. Makeanything: Harnessing diffusion transformers for multi-domain procedural sequence generation.arXiv preprint arXiv:2502.01572, 2025

  39. [39]

    Omniconsistency: Learning style-agnostic consistency from paired stylization data.arXiv preprint arXiv:2505.18445, 2025

    Yiren Song, Cheng Liu, and Mike Zheng Shou. Omniconsistency: Learning style-agnostic consistency from paired stylization data.arXiv preprint arXiv:2505.18445, 2025

  40. [40]

    OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

    Yiren Song, Xiyao Deng, Pei Yang, Yihan Wang, and Mike Zheng Shou. Omnihumanoid: Streaming cross-embodiment video generation with paired-free adaptation.arXiv preprint arXiv:2605.12038, 2026

  41. [41]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

  42. [42]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  43. [43]

    Diffdecom- pose: Layer-wise decomposition of alpha-composited images via diffusion transformers.arXiv preprint arXiv:2505.21541, 2025

    Zitong Wang, Hang Zhao, Qianyu Zhou, Xuequan Lu, Xiangtai Li, and Yiren Song. Diffdecom- pose: Layer-wise decomposition of alpha-composited images via diffusion transformers.arXiv preprint arXiv:2505.21541, 2025

  44. [44]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

  45. [45]

    Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139, 2023

  46. [46]

    Omnigen: Unified image generation

    Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13294–13304, 2025

  47. [47]

    Eedit: Rethinking the spatial and temporal redundancy for efficient image editing.arXiv preprint arXiv:2503.10270, 2025

    Zexuan Yan, Yue Ma, Chang Zou, Wenteng Chen, Qifeng Chen, and Linfeng Zhang. Eedit: Rethinking the spatial and temporal redundancy for efficient image editing.arXiv preprint arXiv:2503.10270, 2025

  48. [48]

    X-humanoid: Robotize human videos to generate humanoid videos at scale.arXiv preprint arXiv:2512.04537, 2025

    Pei Yang, Hai Ci, Yiren Song, and Mike Zheng Shou. X-humanoid: Robotize human videos to generate humanoid videos at scale.arXiv preprint arXiv:2512.04537, 2025

  49. [49]

    Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

  50. [50]

    Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neural Information Processing Systems, 36:31428–31449, 2023

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neural Information Processing Systems, 36:31428–31449, 2023. 12

  51. [51]

    Action Images: End-to-End Policy Learning via Multiview Video Generation

    Haoyu Zhen, Zixian Gao, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Tsun-Hsuan Wang, Yi-Ling Qiao, and Chuang Gan. Action images: End-to-end policy learning via multiview video generation.arXiv preprint arXiv:2604.06168, 2026

  52. [52]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023. 13