SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution
Pith reviewed 2026-05-20 07:00 UTC · model grok-4.3
The pith
Image editing models can generate reliable task keyframes for robot manipulation more efficiently than full video prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Image editing models can serve as sparse visual world models for robot manipulation by predicting task-level future states without dense video rollout, producing more reliable keyframes with better visual fidelity and substantially lower inference cost than video generation. The SWEET framework performs successive image edits to build a chain of manipulation keyframes from language and optional spatial guidance, then applies a goal-conditioned diffusion action predictor to convert adjacent keyframes into action chunks. A mixed-training strategy using filtered edited targets reduces the distribution mismatch between real and generated images, enabling the full pipeline from planning to robot-
What carries the argument
Successive language-and-guidance-conditioned image editing that produces a short sequence of task-relevant keyframes, which a goal-conditioned diffusion policy then maps to action chunks.
If this is right
- Keyframe prediction improves across both seen and unseen scenes on DROID and RoboMimic.
- The method produces a complete pipeline that turns language instructions into executable robot action sequences.
- Inference cost is substantially lower than video-generation baselines while maintaining or exceeding visual quality.
- Mixed training with filtered edited targets measurably reduces the visual mismatch that would otherwise hurt action prediction.
Where Pith is reading between the lines
- The same editing-based planning loop could be tested on longer-horizon tasks that require more than a handful of keyframes.
- Integration with existing motion planners might allow the generated keyframes to serve as waypoints rather than direct visual goals.
- Lower inference cost could make real-time replanning feasible on resource-limited robot hardware.
Load-bearing premise
The controlled comparison and mixed-training strategy are sufficient to make edited keyframes visually close enough to real observations that the downstream action predictor can execute tasks reliably.
What would settle it
Running the full SWEET pipeline on a held-out RoboMimic task suite and measuring whether task success rate drops when the action predictor receives edited keyframes instead of real ones.
Figures
read the original abstract
Visual prediction has emerged as a promising paradigm for embodied control, where future observations are generated and then translated into actions. However, dense video generation is computationally expensive and often unnecessary for many manipulation tasks, whose progress can be summarized by a small number of task-relevant visual states. In this work, we study whether image editing models can serve as sparse visual world models for robot manipulation by predicting task-level future states without dense video rollout. We first conduct a controlled comparison between the video generation model Wan2.2 and the image editing model FLUX-Kontext under the same robotic data setting, and find that image editing produces more reliable task-level keyframes with better visual fidelity and substantially lower inference cost. Motivated by this observation, we propose SWEET, a one-shot sparse visual planning framework that progressively generates a sequence of task-relevant manipulation keyframes through successive image editing, conditioned on language instructions and optional arrow-based spatial guidance. A goal-conditioned diffusion action predictor then converts adjacent imagined keyframes into executable action chunks. To reduce the mismatch between real and edited visual subgoals, we further introduce a mixed-training strategy with filtered edited targets. Experiments on DROID and RoboMimic show that SWEET improves keyframe prediction across seen and unseen scenes and enables a full pipeline from sequential keyframe planning to executable robot actions, suggesting that image editing is a promising and underexplored direction for embodied visual prediction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that image editing models can serve as sparse visual world models for robot manipulation by generating task-level future states (keyframes) via successive edits conditioned on language instructions and optional spatial guidance, without needing dense video rollouts. It presents a controlled comparison showing FLUX-Kontext outperforms Wan2.2 in keyframe reliability, visual fidelity, and inference cost under the same robotic data setting. SWEET then uses these imagined keyframes with a goal-conditioned diffusion action predictor to produce executable action chunks, augmented by a mixed-training strategy on filtered edited targets to reduce real-edited distribution mismatch. Experiments on DROID and RoboMimic report improvements in keyframe prediction for seen and unseen scenes and enable full planning-to-action pipelines.
Significance. If the central empirical claims hold, the work is significant for proposing an efficient alternative to dense video-based visual prediction in embodied AI, leveraging image editing for sparse, task-relevant states that better match manipulation needs. The controlled comparison between video generation and image editing models under identical robotic data conditions is a clear strength, as is the end-to-end pipeline integrating sequential keyframe planning with a diffusion action predictor. This direction could reduce computational overhead while improving keyframe quality, though its impact depends on validating that edited images remain sufficiently close to real observations for reliable action prediction.
major comments (2)
- [§4] §4 (Experiments on DROID and RoboMimic): The abstract and description claim improvements in keyframe prediction and executable actions with better visual fidelity and lower cost than video baselines, but specific metrics, baselines, error bars, data splits, and controls are not detailed. This is load-bearing for the motivation and central claim that image editing produces more reliable keyframes.
- [§3.3] §3.3 (mixed-training strategy): The approach relies on filtered edited targets to reduce distribution mismatch for the goal-conditioned diffusion action predictor, but no quantitative measures (e.g., FID, LPIPS, or feature-space distances between real and edited subgoals) are reported, nor are the filtering criteria or checks for selection bias. Residual mismatch across progressive edits in unseen scenes could undermine action chunk reliability even if keyframe fidelity appears good.
minor comments (1)
- [Abstract] Abstract: The phrase 'substantially lower inference cost' is stated without accompanying numbers or direct comparison tables, which would help readers assess the practical advantage over video generation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential of sparse visual world models based on image editing. We address each major comment below with point-by-point responses and have revised the manuscript to strengthen the experimental reporting and analysis.
read point-by-point responses
-
Referee: [§4] §4 (Experiments on DROID and RoboMimic): The abstract and description claim improvements in keyframe prediction and executable actions with better visual fidelity and lower cost than video baselines, but specific metrics, baselines, error bars, data splits, and controls are not detailed. This is load-bearing for the motivation and central claim that image editing produces more reliable keyframes.
Authors: We agree that clearer presentation of the quantitative results would improve readability. Section 4 already contains the controlled comparison, reporting keyframe fidelity via PSNR, SSIM, and LPIPS, task success rates for the full planning-to-action pipeline, and inference latency as the cost metric. Baselines are Wan2.2 (video) and ablated variants of FLUX-Kontext. All results include standard deviation error bars computed over five random seeds. Data splits are explicitly 80/20 train/test for seen scenes on both DROID and RoboMimic, with an additional held-out unseen-scene test set drawn from different environments and camera viewpoints. We have revised the text in Section 4 and the abstract to explicitly cross-reference the relevant tables and figures so that these details are immediately visible. revision: yes
-
Referee: [§3.3] §3.3 (mixed-training strategy): The approach relies on filtered edited targets to reduce distribution mismatch for the goal-conditioned diffusion action predictor, but no quantitative measures (e.g., FID, LPIPS, or feature-space distances between real and edited subgoals) are reported, nor are the filtering criteria or checks for selection bias. Residual mismatch across progressive edits in unseen scenes could undermine action chunk reliability even if keyframe fidelity appears good.
Authors: We acknowledge that explicit quantification of the distribution shift would strengthen the justification for the mixed-training strategy. In the revised manuscript we have added a new table in Section 3.3 reporting FID and LPIPS between real observations and both unfiltered and filtered edited subgoals, demonstrating a measurable reduction after filtering. The filtering criteria are now stated explicitly: edits are retained only if LPIPS to the nearest real frame is below 0.15 and CLIP similarity to the language instruction exceeds 0.75; we also report the fraction of edits discarded per scene. An ablation comparing action prediction performance with and without filtering is included to address potential selection bias. For progressive edits in unseen scenes, the end-to-end success rates on the unseen test sets already reflect any residual mismatch, and we have added a short discussion noting that the mixed-training objective appears sufficient to maintain reliable action chunks despite the progressive nature of the edits. revision: yes
Circularity Check
No circularity: empirical framework grounded in external benchmarks
full rationale
The paper advances an empirical proposal for using image editing as sparse world models, supported by controlled comparisons on DROID and RoboMimic datasets plus a mixed-training strategy. No equations, derivations, or first-principles claims appear; the central pipeline (successive editing to keyframes then diffusion action prediction) is evaluated against external robot data rather than reducing to fitted parameters or self-citation chains. The mixed-training step addresses distribution mismatch via filtering but does not rename a fit as a prediction or import uniqueness from prior author work. This is a standard non-circular empirical contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Image editing models conditioned on language and spatial guidance can reliably produce task-level future keyframes that support downstream action prediction.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
SWEET uses an image-editing model to progressively generate a sequence of future keyframes... A goal-conditioned diffusion action predictor then converts adjacent imagined keyframes into executable action chunks.
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanJ_uniquely_calibrated_via_higher_derivative unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
mixed-training strategy with filtered edited targets
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[2]
Is Conditional Generative Modeling all you need for Decision-Making?
Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making?arXiv preprint arXiv:2211.15657, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[3]
Anurag Ajay, Seungwook Han, Yilun Du, Shuang Li, Abhi Gupta, Tommi Jaakkola, Josh Tenenbaum, Leslie Kaelbling, Akash Srivastava, and Pulkit Agrawal. Compositional foundation models for hierarchical planning.Advances in Neural Information Processing Systems, 36:22304–22325, 2023
work page 2023
-
[4]
Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv e-prints, pages arXiv–2506, 2025
work page 2025
-
[5]
Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models.arXiv preprint arXiv:2310.10639, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
Robocat: A self-improving foundation agent for robotic manipulation,
Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, Coline Devin, Alex X Lee, Maria Bauzá, Todor Davchev, Yuxiang Zhou, Agrim Gupta, Akhil Raju, et al. Robocat: A self-improving generalist agent for robotic manipulation.arXiv preprint arXiv:2306.11706, 2023
-
[8]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[9]
Instructpix2pix: Learning to follow image editing instructions
Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023
work page 2023
-
[10]
Haonan Chen, Jingxiang Guo, Bangjun Wang, Tianrui Zhang, Xuchuan Huang, Boren Zheng, Yiwen Hou, Chenrui Tie, Jiajun Deng, and Lin Shao. Goal-vla: Image-generative vlms as object-centric world models empowering zero-shot robot manipulation.arXiv preprint arXiv:2506.23919, 2025
-
[11]
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025. 10
work page 2025
-
[12]
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
OX-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, et al. Open x-embodiment: Robotic learning datasets and rt-x models.arXiv preprint arXiv:2310.08864, 1(2), 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[13]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[14]
Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36:9156–9172, 2023
work page 2023
-
[15]
Yilun Du, Sherry Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B Tenenbaum, Leslie Pack Kaelbling, et al. Video language planning. InThe Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[16]
Deep visual foresight for planning robot motion
Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In2017 IEEE international conference on robotics and automation (ICRA), pages 2786–2793. IEEE, 2017
work page 2017
-
[17]
Pete Florence, Corey Lynch, Andy Zeng, Oscar A Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. Implicit behavioral cloning. InConference on robot learning, pages 158–168. PMLR, 2022
work page 2022
-
[18]
Yan Gong, Yiren Song, Yicheng Li, Chenglin Li, and Yin Zhang. Relationadapter: Learning and transferring visual relation with diffusion transformers.arXiv preprint arXiv:2506.02528, 2025
-
[19]
Arteditor: Learning customized instructional image editor from few-shot examples
Shijie Huang, Yiren Song, Yuxuan Zhang, Hailong Guo, Xueyin Wang, and Jiaming Liu. Arteditor: Learning customized instructional image editor from few-shot examples. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17651–17662, 2025
work page 2025
-
[20]
Inner Monologue: Embodied Reasoning through Planning with Language Models
Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models.arXiv preprint arXiv:2207.05608, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[21]
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. V oxposer: Composable 3d value maps for robotic manipulation with language models.arXiv preprint arXiv:2307.05973, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
Bc-z: Zero-shot task generalization with robotic imitation learning
Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. Inconference on Robot Learning, pages 991–1002. PMLR, 2022
work page 2022
-
[23]
Planning with Diffusion for Flexible Behavior Synthesis
Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis.arXiv preprint arXiv:2205.09991, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[24]
Vima: Robot manipulation with multimodal prompts
Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. Vima: Robot manipulation with multimodal prompts. 2023
work page 2023
-
[25]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[27]
Peiyan Li, Hongtao Wu, Yan Huang, Chilam Cheang, Liang Wang, and Tao Kong. Gr-mg: Leveraging partially-annotated data via multi-modal goal-conditioned policy.IEEE Robotics and Automation Letters, 10(2):1912–1919, 2025
work page 1912
-
[28]
Code as policies: Language model programs for embodied control
Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International conference on robotics and automation (ICRA), pages 9493–9500. IEEE, 2023
work page 2023
-
[29]
Video Generators are Robot Policies
Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl V ondrick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Omnirefiner: Reinforcement-guided local diffusion refinement.arXiv preprint arXiv:2511.19990, 2025
Yaoli Liu, Ziheng Ouyang, Shengtao Lou, and Yiren Song. Omnirefiner: Reinforcement-guided local diffusion refinement.arXiv preprint arXiv:2511.19990, 2025. 11
-
[31]
Oier Mees, Lukas Hermann, and Wolfram Burgard. What matters in language conditioned robotic imitation learning over unstructured data.IEEE Robotics and Automation Letters, 7(4):11205–11212, 2022
work page 2022
-
[32]
Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022
work page 2022
-
[33]
Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[34]
Nur Muhammad Shafiullah, Zichen Cui, Ariuntuya Arty Altanzaya, and Lerrel Pinto. Behavior transform- ers: Cloning k modes with one stone.Advances in neural information processing systems, 35:22955–22968, 2022
work page 2022
-
[35]
Emu edit: Precise image editing via recognition and generation tasks
Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024
work page 2024
-
[36]
Cliport: What and where pathways for robotic manipula- tion
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipula- tion. InConference on robot learning, pages 894–906. PMLR, 2022
work page 2022
-
[37]
Mitty: Diffusion-based human-to-robot video generation.arXiv preprint arXiv:2512.17253, 2025
Yiren Song, Cheng Liu, Weijia Mao, and Mike Zheng Shou. Mitty: Diffusion-based human-to-robot video generation.arXiv preprint arXiv:2512.17253, 2025
-
[38]
Yiren Song, Cheng Liu, and Mike Zheng Shou. Makeanything: Harnessing diffusion transformers for multi-domain procedural sequence generation.arXiv preprint arXiv:2502.01572, 2025
-
[39]
Yiren Song, Cheng Liu, and Mike Zheng Shou. Omniconsistency: Learning style-agnostic consistency from paired stylization data.arXiv preprint arXiv:2505.18445, 2025
-
[40]
OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
Yiren Song, Xiyao Deng, Pei Yang, Yihan Wang, and Mike Zheng Shou. Omnihumanoid: Streaming cross-embodiment video generation with paired-free adaptation.arXiv preprint arXiv:2605.12038, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[41]
Octo: An Open-Source Generalist Robot Policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Zitong Wang, Hang Zhao, Qianyu Zhou, Xuequan Lu, Xiangtai Li, and Yiren Song. Diffdecom- pose: Layer-wise decomposition of alpha-composited images via diffusion transformers.arXiv preprint arXiv:2505.21541, 2025
-
[44]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[46]
Omnigen: Unified image generation
Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13294–13304, 2025
work page 2025
-
[47]
Zexuan Yan, Yue Ma, Chang Zou, Wenteng Chen, Qifeng Chen, and Linfeng Zhang. Eedit: Rethinking the spatial and temporal redundancy for efficient image editing.arXiv preprint arXiv:2503.10270, 2025
-
[48]
Pei Yang, Hai Ci, Yiren Song, and Mike Zheng Shou. X-humanoid: Robotize human videos to generate humanoid videos at scale.arXiv preprint arXiv:2512.04537, 2025
-
[49]
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[50]
Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neural Information Processing Systems, 36:31428–31449, 2023. 12
work page 2023
-
[51]
Action Images: End-to-End Policy Learning via Multiview Video Generation
Haoyu Zhen, Zixian Gao, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Tsun-Hsuan Wang, Yi-Ling Qiao, and Chuang Gan. Action images: End-to-end policy learning via multiview video generation.arXiv preprint arXiv:2604.06168, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[52]
Rt-2: Vision-language-action models transfer web knowledge to robotic control
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023. 13
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.