Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
Pith reviewed 2026-05-08 18:24 UTC · model grok-4.3
The pith
A unified autoregressive-diffusion model uses a sparse Mixture-of-Experts Diffusion Transformer to match top video generation and editing quality while activating only a small fraction of its parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mamoda2.5 is a unified AR-Diffusion framework that integrates multimodal understanding and generation in one model. Equipping the Diffusion Transformer backbone with a fine-grained Mixture-of-Experts design of 128 experts and Top-8 routing produces a 25B-parameter model that activates only 3B parameters per token, cutting training costs while increasing capacity. This yields top-tier generation performance on VBench 2.0 and a new record in video editing quality, matching current leading proprietary systems. A joint few-step distillation and reinforcement learning framework then compresses the 30-step editing model into a 4-step version, delivering up to 95.9 times faster inference than open-source baselines.
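As a sanity check on the headline numbers: the 30-to-4 step reduction alone accounts for only part of the claimed speedup, so the remainder must come from per-step cost differences against the (unspecified) open-source baselines. A back-of-envelope decomposition, where attributing the residual to per-step cost is an illustrative assumption rather than a reported breakdown:

```python
# Back-of-envelope decomposition of the claimed inference speedup.
# Only the 30 -> 4 step counts and the 95.9x figure come from the paper;
# everything attributed to the "residual" below is an assumption.
claimed_speedup = 95.9            # vs. open-source baselines (paper's figure)
step_speedup = 30 / 4             # distillation alone: 7.5x on the same model
residual_factor = claimed_speedup / step_speedup  # must come from elsewhere
print(f"step reduction: {step_speedup:.1f}x, residual: {residual_factor:.2f}x")
```

The roughly 12.8x residual would have to be explained by cheaper per-step compute (e.g., sparse activation) and by the baselines' own step counts and per-step costs, which the abstract does not report.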
What carries the argument
The fine-grained Mixture-of-Experts design with 128 experts and Top-8 routing applied to the Diffusion Transformer backbone, which scales total capacity to 25B parameters while limiting active parameters to 3B during operation.
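The routing step this premise rests on can be sketched in a few lines. The specifics below (a plain linear gate, softmax normalized over only the selected experts) are standard sparse-MoE conventions from the literature, not confirmed details of Mamoda2.5:

```python
import math
import random

def top_k_route(logits, k=8):
    """Sparse MoE routing sketch: keep the k highest-scoring experts and
    softmax-normalize their gate weights; all other experts stay idle."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)                  # subtract max for stability
    exps = [math.exp(logits[i] - m) for i in top]
    z = sum(exps)
    return top, [e / z for e in exps]

random.seed(0)
gate_logits = [random.gauss(0.0, 1.0) for _ in range(128)]  # one score per expert
experts, weights = top_k_route(gate_logits)
# 8 of 128 experts are selected per token; their mixture weights sum to 1
```

Because only the selected experts' feed-forward blocks are evaluated, the per-token compute scales with the 8 active experts rather than all 128, which is what bounds the active parameter count.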
Load-bearing premise
The assumption that high scores on specific video benchmarks and internal advertising success rates demonstrate robust general performance across diverse multimodal tasks without detailed failure analysis or broader testing protocols.
What would settle it
An independent evaluation showing Mamoda2.5 falling short of reported video editing quality on additional public test sets or failing to maintain the claimed inference speedup on standard hardware.
Original abstract
We present Mamoda2.5, a unified AR-Diffusion framework that seamlessly integrates multimodal understanding and generation within a single architecture. To efficiently enhance the model's generation capability, we equip the Diffusion Transformer backbone with a fine-grained Mixture-of-Experts (MoE) design (128 experts, Top-8 routing), yielding a 25B-parameter model that activates only 3B parameters, significantly reducing training costs while scaling up the model capacity. Mamoda2.5 achieves top-tier generation performance on VBench 2.0 and sets a new record in video editing quality, surpassing evaluated open-source models and matching the performance of current top-tier proprietary models, including the Kling O1 on OpenVE-Bench. Furthermore, we introduce a joint few-step distillation and reinforcement learning framework that compresses the 30-step editing model into a 4-step model and greatly accelerates model inference. Compared to open-source baselines, Mamoda2.5 achieves up to $95.9\times$ faster video editing inference. In real-world applications, Mamoda2.5 has been successfully deployed for content moderation and creative restoration tasks in advertising scenarios, achieving a 98% success rate in internal advertising video editing scenario.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Mamoda2.5, a unified AR-Diffusion framework for multimodal understanding and generation. It equips a Diffusion Transformer backbone with a fine-grained MoE design (128 experts, Top-8 routing) to create a 25B-parameter model that activates only 3B parameters. The model is reported to achieve top-tier generation performance on VBench 2.0, set a new record in video editing quality on OpenVE-Bench (matching Kling O1 while surpassing open-source models), and deliver up to 95.9× faster video editing inference via a joint few-step distillation and reinforcement learning framework that reduces the model from 30 to 4 steps. It also claims a 98% success rate in internal advertising video editing scenarios.
Significance. If the empirical claims hold under rigorous evaluation, the work would demonstrate a practical route to scaling unified multimodal models via MoE in DiT architectures while controlling active parameters and inference cost. The combination of AR-Diffusion unification, expert routing, and distillation+RL for few-step editing could influence efficient deployment in generation and editing tasks. However, the absence of detailed protocols, baselines, ablations, and statistical reporting in the manuscript prevents assessment of whether the architectural choices drive the gains or if they stem from unstated data or training differences.
major comments (2)
- [Experimental Results] Experimental Results section: The central claims of top-tier VBench 2.0 performance, a new OpenVE-Bench record matching Kling O1, and 95.9× speedup lack any reported evaluation protocols, prompt sets, exact baseline model versions and configurations, hardware specifications for timing, error bars, or ablation studies isolating the DiT-MoE routing and the distillation+RL contributions. These omissions are load-bearing because the numerical superiority cannot be reproduced or attributed to the proposed techniques.
- [Model Description] Model Description section: The statement that the 128-expert Top-8 MoE model activates only 3B parameters out of 25B is presented without an equation, routing formula, or parameter-count breakdown showing how activation sparsity is achieved or how it reduces training costs relative to a dense counterpart.
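For context on what such a breakdown would look like: with Top-8 of 128 experts active, the active count is the always-on shared parameters plus 8/128 of the expert parameters. The shared-parameter figure below is a made-up illustration chosen so the totals line up with the paper's 25B/3B headline, not a reported split:

```python
# Illustrative MoE parameter accounting. Only the 25B total, ~3B active,
# and 8-of-128 routing come from the paper; the shared/expert split is
# an assumption for the arithmetic.
num_experts, top_k = 128, 8
total_params = 25e9
shared = 1.6e9                                       # attention, embeddings, norms
per_expert = (total_params - shared) / num_experts   # expert FFN parameters
active = shared + top_k * per_expert                 # only Top-8 experts run per token
print(f"active parameters: {active / 1e9:.2f}B of {total_params / 1e9:.0f}B")
```

Under these assumed numbers the active count comes to about 3.06B, i.e., roughly 12% of the total, which is the kind of derivation the referee is asking the authors to state explicitly.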
minor comments (1)
- [Introduction] The abstract and introduction refer to an 'AR-Diffusion framework' but provide no diagram or equations clarifying how the autoregressive and diffusion components are integrated within the single architecture.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving reproducibility and clarity, which we will address in the revised manuscript.
Point-by-point responses
-
Referee: [Experimental Results] Experimental Results section: The central claims of top-tier VBench 2.0 performance, a new OpenVE-Bench record matching Kling O1, and 95.9× speedup lack any reported evaluation protocols, prompt sets, exact baseline model versions and configurations, hardware specifications for timing, error bars, or ablation studies isolating the DiT-MoE routing and the distillation+RL contributions. These omissions are load-bearing because the numerical superiority cannot be reproduced or attributed to the proposed techniques.
Authors: We agree that the current manuscript lacks sufficient detail on evaluation protocols, which limits reproducibility. In the revised version, we will add a dedicated subsection on evaluation protocols that specifies the prompt sets used for VBench 2.0 and OpenVE-Bench, the exact versions and configurations of all baseline models (including Kling O1 where applicable), hardware specifications for all reported timing measurements, and statistical reporting with error bars from multiple runs where available. We will also include ablation studies that isolate the contributions of the DiT-MoE routing and the joint distillation+RL framework to the observed performance gains and speedup. These additions will strengthen attribution of results to the proposed methods. revision: yes
-
Referee: [Model Description] Model Description section: The statement that the 128-expert Top-8 MoE model activates only 3B parameters out of 25B is presented without an equation, routing formula, or parameter-count breakdown showing how activation sparsity is achieved or how it reduces training costs relative to a dense counterpart.
Authors: We acknowledge that the parameter activation and sparsity details are presented too informally. In the revision, we will insert the explicit routing formula for the Top-8 selection among 128 experts, along with a parameter-count breakdown (via equation or table) that derives the 3B active parameters from the total 25B and quantifies the resulting training cost savings relative to a dense equivalent. This will make the efficiency claims fully transparent and verifiable. revision: yes
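For reference, the standard sparsely-gated form such a routing formula usually takes (a sketch following the MoE literature, not necessarily Mamoda2.5's exact gating):

```latex
g(x) = W_g x \in \mathbb{R}^{128}, \qquad
y \;=\; \sum_{i \in \mathcal{T}} \frac{\exp\!\big(g_i(x)\big)}{\sum_{j \in \mathcal{T}} \exp\!\big(g_j(x)\big)}\, E_i(x),
\qquad \mathcal{T} = \operatorname{TopK}\big(g(x),\, 8\big),
```

where $E_i$ is the $i$-th expert FFN and $W_g$ the learned gating projection. Since only the 8 experts in $\mathcal{T}$ are evaluated, the active parameter count is bounded by the shared parameters plus eight experts' worth, which is the quantity the promised breakdown would derive.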
Circularity Check
No circularity: purely empirical model and benchmark claims with no derivations
Full rationale
The paper describes an architectural choice (DiT-MoE with 128 experts, Top-8 routing, joint distillation+RL for few-step inference) and reports empirical results on VBench 2.0, OpenVE-Bench, and internal advertising tasks. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citations appear in the provided text. Central claims rest on benchmark numbers rather than any chain that reduces to its own inputs by construction. This is the expected non-finding for an applied systems paper whose validity hinges on external reproducibility, not internal tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard assumptions of diffusion-model training and MoE expert-routing stability
Reference graph
Works this paper leans on
-
[1]
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation, 2025. URL https://arxiv.org/abs/2408.12528
-
[2]
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, and Ran Xu. BLIP3-o: A family of fully open unified multimodal models-architecture, training and dataset, 2025. URL https://arxiv.org/abs/2505.09568
-
[3]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv e-prints, arXiv:2412.03603, December 2024. doi: 10.48550/arXiv.2412.03603
-
[4]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025
-
[5]
Scalable Diffusion Models with Transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748
-
[6]
Aquarius: A Family of Industry-Level Video Generation Models for Marketing Scenarios
Huafeng Shi, Jianzhong Liang, Rongchang Xie, Xian Wu, Cheng Chen, and Chang Liu. Aquarius: A family of industry-level video generation models for marketing scenarios. arXiv preprint arXiv:2505.10584, 2025
-
[7]
Sora: Creating Video from Text
OpenAI. Sora: Creating video from text. https://openai.com/index/sora/, 2024. Accessed: 2025
-
[8]
ViMoE: An Empirical Study of Designing Vision Mixture-of-Experts
Xumeng Han, Longhui Wei, Zhiyang Dou, Zipeng Wang, Chenhui Qiang, Xin He, Yingfei Sun, Zhenjun Han, and Qi Tian. ViMoE: An empirical study of designing vision mixture-of-experts, 2024. URL https://arxiv.org/abs/2410.15732
-
[9]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv e-prints, arXiv:1701.06538, January 2017. doi: 10.48550/arXiv.1701.06538
-
[10]
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv e-prints, arXiv:2101.03961, January 2021. doi: 10.48550/arXiv.2101.03961
-
[11]
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Damai Dai, Chengqi Deng, Chenggang Zhao, R.X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y.K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models, 2024. URL https://arxiv.org/abs/2401.06066
-
[12]
Scaling Diffusion Transformers to 16 Billion Parameters
Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, and Junshi Huang. Scaling diffusion transformers to 16 billion parameters, 2024. URL https://arxiv.org/abs/2407.11633
-
[13]
Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts
Yike Yuan, Ziyu Wang, Zihao Huang, Defa Zhu, Xun Zhou, Jingyi Yu, and Qiyang Min. Expert race: A flexible routing strategy for scaling diffusion transformer with mixture of experts, 2025. URL https://arxiv.org/abs/2503.16057
-
[14]
DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers
Minglei Shi, Ziyang Yuan, Haotian Yang, Xintao Wang, Mingwu Zheng, Xin Tao, Wenliang Zhao, Wenzhao Zheng, Jie Zhou, Jiwen Lu, Pengfei Wan, Di Zhang, and Kun Gai. DiffMoE: Dynamic token selection for scalable diffusion transformers, 2025. URL https://arxiv.org/abs/2503.14487
-
[15]
InstructPix2Pix: Learning to Follow Image Editing Instructions
Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions, 2023
-
[16]
Step1X-Edit: A Practical Framework for General Image Editing
Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, and Daxin Jiang. Step1X-Edit: A practical framework for general image editing. arXiv preprint, 2025
-
[17]
VINO: A Unified Visual Generator with Interleaved Omnimodal Context
Junyi Chen, Tong He, Zhoujie Fu, Pengfei Wan, Kun Gai, and Weicai Ye. VINO: A unified visual generator with interleaved omnimodal context. arXiv preprint arXiv:2601.02358, 2026
-
[18]
Omni-Video: Democratizing Unified Video Understanding and Generation
Zhiyu Tan, Hao Yang, Luozheng Qin, Jia Gong, Mengping Yang, and Hao Li. Omni-Video: Democratizing unified video understanding and generation. arXiv preprint arXiv:2507.06119, 2025
-
[19]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023
-
[20]
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
Ming Tao et al. Mogao: An omni foundation model for interleaved multi-modal generation. arXiv preprint arXiv:2505.05472, 2025
-
[21]
HunyuanVideo 1.5 Technical Report
Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jiangfeng Xiong, Jie Jiang, Penghao Zhao, Qi Tian, Songtao Liu, Weijie Kong, Weiyan Wang, Xuefei Zhe, Yanxin Long, Yuanbo Peng, Zuozhuo Dai, et al. HunyuanVideo 1.5 technical report, 2025. URL https://arxiv.org/abs/2511.18870
-
[22]
Longcat-video Technical Report
Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, and Tong Zhang. Longcat-video technical report, 2025. URL https://arxiv.org/abs/2510.22200
-
[23]
Kling: A High-Quality Video Generation Model
Kuaishou Technology. Kling: A high-quality video generation model. https://klingai.com, 2024. Accessed: 2025
-
[24]
Mammothmoda2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation
Tao Shen, Xin Wan, Taicai Chen, Rui Zhang, Junwen Pan, Dawei Lu, Fanding Lei, Zhilin Lu, Yunfei Yang, Chen Cheng, et al. Mammothmoda2: A unified ar-diffusion framework for multimodal understanding and generation. arXiv preprint arXiv:2511.18262, 2025
-
[25]
Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe
Yahui Liu, Yang Yue, Jingyuan Zhang, Chenxi Sun, Yang Zhou, Wencong Zeng, Ruiming Tang, and Guorui Zhou. Efficient training of diffusion mixture-of-experts models: A practical recipe. arXiv preprint arXiv:2512.01252, 2025
-
[26]
HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer
Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, Yimeng Wang, Kai Yu, Wenxuan Chen, Ziwei Feng, Zijian Gong, Jianzhuang Pan, Yi Peng, Rui Tian, Siyu Wang, Bo Zhao, Ting Yao, and Tao Mei. HiDream-I1: A high-efficient image generative foundation model with sparse diffusion transformer. arXiv preprint
-
[27]
DeepSeek-V3 Technical Report
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024
-
[29]
Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts
Huy Nguyen, Pedram Akbarian, Nhat Ho, and Alessandro Rinaldo. Sigmoid gating is more sample efficient than softmax gating in mixture of experts. In Advances in Neural Information Processing Systems (NeurIPS), 2024
-
[30]
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts. arXiv e-prints, arXiv:2408.15664, August 2024. doi: 10.48550/arXiv.2408.15664
-
[31]
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints. In International Conference on Learning Representations (ICLR), 2023
-
[32]
OmniGen: Unified Image Generation
Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
-
[33]
ByT5: Towards a Token-Free Future with Pre-Trained Byte-to-Byte Models
Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics (TACL), volume 10, pages 291–306, 2022
-
[34]
Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
Yi Zhang et al. Bee: A high-quality corpus and full-stack suite to unlock advanced fully open MLLMs. arXiv preprint arXiv:2510.13795, 2025
-
[35]
Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, et al. Scaling instruction-based video editing with a high-quality synthetic dataset. arXiv preprint arXiv:2510.15742, 2025
-
[36]
Region-Constraint In-Context Generation for Instructional Video Editing
Zhongwei Zhang, Fuchen Long, Wei Li, Zhaofan Qiu, Wu Liu, Ting Yao, and Tao Mei. Region-constraint in-context generation for instructional video editing. arXiv preprint arXiv:2512.17650, 2025
-
[37]
OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing
Haoyang He, Jie Wang, Jiangning Zhang, Zhucun Xue, Xingyuan Bu, Qiangpeng Yang, Shilei Wen, and Lei Xie. OpenVE-3M: A large-scale high-quality dataset for instruction-guided video editing. arXiv preprint arXiv:2512.07826, 2025
-
[38]
VACE: All-in-One Video Creation and Editing
Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. VACE: All-in-one video creation and editing. arXiv preprint arXiv:2503.07598, 2025
-
[39]
One-Step Diffusion with Distribution Matching Distillation
Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. arXiv preprint arXiv:2311.18828, 2023
-
[40]
Distribution Matching Distillation Meets Reinforcement Learning
Dengyang Jiang et al. Distribution matching distillation meets reinforcement learning. arXiv preprint arXiv:2511.13649, 2025
-
[41]
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. DiffusionNFT: Online diffusion reinforcement with forward process. arXiv preprint arXiv:2509.16117, 2025
-
[42]
Getting Started with Fully Sharded Data Parallel (FSDP2)
W. Feng, W. Constable, and Y. Mao. Getting started with fully sharded data parallel (FSDP2). PyTorch Official Tutorials, March 2022
-
[43]
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. arXiv e-prints, arXiv:2201.05596, January 2022. doi: 10.48550/arXiv.2201.05596
-
[44]
USP: A Unified Sequence Parallelism Approach for Long Context Generative AI
Jiarui Fang and Shangchun Zhao. USP: A unified sequence parallelism approach for long context generative AI. arXiv e-prints, arXiv:2405.07719, May 2024. doi: 10.48550/arXiv.2405.07719
-
[45]
USP: A Unified Sequence Parallelism Approach for Long Context Generative AI
Jiarui Fang and Shangchun Zhao. USP: A unified sequence parallelism approach for long context generative AI, 2024. URL https://arxiv.org/abs/2405.07719
-
[46]
VBench: Comprehensive Benchmark Suite for Video Generative Models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
-
[47]
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, and Ziwei Liu. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025
-
[48]
Vidu: A Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models
Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: A highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233, 2024
-
[49]
Seedance 1.0: Exploring the Boundaries of Video Generation Models
Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv e-prints, arXiv:2506.09113, June 2025. doi: 10.48550/arXiv.2506.09113
-
[50]
Veo 3 Technical Report
Google DeepMind. Veo 3 technical report. https://deepmind.google/technologies/veo, 2025. Accessed: 2025
-
[51]
Five-Bench: A Fine-Grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models
Minghan Li, Chenxi Xie, Yichen Wu, Lei Zhang, and Mengyu Wang. Five-bench: A fine-grained video editing benchmark for evaluating emerging diffusion and rectified flow models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16672–16681, 2025
-
[52]
Insvie-1M: Effective Instruction-Based Video Editing with Elaborate Dataset Construction
Yuhui Wu, Liyi Chen, Ruibin Li, Shihao Wang, Chenxi Xie, and Lei Zhang. Insvie-1m: Effective instruction-based video editing with elaborate dataset construction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16692–16701, 2025
-
[53]
Lucy Edit: High-Fidelity Text-Guided Video Editing
Decart AI. Lucy edit: High-fidelity text-guided video editing. https://huggingface.co/decart-ai/Lucy-Edit-Dev, 2025
-
[54]
In-Context Learning with Unpaired Clips for Instruction-Based Video Editing
Xinyao Liao, Xianfang Zeng, Ziye Song, Zhoujie Fu, Gang Yu, and Guosheng Lin. In-context learning with unpaired clips for instruction-based video editing. arXiv preprint arXiv:2510.14648, 2025
-
[55]
Pixverse-R1: Next-Generation Real-Time World Model
PixVerse Team. Pixverse-r1: Next-generation real-time world model. https://pixverse.ai, 2026
-
[56]
Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance
Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, and Mike Zheng Shou. Kiwi-edit: Versatile video editing via instruction and reference guidance, 2026. URL https://arxiv.org/abs/2603.02175
-
[57]
Univideo: Unified Understanding, Generation, and Editing for Videos
Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, and Wenhu Chen. Univideo: Unified understanding, generation, and editing for videos. arXiv preprint arXiv:2510.08377, 2025
-
[58]
Omniweaving: Towards Unified Video Generation with Free-Form Composition and Reasoning
Kaihang Pan, Qi Tian, Jianwei Zhang, Weijie Kong, Jiangfeng Xiong, Yanxin Long, Shixue Zhang, Haiyi Qiu, Tan Wang, Zheqi Lv, et al. Omniweaving: Towards unified video generation with free-form composition and reasoning. arXiv preprint arXiv:2603.24458, 2026
-
[59]
TokenFlow: Consistent Diffusion Features for Consistent Video Editing
Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. TokenFlow: Consistent diffusion features for consistent video editing. In International Conference on Learning Representations (ICLR), 2024
-
[60]
Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer
Danah Yatim, Rafail Fridman, Omer Bar-Tal, Yoni Kasten, and Tali Dekel. Space-time diffusion features for zero-shot text-driven motion transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
-
[61]
VidToMe: Video Token Merging for Zero-Shot Video Editing
Xirui Li, Chao Ma, Xiaokang Yang, and Ming-Hsuan Yang. VidToMe: Video token merging for zero-shot video editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
-
[62]
AnyV2V: A Plug-and-Play Framework for Any Video-to-Video Editing Tasks
Max Ku, Cong Wei, Weiming Ren, Huan Yang, and Wenhu Chen. AnyV2V: A plug-and-play framework for any video-to-video editing tasks. Transactions on Machine Learning Research (TMLR), 2024
-
[63]
VideoGrain: Modulating Space-Time Attention for Multi-Grained Video Editing
Yuxuan Xu et al. VideoGrain: Modulating space-time attention for multi-grained video editing. In International Conference on Learning Representations (ICLR), 2025
-
[64]
Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing
Junke Wang et al. Omni-Video 2: Scaling MLLM-conditioned diffusion for unified video generation and editing. arXiv preprint arXiv:2602.08820, 2025
-
[65]
Context Unrolling in Omni Models
Ceyuan Yang, Zhijie Lin, Yang Zhao, Fei Xiao, Hao He, Qi Zhao, Chaorui Deng, Kunchang Li, Zihan Ding, Yuwei Guo, Fuyun Wang, Fangqi Zhu, Xiaonan Nie, Shenhan Zhu, Shanchuan Lin, Hongsheng Li, Weilin Huang, Guang Shi, and Haoqi Fan. Context unrolling in omni models. arXiv preprint arXiv:2604.21921, 2026
-
[66]
Geneval: An Object-Focused Framework for Evaluating Text-to-Image Alignment
Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36, 2024
-
[67]
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. SANA: Efficient high-resolution image synthesis with linear diffusion transformers. arXiv preprint arXiv:2410.10629, 2024
-
[68]
FLUX.1: A Text-to-Image Generation Model
Black Forest Labs. FLUX.1: A text-to-image generation model. https://blackforestlabs.ai, 2024. Accessed: 2025
-
[69]
Improving Image Generation with Better Captions
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science, https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023
-
[70]
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Robin Rombach, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024
-
[71]
HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer
Feng Chen et al. HiDream-I1: A high-efficient image generative foundation model with sparse diffusion transformer. arXiv preprint arXiv:2505.22705, 2025
-
[72]
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-Pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025
-
[73]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024
-
[74]
Seedream 3.0 Technical Report
Yuying Guo et al. Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346, 2025
-
[75]
ImgEdit: A Unified Image Editing Dataset and Benchmark
Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. ImgEdit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275, 2025
-
[76]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
-
[77]
GPT-4o System Card
OpenAI. GPT-4o system card, 2024. URL https://arxiv.org/abs/2410.21276
-
[78]
Seedream 4.0: Toward Next-Generation Multimodal Image Generation
Yu Gao et al. Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427, 2025
-
[79]
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Bin Lin, Yuwei Niu, et al. UniWorld-V1: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147, 2025
-
[80]
Omnigen2: Exploration to Advanced Multimodal Generation
Junjie Zhou, Shitao Xiao, Yueze Wang, Tiejun Huang, and Zheng Liu. Omnigen2: Exploration to advanced multimodal generation, 2025
-
[81]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025
discussion (0)