pith. machine review for the scientific record.

arxiv: 2605.02641 · v1 · submitted 2026-05-04 · 💻 cs.CV

Recognition: 2 theorem links

Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

Authors on Pith no claims yet

Pith reviewed 2026-05-08 18:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords unified multimodal model · video generation · video editing · mixture of experts · diffusion transformer · few-step distillation · autoregressive diffusion · sparse activation

The pith

A unified autoregressive-diffusion model uses a sparse Mixture-of-Experts Diffusion Transformer to match top video generation and editing quality while activating only a small fraction of its parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Mamoda2.5 as a single architecture that performs both multimodal understanding and generation by combining autoregressive and diffusion processes. It upgrades the Diffusion Transformer backbone with a fine-grained Mixture-of-Experts structure of 128 experts with Top-8 routing, yielding a 25-billion-parameter model that activates only 3 billion parameters at runtime. This configuration produces strong video generation results on standard benchmarks and a new high mark in video editing quality. The authors add a joint few-step distillation and reinforcement learning stage that shrinks the editing process from 30 steps to 4. The outcome is up to 95.9 times faster editing inference than open-source baselines, plus deployment for content moderation and creative restoration in advertising scenarios, where the authors report a 98% internal success rate.

Core claim

Mamoda2.5 is a unified AR-Diffusion framework that integrates multimodal understanding and generation in one model. Equipping the Diffusion Transformer backbone with a fine-grained Mixture-of-Experts design of 128 experts and Top-8 routing produces a 25B-parameter model that activates only 3B parameters, cutting training costs while increasing capacity. This yields top-tier generation performance on VBench 2.0 and sets a new record in video editing quality that matches current leading proprietary systems. A joint few-step distillation and reinforcement learning framework then compresses the 30-step editing model into a 4-step version, delivering up to 95.9 times faster inference than open-source baselines.
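
The abstract names the compression technique (joint few-step distillation and reinforcement learning) but gives no recipe. As a rough illustration of the general idea only, the sketch below distills a many-step sampler into a 4-step one by matching rollout endpoints from shared noise; every name and dimension (TinyDenoiser, the toy update rule, the MSE objective) is an assumption made for illustration, not the authors' method, which additionally uses RL and is applied to video editing.

```python
# Illustrative few-step distillation sketch (NOT the Mamoda2.5 recipe):
# a 4-step student is trained to reproduce the endpoint of a frozen
# 30-step teacher rollout started from the same initial noise.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Toy stand-in for a DiT backbone: predicts the clean sample from (x_t, t)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t, t):
        t_feat = t.expand(x_t.shape[0], 1)                  # broadcast timestep to the batch
        return self.net(torch.cat([x_t, t_feat], dim=-1))

def rollout(model, x_T, num_steps):
    """Naive iterative refinement toward the model's clean-sample prediction."""
    x = x_T
    for i in reversed(range(1, num_steps + 1)):
        t = torch.tensor([[i / num_steps]])
        x = x + (model(x, t) - x) / i                       # toy DDIM-like update
    return x

teacher = TinyDenoiser()                                    # pretend: pretrained 30-step model
student = TinyDenoiser()                                    # 4-step model being distilled
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

for _ in range(100):
    x_T = torch.randn(8, 64)                                # shared initial noise
    with torch.no_grad():
        target = rollout(teacher, x_T, num_steps=30)        # expensive teacher rollout
    pred = rollout(student, x_T, num_steps=4)               # cheap student rollout
    loss = ((pred - target) ** 2).mean()                    # endpoint matching, for brevity
    opt.zero_grad()
    loss.backward()
    opt.step()
```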

What carries the argument

The fine-grained Mixture-of-Experts design with 128 experts and Top-8 routing applied to the Diffusion Transformer backbone, which scales total capacity to 25B parameters while limiting active parameters to 3B during operation.
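
To make the load-bearing mechanism concrete, here is a minimal sketch of fine-grained Top-k expert routing of the kind described (128 experts, Top-8): a router scores each token, only the top 8 experts run on it, and their outputs are combined with renormalized gate weights. Layer widths, the softmax gate, and the per-expert loop are illustrative assumptions, not the paper's implementation.

```python
# Minimal Top-8-of-128 MoE feed-forward layer (illustrative dimensions only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_expert=1024, n_experts=128, k=8):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                   # x: (tokens, d_model)
        gate_logits = self.router(x)                        # (tokens, n_experts)
        top_vals, top_idx = gate_logits.topk(self.k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)                 # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # Per-expert loop for clarity; real systems use batched expert kernels.
        for slot in range(self.k):
            for e in top_idx[:, slot].unique().tolist():
                sel = top_idx[:, slot] == e
                out[sel] += gates[sel, slot].unsqueeze(-1) * self.experts[e](x[sel])
        return out

layer = TopKMoE()
tokens = torch.randn(16, 512)
print(layer(tokens).shape)   # torch.Size([16, 512]); only 8 of 128 experts fire per token
```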

Load-bearing premise

The assumption that high scores on specific video benchmarks and internal advertising success rates demonstrate robust general performance across diverse multimodal tasks without detailed failure analysis or broader testing protocols.

What would settle it

An independent evaluation showing Mamoda2.5 falling short of reported video editing quality on additional public test sets or failing to maintain the claimed inference speedup on standard hardware.
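
A concrete version of the speed test is easy to specify. The sketch below shows the shape of timing protocol an independent check would need: fixed hardware, warm-up runs, repeated wall-clock measurements, and a mean-and-spread report for both systems before forming the speedup ratio. The two callables are placeholders standing in for the released models, not the actual pipelines.

```python
# Hedged sketch of a speedup-verification protocol: repeated timings with
# warm-up, reporting mean ± std for baseline and distilled models.
import statistics
import time

def time_fn(fn, warmup=2, repeats=10):
    for _ in range(warmup):                  # discard warm-up runs (caching, compilation)
        fn()
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples), statistics.stdev(samples)

def baseline_edit():                         # placeholder for a 30-step open baseline
    time.sleep(0.030)

def distilled_edit():                        # placeholder for the 4-step distilled model
    time.sleep(0.004)

base_mean, base_std = time_fn(baseline_edit)
fast_mean, fast_std = time_fn(distilled_edit)
print(f"baseline:  {base_mean*1e3:.1f} ± {base_std*1e3:.1f} ms")
print(f"distilled: {fast_mean*1e3:.1f} ± {fast_std*1e3:.1f} ms")
print(f"speedup:   {base_mean / fast_mean:.1f}x")
```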

read the original abstract

We present Mamoda2.5, a unified AR-Diffusion framework that seamlessly integrates multimodal understanding and generation within a single architecture. To efficiently enhance the model's generation capability, we equip the Diffusion Transformer backbone with a fine-grained Mixture-of-Experts (MoE) design (128 experts, Top-8 routing), yielding a 25B-parameter model that activates only 3B parameters, significantly reducing training costs while scaling up the model capacity. Mamoda2.5 achieves top-tier generation performance on VBench 2.0 and sets a new record in video editing quality, surpassing evaluated open-source models and matching the performance of current top-tier proprietary models, including the Kling O1 on OpenVE-Bench. Furthermore, we introduce a joint few-step distillation and reinforcement learning framework that compresses the 30-step editing model into a 4-step model and greatly accelerates model inference. Compared to open-source baselines, Mamoda2.5 achieves up to $95.9\times$ faster video editing inference. In real-world applications, Mamoda2.5 has been successfully deployed for content moderation and creative restoration tasks in advertising scenarios, achieving a 98% success rate in internal advertising video editing scenario.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Mamoda2.5, a unified AR-Diffusion framework for multimodal understanding and generation. It equips a Diffusion Transformer backbone with a fine-grained MoE design (128 experts, Top-8 routing) to create a 25B-parameter model that activates only 3B parameters. The model is reported to achieve top-tier generation performance on VBench 2.0, set a new record in video editing quality on OpenVE-Bench (matching Kling O1 while surpassing open-source models), and deliver up to 95.9× faster video editing inference via a joint few-step distillation and reinforcement learning framework that reduces the model from 30 to 4 steps. It also claims a 98% success rate in internal advertising video editing scenarios.

Significance. If the empirical claims hold under rigorous evaluation, the work would demonstrate a practical route to scaling unified multimodal models via MoE in DiT architectures while controlling active parameters and inference cost. The combination of AR-Diffusion unification, expert routing, and distillation+RL for few-step editing could influence efficient deployment in generation and editing tasks. However, the absence of detailed protocols, baselines, ablations, and statistical reporting in the manuscript prevents assessment of whether the architectural choices drive the gains or if they stem from unstated data or training differences.

major comments (2)
  1. [Experimental Results] Experimental Results section: The central claims of top-tier VBench 2.0 performance, a new OpenVE-Bench record matching Kling O1, and 95.9× speedup lack any reported evaluation protocols, prompt sets, exact baseline model versions and configurations, hardware specifications for timing, error bars, or ablation studies isolating the DiT-MoE routing and the distillation+RL contributions. These omissions are load-bearing because the numerical superiority cannot be reproduced or attributed to the proposed techniques.
  2. [Model Description] Model Description section: The statement that the 128-expert Top-8 MoE model activates only 3B parameters out of 25B is presented without an equation, routing formula, or parameter-count breakdown showing how activation sparsity is achieved or how it reduces training costs relative to a dense counterpart.
minor comments (1)
  1. [Introduction] The abstract and introduction refer to an 'AR-Diffusion framework' but provide no diagram or equations clarifying how the autoregressive and diffusion components are integrated within the single architecture.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving reproducibility and clarity, which we will address in the revised manuscript.

read point-by-point responses
  1. Referee: [Experimental Results] Experimental Results section: The central claims of top-tier VBench 2.0 performance, a new OpenVE-Bench record matching Kling O1, and 95.9× speedup lack any reported evaluation protocols, prompt sets, exact baseline model versions and configurations, hardware specifications for timing, error bars, or ablation studies isolating the DiT-MoE routing and the distillation+RL contributions. These omissions are load-bearing because the numerical superiority cannot be reproduced or attributed to the proposed techniques.

    Authors: We agree that the current manuscript lacks sufficient detail on evaluation protocols, which limits reproducibility. In the revised version, we will add a dedicated subsection on evaluation protocols that specifies the prompt sets used for VBench 2.0 and OpenVE-Bench, the exact versions and configurations of all baseline models (including Kling O1 where applicable), hardware specifications for all reported timing measurements, and statistical reporting with error bars from multiple runs where available. We will also include ablation studies that isolate the contributions of the DiT-MoE routing and the joint distillation+RL framework to the observed performance gains and speedup. These additions will strengthen attribution of results to the proposed methods. revision: yes

  2. Referee: [Model Description] Model Description section: The statement that the 128-expert Top-8 MoE model activates only 3B parameters out of 25B is presented without an equation, routing formula, or parameter-count breakdown showing how activation sparsity is achieved or how it reduces training costs relative to a dense counterpart.

    Authors: We acknowledge that the parameter activation and sparsity details are presented too informally. In the revision, we will insert the explicit routing formula for the Top-8 selection among 128 experts, along with a parameter-count breakdown (via equation or table) that derives the 3B active parameters from the total 25B and quantifies the resulting training cost savings relative to a dense equivalent. This will make the efficiency claims fully transparent and verifiable. revision: yes
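
As a back-of-the-envelope illustration of the accounting that response promises, the sketch below splits the reported 25B parameters into always-active shared weights and expert weights of which only Top-8 of 128 (1/16) are used per token, then solves for the split consistent with 3B active parameters. Only the 25B, 3B, 128, and 8 figures come from the abstract; the shared/expert split is derived here purely for illustration.

```python
# Back-of-the-envelope active-parameter accounting under assumed structure:
#   active = shared + (k / n_experts) * expert_total
total_params  = 25e9          # reported total
target_active = 3e9           # reported active
n_experts, k  = 128, 8        # fine-grained MoE with Top-8 routing

frac = k / n_experts                                   # 1/16 of expert weights active per token
shared = (target_active - frac * total_params) / (1 - frac)
expert_total = total_params - shared

print(f"shared (always active): {shared/1e9:.2f}B")    # ≈ 1.53B under these assumptions
print(f"expert weights (total): {expert_total/1e9:.2f}B")  # ≈ 23.47B
print(f"check active:           {(shared + frac*expert_total)/1e9:.2f}B")  # ≈ 3.00B
```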

Circularity Check

0 steps flagged

No circularity: purely empirical model and benchmark claims with no derivations

full rationale

The paper describes an architectural choice (DiT-MoE with 128 experts, Top-8 routing, joint distillation+RL for few-step inference) and reports empirical results on VBench 2.0, OpenVE-Bench, and internal advertising tasks. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citations appear in the provided text. Central claims rest on benchmark numbers rather than any chain that reduces to its own inputs by construction. This is the expected non-finding for an applied systems paper whose validity hinges on external reproducibility, not internal tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard machine-learning training assumptions and benchmark evaluations; no new theoretical entities or fitted parameters beyond the model size and routing choice are introduced in the abstract.

axioms (1)
  • domain assumption Standard assumptions of diffusion model training and MoE expert routing stability
    Implicit in all large-scale generative model papers but not stated or justified here.

pith-pipeline@v0.9.0 · 5570 in / 1248 out tokens · 56992 ms · 2026-05-08T18:24:03.156004+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

85 extracted references · 59 canonical work pages · 23 internal anchors

  1. [1]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation, 2025. URLhttps://arxiv.org/abs/2408.12528

  2. [2]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, and Ran Xu. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset, 2025. URLhttps://arxiv.org/abs/2505.09568

  3. [3]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, et al. HunyuanVideo: A Systematic Framework For Large Video Generative Models.arXiv e-prints, art. arXiv:2412.03603, December 2024. doi: 10.48550/arXiv.2412.03603

  4. [4]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  5. [5]

    Scalable Diffusion Models with Transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748

  6. [6]

    Aquarius: A family of industry-level video generation models for marketing scenarios.arXiv preprint arXiv:2505.10584, 2025

    Huafeng Shi, Jianzhong Liang, Rongchang Xie, Xian Wu, Cheng Chen, and Chang Liu. Aquarius: A family of industry-level video generation models for marketing scenarios.arXiv preprint arXiv:2505.10584, 2025

  7. [7]

    Sora: Creating video from text.https://openai.com/index/sora/, 2024

    OpenAI. Sora: Creating video from text.https://openai.com/index/sora/, 2024. Accessed: 2025

  8. [8]

    Vimoe: An empirical study of designing vision mixture-of-experts.arXiv preprint arXiv:2410.15732, 2024

Xumeng Han, Longhui Wei, Zhiyang Dou, Zipeng Wang, Chenhui Qiang, Xin He, Yingfei Sun, Zhenjun Han, and Qi Tian. Vimoe: An empirical study of designing vision mixture-of-experts, 2024. URL https://arxiv.org/abs/2410.15732

  9. [9]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv e-prints, art. arXiv:1701.06538, January 2017. doi: 10.48550/arXiv.1701.06538

  10. [10]

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.arXiv e-prints, art. arXiv:2101.03961, January 2021. doi: 10.48550/arXiv.2101.03961

  11. [11]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Damai Dai, Chengqi Deng, Chenggang Zhao, R.X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y.K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models, 2024. URL https://arxiv.org/abs/2401.06066

  12. [12]

    Scaling diffusion transformers to 16 billion parameters.arXiv preprint arXiv:2407.11633, 2024

    Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, and Junshi Huang. Scaling diffusion transformers to 16 billion parameters, 2024. URLhttps://arxiv.org/abs/2407.11633

  13. [13]

    Expert race: A flexible routing strategy for scaling diffusion transformer with mixture of experts, 2025

Yike Yuan, Ziyu Wang, Zihao Huang, Defa Zhu, Xun Zhou, Jingyi Yu, and Qiyang Min. Expert race: A flexible routing strategy for scaling diffusion transformer with mixture of experts, 2025. URL https://arxiv.org/abs/2503.16057

  14. [14]

DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers

    Minglei Shi, Ziyang Yuan, Haotian Yang, Xintao Wang, Mingwu Zheng, Xin Tao, Wenliang Zhao, Wenzhao Zheng, Jie Zhou, Jiwen Lu, Pengfei Wan, Di Zhang, and Kun Gai. Diffmoe: Dynamic token selection for scalable diffusion transformers, 2025. URLhttps://arxiv.org/abs/2503.14487

  15. [15]

    Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions, 2023

  16. [16]

    Step1X-Edit: A Practical Framework for General Image Editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, and Daxin Jiang. Step1x-edit: A practical framework for general image editing.arXiv pre...

  17. [17]

VINO: A unified visual generator with interleaved omnimodal context

    Junyi Chen, Tong He, Zhoujie Fu, Pengfei Wan, Kun Gai, and Weicai Ye. Vino: A unified visual generator with interleaved omnimodal context.arXiv preprint arXiv:2601.02358, 2026

  18. [18]

Omni-Video: Democratizing Unified Video Understanding and Generation

Zhiyu Tan, Hao Yang, Luozheng Qin, Jia Gong, Mengping Yang, and Hao Li. Omni-video: Democratizing unified video understanding and generation. arXiv preprint arXiv:2507.06119, 2025

  19. [19]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  20. [20]

    Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025

    Ming Tao et al. Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025

  21. [21]

    HunyuanVideo 1.5 Technical Report

    Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jiangfeng Xiong, Jie Jiang, Penghao Zhao, Qi Tian, Songtao Liu, Weijie Kong, Weiyan Wang, Xuefei Zhe, Yanxin Long, Yuanbo Peng, Zuozhuo Dai, et al. Hunyuanvideo 1.5 technical report, 2025. URLhttps://arxiv.org/abs/2511.18870

  22. [22]

    Longcat-video technical report

    Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, and Tong Zhang. Longcat-video technical report, 2025. URLhttps://arxiv.org/abs/2510.22200

  23. [23]

    Kling: A high-quality video generation model.https://klingai.com, 2024

    Kuaishou Technology. Kling: A high-quality video generation model.https://klingai.com, 2024. Accessed: 2025

  24. [24]

    Mammothmoda2: A unified ar-diffusion framework for multimodal understanding and generation

    Tao Shen, Xin Wan, Taicai Chen, Rui Zhang, Junwen Pan, Dawei Lu, Fanding Lei, Zhilin Lu, Yunfei Yang, Chen Cheng, et al. Mammothmoda2: A unified ar-diffusion framework for multimodal understanding and generation. arXiv preprint arXiv:2511.18262, 2025

  25. [25]

    Efficient training of diffusion mixture-of-experts models: A practical recipe.arXiv preprint arXiv:2512.01252, 2025

    Yahui Liu, Yang Yue, Jingyuan Zhang, Chenxi Sun, Yang Zhou, Wencong Zeng, Ruiming Tang, and Guorui Zhou. Efficient training of diffusion mixture-of-experts models: A practical recipe.arXiv preprint arXiv:2512.01252, 2025

  26. [26]

    Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer.arXiv preprint arXiv:2505.22705, 2025

    Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, Yimeng Wang, Kai Yu, Wenxuan Chen, Ziwei Feng, Zijian Gong, Jianzhuang Pan, Yi Peng, Rui Tian, Siyu Wang, Bo Zhao, Ting Yao, and Tao Mei. HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer.ar...

  27. [27]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  28. [29]

    Sigmoid gating is more sample efficient than softmax gating in mixture of experts

    Huy Nguyen, Pedram Akbarian, Nhat Ho, and Alessandro Rinaldo. Sigmoid gating is more sample efficient than softmax gating in mixture of experts. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  29. [30]

Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

    Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts.arXiv e-prints, art. arXiv:2408.15664, August 2024. doi: 10.48550/arXiv.2408.15664

  30. [31]

    Sparse upcycling: Training mixture-of-experts from dense checkpoints

    Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints. In International Conference on Learning Representations (ICLR), 2023

  31. [32]

    OmniGen: Unified image generation

    Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  32. [33]

    ByT5: Towards a token-free future with pre-trained byte-to-byte models

    Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. ByT5: Towards a token-free future with pre-trained byte-to-byte models. InTransactions of the Association for Computational Linguistics (TACL), volume 10, pages 291–306, 2022

  33. [34]

Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

    Yi Zhang et al. Bee: A high-quality corpus and full-stack suite to unlock advanced fully open mllms.arXiv preprint arXiv:2510.13795, 2025

  34. [35]

Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

    Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, et al. Scaling instruction-based video editing with a high-quality synthetic dataset.arXiv preprint arXiv:2510.15742, 2025

  35. [36]

Region-Constraint In-Context Generation for Instructional Video Editing

Zhongwei Zhang, Fuchen Long, Wei Li, Zhaofan Qiu, Wu Liu, Ting Yao, and Tao Mei. Region-Constraint In-Context Generation for Instructional Video Editing. arXiv preprint arXiv:2512.17650, 2025

  36. [37]

OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing

    Haoyang He, Jie Wang, Jiangning Zhang, Zhucun Xue, Xingyuan Bu, Qiangpeng Yang, Shilei Wen, and Lei Xie. Openve-3m: A large-scale high-quality dataset for instruction-guided video editing.arXiv preprint arXiv:2512.07826, 2025

  37. [38]

VACE: All-in-One Video Creation and Editing

    Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing.arXiv preprint arXiv:2503.07598, 2025

  38. [39]

    One-step diffusion with distribution matching distillation.arXiv preprint arXiv:2311.18828, 2023

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation.arXiv preprint arXiv:2311.18828, 2024

  39. [40]

    Distribution matching distillation meets reinforcement learning.arXiv preprint arXiv:2511.13649, 2025

Dengyang Jiang et al. Distribution matching distillation meets reinforcement learning. arXiv preprint arXiv:2511.13649, 2025

  40. [41]

    DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process.arXiv preprint arXiv:2509.16117, 2025

  41. [42]

    W. Feng, W. Constable, and Y. Mao. Getting started with fully sharded data parallel (FSDP2).PyTorch Official Tutorials, March 2022

  42. [43]

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

    Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale.arXiv e-prints, art. arXiv:2201.05596, January 2022. doi: 10.48550/arXiv.2201.05596

  43. [44]

USP: A Unified Sequence Parallelism Approach for Long Context Generative AI

    Jiarui Fang and Shangchun Zhao. USP: A Unified Sequence Parallelism Approach for Long Context Generative AI. arXiv e-prints, art. arXiv:2405.07719, May 2024. doi: 10.48550/arXiv.2405.07719

  44. [45]

USP: A Unified Sequence Parallelism Approach for Long Context Generative AI

    Jiarui Fang and Shangchun Zhao. Usp: A unified sequence parallelism approach for long context generative ai, 2024. URLhttps://arxiv.org/abs/2405.07719

  45. [46]

    VBench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  46. [47]

    VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, and Ziwei Liu. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025

  47. [48]

    Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models.arXiv preprint arXiv:2405.04233, 2024

    Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models.arXiv preprint arXiv:2405.04233, 2024

  48. [49]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, et al. Seedance 1.0: Exploring the Boundaries of Video Generation Models.arXiv e-prints, art. arXiv:2506.09113, June 2025. doi: 10.48550/arXiv.2506.09113

  49. [50]

    Veo 3 technical report.https://deepmind.google/technologies/veo, 2025

    Google DeepMind. Veo 3 technical report.https://deepmind.google/technologies/veo, 2025. Accessed: 2025

  50. [51]

    Five-bench: A fine-grained video editing benchmark for evaluating emerging diffusion and rectified flow models

    Minghan Li, Chenxi Xie, Yichen Wu, Lei Zhang, and Mengyu Wang. Five-bench: A fine-grained video editing benchmark for evaluating emerging diffusion and rectified flow models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16672–16681, 2025

  51. [52]

    Insvie-1m: Effective instruction-based video editing with elaborate dataset construction

    Yuhui Wu, Liyi Chen, Ruibin Li, Shihao Wang, Chenxi Xie, and Lei Zhang. Insvie-1m: Effective instruction-based video editing with elaborate dataset construction. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16692–16701, 2025

  52. [53]

    Lucy edit: High-fidelity text-guided video editing

Decart AI. Lucy edit: High-fidelity text-guided video editing. https://huggingface.co/decart-ai/Lucy-Edit-Dev, 2025

  53. [54]

In-Context Learning with Unpaired Clips for Instruction-Based Video Editing

    Xinyao Liao, Xianfang Zeng, Ziye Song, Zhoujie Fu, Gang Yu, and Guosheng Lin. In-context learning with unpaired clips for instruction-based video editing.arXiv preprint arXiv:2510.14648, 2025

  54. [55]

    Pixverse-r1: Next-generation real-time world model.https://pixverse.ai, 2026

PixVerse Team. Pixverse-r1: Next-generation real-time world model. https://pixverse.ai, 2026

  55. [56]

    Kiwi-edit: Versatile video editing via instruction and reference guidance.arXiv preprint arXiv:2603.02175, 2026

    Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, and Mike Zheng Shou. Kiwi-edit: Versatile video editing via instruction and reference guidance, 2026. URLhttps://arxiv.org/abs/2603.02175

  56. [57]

    Univideo: Unified understanding, generation, and editing for videos.arXiv preprint arXiv:2510.08377, 2025a

    Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, and Wenhu Chen. Univideo: Unified understanding, generation, and editing for videos.arXiv preprint arXiv:2510.08377, 2025

  57. [58]

    Omniweaving: Towards unified video generation with free-form composition and reasoning.arXiv preprint arXiv:2603.24458, 2026

    Kaihang Pan, Qi Tian, Jianwei Zhang, Weijie Kong, Jiangfeng Xiong, Yanxin Long, Shixue Zhang, Haiyi Qiu, Tan Wang, Zheqi Lv, et al. Omniweaving: Towards unified video generation with free-form composition and reasoning. arXiv preprint arXiv:2603.24458, 2026

  58. [59]

    TokenFlow: Consistent diffusion features for consistent video editing

    Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. TokenFlow: Consistent diffusion features for consistent video editing. InInternational Conference on Learning Representations (ICLR), 2024

  59. [60]

    Space-time diffusion features for zero-shot text-driven motion transfer

    Danah Yatim, Rafail Fridman, Omer Bar-Tal, Yoni Kasten, and Tali Dekel. Space-time diffusion features for zero-shot text-driven motion transfer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  60. [61]

    VidToMe: Video token merging for zero-shot video editing

    Xirui Li, Chao Ma, Xiaokang Yang, and Ming-Hsuan Yang. VidToMe: Video token merging for zero-shot video editing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  61. [62]

    AnyV2V: A plug-and-play framework for any video-to-video editing tasks.Transactions on Machine Learning Research (TMLR), 2024

    Max Ku, Cong Wei, Weiming Ren, Huan Yang, and Wenhu Chen. AnyV2V: A plug-and-play framework for any video-to-video editing tasks.Transactions on Machine Learning Research (TMLR), 2024

  62. [63]

    VideoGrain: Modulating space-time attention for multi-grained video editing

    Yuxuan Xu et al. VideoGrain: Modulating space-time attention for multi-grained video editing. InInternational Conference on Learning Representations (ICLR), 2025

  63. [64]

    Omni-video 2: Scaling mllm-conditioned diffusion for unified video generation and editing.arXiv preprint arXiv:2602.08820, 2026

    Junke Wang et al. Omni-Video 2: Scaling MLLM-conditioned diffusion for unified video generation and editing.arXiv preprint arXiv:2602.08820, 2025

  64. [65]

    Context Unrolling in Omni Models

    Ceyuan Yang, Zhijie Lin, Yang Zhao, Fei Xiao, Hao He, Qi Zhao, Chaorui Deng, Kunchang Li, Zihan Ding, Yuwei Guo, Fuyun Wang, Fangqi Zhu, Xiaonan Nie, Shenhan Zhu, Shanchuan Lin, Hongsheng Li, Weilin Huang, Guang Shi, and Haoqi Fan. Context unrolling in omni models.arXiv preprint arXiv:2604.21921, 2026

  65. [66]

    Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36, 2024

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36, 2024

  66. [67]

    SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

    Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with linear diffusion transformers.arXiv preprint arXiv:2410.10629, 2024

  67. [68]

    FLUX.1: A text-to-image generation model.https://blackforestlabs.ai, 2024

    Black Forest Labs. FLUX.1: A text-to-image generation model.https://blackforestlabs.ai, 2024. Accessed: 2025

  68. [69]

    Improving image generation with better captions.Computer Science

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023

  69. [70]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Robin Rombach, et al. Scaling rectified flow transformers for high-resolution image synthesis.arXiv preprint arXiv:2403.03206, 2024

  70. [71]

    Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer.arXiv preprint arXiv:2505.22705, 2025

    Feng Chen et al. HiDream-I1: A high-efficient image generative foundation model with sparse diffusion transformer. arXiv preprint arXiv:2505.22705, 2025

  71. [72]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

  72. [73]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  73. [74]

    Seedream 3.0 Technical Report

    Yuying Guo et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025

  74. [75]

    ImgEdit: A Unified Image Editing Dataset and Benchmark

Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275, 2025

  75. [76]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  76. [77]

    GPT-4o System Card

    OpenAI. Gpt-4o system card, 2024. URLhttps://arxiv.org/abs/2410.21276

  77. [78]

    Seedream 4.0: Toward Next-generation Multimodal Image Generation

    Yu Gao et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025

  78. [79]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Bin Lin, Yuwei Niu, et al. Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation.arXiv preprint arXiv:2506.03147, 2025

  79. [80]

    Omnigen2: Exploration to advanced multimodal generation, 2025

    Junjie Zhou, Shitao Xiao, Yueze Wang, Tiejun Huang, and Zheng Liu. Omnigen2: Exploration to advanced multimodal generation, 2025

  80. [81]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

Showing first 80 references.