Recognition: 2 theorem links
Kling-Omni Technical Report
Pith reviewed 2026-05-15 20:56 UTC · model grok-4.3
The pith
Kling-Omni unifies video generation, editing, and reasoning into a single end-to-end framework that accepts text, images, and video inputs to produce high-fidelity cinematic content.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Kling-Omni is an end-to-end generative framework that bridges video generation, editing, and intelligent reasoning by converting diverse user inputs (text instructions, reference images, and video contexts) into a unified multimodal representation; supported by a comprehensive data system and large-scale pre-training, it enables the creation of cinematic-quality video content.
What carries the argument
The unified multimodal representation that integrates text, image, and video inputs for joint generation and reasoning tasks.
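To make the load-bearing idea concrete, here is a minimal sketch of what a unified multimodal representation can look like: each conditioning signal (text instruction, reference image, context video) is embedded into a shared token space, tagged by modality, and concatenated into one sequence that a single backbone attends over. The report does not publish its tokenizers or dimensions, so the patch size, embedding width, and random projections below are illustrative assumptions, not Kling-Omni's actual design.

```python
# Hypothetical sketch of a unified multimodal token sequence (not the report's
# actual architecture): embed every conditioning signal into a shared D-dim
# token space, add a modality tag, and concatenate into one sequence.
import numpy as np

D = 64                                             # shared token width (illustrative)
rng = np.random.default_rng(0)
TEXT_TABLE = rng.standard_normal((1000, D))        # stand-in text embedding table
PATCH_PROJ = rng.standard_normal((8 * 8 * 3, D))   # stand-in patch projection
MODALITY = rng.standard_normal((3, D))             # text / image / video tag vectors

def embed_text(token_ids):
    """Text instruction -> (T_text, D) tokens, tagged as text."""
    return TEXT_TABLE[np.asarray(token_ids)] + MODALITY[0]

def embed_frame(frame, patch=8):
    """One HxWx3 frame -> (num_patches, D) tokens via 8x8 patchification."""
    h, w, c = frame.shape
    p = frame.reshape(h // patch, patch, w // patch, patch, c)
    p = p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    return p @ PATCH_PROJ

def unified_sequence(text_ids, ref_image, context_video):
    """Concatenate all conditioning signals into one multimodal token sequence."""
    text_tok = embed_text(text_ids)
    img_tok = embed_frame(ref_image) + MODALITY[1]
    vid_tok = np.concatenate([embed_frame(f) for f in context_video]) + MODALITY[2]
    return np.concatenate([text_tok, img_tok, vid_tok], axis=0)

# Example: 12 text tokens, one 32x32 reference image, a 4-frame 32x32 context clip.
seq = unified_sequence(
    text_ids=rng.integers(0, 1000, size=12),
    ref_image=rng.standard_normal((32, 32, 3)),
    context_video=rng.standard_normal((4, 32, 32, 3)),
)
print(seq.shape)  # (92, 64): 12 text + 16 image-patch + 64 video-patch tokens
```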
Load-bearing premise
The constructed data system and pre-training strategies suffice to integrate generation, editing, and reasoning without hidden trade-offs in quality or hidden evaluation biases.
What would settle it
A head-to-head comparison against a specialized editing model on a complex reasoning-based video editing benchmark; the unification claim would be undercut if Kling-Omni produced lower-fidelity or less coherent results.
read the original abstract
We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly-intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating and interacting with the dynamic and complex worlds.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Kling-Omni as an end-to-end generalist framework that unifies video generation, editing, and reasoning tasks from multimodal inputs (text instructions, reference images, video contexts). It describes construction of a comprehensive data system plus large-scale pre-training and infrastructure optimizations, and asserts that comprehensive evaluations show exceptional performance in in-context generation, reasoning-based editing, and multimodal instruction following, advancing toward multimodal world simulators.
Significance. If the integration claims hold with demonstrated performance, the work would offer a notable step toward unified multimodal video systems that avoid pipeline fragmentation. However, the manuscript supplies no quantitative metrics, baselines, ablations, or failure analysis, so its significance cannot be assessed beyond the level of a high-level system description.
major comments (1)
- [Abstract] The central claim that 'comprehensive evaluations reveal exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following' is unsupported by any reported metrics, baselines, ablation studies, or error analysis. This absence directly undermines verification of the key assertion that the unified data-plus-pre-training approach delivers integrated performance without hidden trade-offs.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our technical report. We address the single major comment below and outline the planned revisions.
read point-by-point responses
- Referee: [Abstract] The central claim that 'comprehensive evaluations reveal exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following' is unsupported by any reported metrics, baselines, ablation studies, or error analysis. This absence directly undermines verification of the key assertion that the unified data-plus-pre-training approach delivers integrated performance without hidden trade-offs.
Authors: We agree that the abstract's phrasing implies quantitative support that is not present in the manuscript. Kling-Omni is released as a technical report describing system design, data curation, and training infrastructure rather than a full empirical study; quantitative benchmarks and ablations remain internal due to proprietary data and model constraints. The referenced 'evaluations' consist of qualitative case studies and internal user assessments. We will revise the abstract to state that Kling-Omni 'demonstrates strong qualitative performance' in the listed tasks, remove the word 'exceptional', and add a dedicated Limitations section that explicitly notes the absence of public quantitative metrics and baselines. This change will be reflected in the next version. revision: yes
Circularity Check
No significant circularity detected
full rationale
The Kling-Omni Technical Report is a high-level descriptive document outlining a multimodal video generation system, its data construction, pre-training strategies, and claimed capabilities from evaluations. No mathematical derivations, equations, fitted parameters, predictions, or first-principles results are present that could reduce to inputs by construction. No self-citations of theorems, uniqueness claims, or ansatzes appear in a load-bearing role. The central assertions rest on external evaluations rather than any internal chain that loops back to the paper's own definitions or fits, making the report self-contained with no circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard assumptions in large-scale deep learning for generative models hold for multimodal video synthesis.
Forward citations
Cited by 21 Pith papers
- MiVE: Multiscale Vision-language features for reference-guided video Editing · MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer, achieving top human preference in reference-guided video editing.
- CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating · CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...
- Do Joint Audio-Video Generation Models Understand Physics? · Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.
- AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics · AniMatrix generates anime videos using a production knowledge taxonomy, dual-channel conditioning, style-motion curriculum, and deformation-aware preference optimization, outperforming baselines in animator evaluation...
- AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics · AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional ani...
- AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics · AniMatrix generates anime videos using a structured taxonomy of artistic production variables, dual-channel conditioning, a style-motion curriculum, and deformation-aware optimization to prioritize art over physics.
- Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation · Delta Forcing uses latent trajectory deltas to adaptively limit unreliable teacher guidance while enforcing monotonic continuity, improving temporal consistency in interactive autoregressive video generation.
- OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation · OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.
- From Priors to Perception: Grounding Video-LLMs in Physical Reality · Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard L...
- SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 55+ Sign Languages · SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.
- ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control · ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.
- How Far Are Video Models from True Multimodal Reasoning? · Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
- Human Cognition in Machines: A Unified Perspective of World Models · The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
- OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation · OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
- InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation · InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.
- ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks · ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.
- A Systematic Post-Train Framework for Video Generation · A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
- On Semiotic-Grounded Interpretive Evaluation of Generative Art · SemJudge uses a Hierarchical Semiosis Graph based on Peircean theory to evaluate deeper artistic meaning in generative art and aligns better with human judgments than prior metrics.
- Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory · Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...
- OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization · OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.
- LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation · This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...
Reference graph
Works this paper leans on
- [1] OpenAI. Video generation models as world simulators. 2024.
- [2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [3] Hanbo Cheng, Peng Wang, Kaixiang Lei, Qi Li, Zhen Zou, Pengfei Hu, and Jun Du. From structure to detail: Hierarchical distillation for efficient diffusion model. arXiv preprint arXiv:2511.08930, 2025.
- [4] Google DeepMind. https://deepmind.google/models/gemini-image/pro/
- [5] Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36:2252–2274, 2023.
- [6] Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, et al. DAPPLE: A pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 431–445, 2021.
- [7] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025.
- [8]
- [9] Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. PipeDream: Fast and efficient pipeline parallel DNN training. arXiv preprint arXiv:1806.03377, 2018.
- [10] Mincong Huang, Chao Wang, Chi Ma, Yineng Zhang, Peng Zhang, and Lei Yu. Re-evaluating the memory-balanced pipeline parallelism: BPipe. arXiv preprint arXiv:2401.02088, 2024.
- [11] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems, 32, 2019.
- [12] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [14] Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence transformer models, 2023. https://arxiv.org/abs/2309.14509
- [15] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
- [16] Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889, 2023.
- [17] Yihong Luo, Tianyang Hu, Jiacheng Sun, Yujun Cai, and Jing Tang. Learning few-step diffusion models by trajectory distribution matching. arXiv preprint arXiv:2503.06674, 2025.
- [18] Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. JanusFlow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7739–7751, 2025.
- [19] Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. Memory-efficient pipeline-parallel DNN training. In International Conference on Machine Learning, pages 7937–7947. PMLR, 2021.
- [20] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of..., 2021.
- [21] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.
- [22] Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, and Xuefeng Xiao. Hyper-SD: Trajectory segmented consistency model for efficient image synthesis. Advances in Neural Information Processing Systems, 37:117340–117362, 2024.
- [23] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- [24]
- [25] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024.
- [26] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. https://arxiv.org/abs/2402.03300
- [27] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism, 2020. https://arxiv.org/abs/1909.08053
- [28] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [29] Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, et al. Thinking with video: Video generation as a promising multimodal reasoning paradigm. arXiv preprint arXiv:2511.04570, 2025.
- [30] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
- [31] Fu-Yun Wang, Zhaoyang Huang, Alexander Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun, Weikang Bian, Guanglu Song, Yu Liu, et al. Phased consistency models. Advances in Neural Information Processing Systems, 37:83951–84009, 2024.
- [32] Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, and Bin Cui. FlexSP: Accelerating large language model training via flexible sequence parallelism. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '25), 2025. doi:10.1145/3676641.3715998
- [33] Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025.
- [34] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025.
- [35] Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis. Advances in Neural Information Processing Systems, 37:47455–47487, 2024.
- [36] Tailing Yuan, Yuliang Liu, Xucheng Ye, Shenglong Zhang, Jianchao Tan, Bin Chen, Chengru Song, and Di Zhang. Accelerating the training of large language models using efficient activation rematerialization and optimal hybrid parallelism. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 545–561, Santa Clara, CA, July 2024. USENIX Association.
- [37] Mingyuan Zhou, Huangjie Zheng, Zhendong Wang, Mingzhang Yin, and Hai Huang. Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation. In Forty-first International Conference on Machine Learning, 2024.
discussion (0)