pith. machine review for the scientific record.

arxiv: 2512.16776 · v1 · submitted 2025-12-18 · 💻 cs.CV

Recognition: 2 theorem links

Kling-Omni Technical Report

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords video generation · multimodal learning · generative AI · video editing · in-context generation · reasoning models · multimodal inputs

The pith

Kling-Omni unifies video generation, editing, and reasoning into a single end-to-end framework that accepts text, images, and video inputs to produce high-fidelity cinematic content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Kling-Omni as a generalist model that handles multiple video-related tasks together rather than through separate systems. It processes multimodal inputs into unified representations to generate videos, perform edits based on reasoning, and follow complex instructions. This matters because it could simplify workflows for creating intelligent video content and point toward more capable simulators of dynamic worlds. The approach relies on large-scale data and pre-training to achieve strong results in in-context scenarios.

Core claim

Kling-Omni is an end-to-end generative framework that bridges video generation, editing, and intelligent reasoning by converting diverse user inputs, such as text instructions, reference images, and video contexts, into a unified multimodal representation. Supported by a comprehensive data system and large-scale pre-training, it enables the creation of cinematic-quality video content.

What carries the argument

The unified multimodal representation that integrates text, image, and video inputs for joint generation and reasoning tasks.
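
The report gives no architectural details, so the following is an illustration only: a minimal sketch of what a unified multimodal representation could look like, in which each modality (text tokens, a reference image, a video context clip) is embedded separately and concatenated into one token sequence that a single generative backbone attends over. All function names, dimensions, and the concatenation scheme are hypothetical assumptions; nothing below is taken from Kling-Omni itself.

```python
# Illustrative sketch (not Kling-Omni's actual design): embed text, image, and
# video inputs separately, then concatenate them into one joint token sequence.
import numpy as np

D_MODEL = 64   # shared embedding width (assumed)
PATCH = 8      # spatial patch size for image/video tokens (assumed)

def embed_text(token_ids, vocab=1000):
    """Map token ids to vectors via a stand-in random embedding table."""
    table = np.random.default_rng(0).standard_normal((vocab, D_MODEL))
    return table[np.asarray(token_ids) % vocab]               # (T_text, D_MODEL)

def patchify(frames):
    """Flatten (T, H, W, C) frames into patch tokens projected to D_MODEL."""
    t, h, w, c = frames.shape
    p = frames.reshape(t, h // PATCH, PATCH, w // PATCH, PATCH, c)
    p = p.transpose(0, 1, 3, 2, 4, 5).reshape(t, -1, PATCH * PATCH * c)
    proj = np.random.default_rng(1).standard_normal((PATCH * PATCH * c, D_MODEL))
    return (p @ proj).reshape(-1, D_MODEL)                    # (T * patches, D_MODEL)

def unify(text_ids, ref_image, video_ctx):
    """Concatenate modality streams into a single sequence for joint attention."""
    streams = [
        embed_text(text_ids),
        patchify(ref_image[None]),   # a reference image treated as a 1-frame clip
        patchify(video_ctx),
    ]
    # Per-modality offsets stand in for learned modality/type embeddings.
    return np.concatenate([s + i for i, s in enumerate(streams)], axis=0)

if __name__ == "__main__":
    seq = unify(
        text_ids=[5, 42, 7],
        ref_image=np.zeros((32, 32, 3)),
        video_ctx=np.zeros((4, 32, 32, 3)),
    )
    print(seq.shape)   # one joint sequence a generator or editor would condition on
```

The point of the toy is only that generation, editing, and reasoning can all be phrased as conditioning on one such joint sequence; whether Kling-Omni does this via token concatenation, cross-attention, or some other mechanism is not stated in the report.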

Load-bearing premise

The constructed data system and pre-training strategies suffice to integrate generation, editing, and reasoning without hidden quality trade-offs or unexamined evaluation biases.

What would settle it

A head-to-head comparison on a complex reasoning-based video editing benchmark where Kling-Omni produces lower fidelity or less coherent results than a specialized editing model.

read the original abstract

We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly-intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating and interacting with the dynamic and complex worlds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents Kling-Omni as an end-to-end generalist framework that unifies video generation, editing, and reasoning tasks from multimodal inputs (text instructions, reference images, video contexts). It describes construction of a comprehensive data system plus large-scale pre-training and infrastructure optimizations, and asserts that comprehensive evaluations show exceptional performance in in-context generation, reasoning-based editing, and multimodal instruction following, advancing toward multimodal world simulators.

Significance. If the integration claims hold with demonstrated performance, the work would offer a notable step toward unified multimodal video systems that avoid pipeline fragmentation. However, the manuscript supplies no quantitative metrics, baselines, ablations, or failure analysis, so its significance cannot be assessed beyond the level of a high-level system description.

major comments (1)
  1. [Abstract] The central claim that 'comprehensive evaluations reveal exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following' is unsupported by any reported metrics, baselines, ablation studies, or error analysis. This absence directly undermines verification of the key assertion that the unified data-plus-pre-training approach delivers integrated performance without hidden trade-offs.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our technical report. We address the single major comment below and outline the planned revisions.

read point-by-point responses
  1. Referee: [Abstract] The central claim that 'comprehensive evaluations reveal exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following' is unsupported by any reported metrics, baselines, ablation studies, or error analysis. This absence directly undermines verification of the key assertion that the unified data-plus-pre-training approach delivers integrated performance without hidden trade-offs.

    Authors: We agree that the abstract's phrasing implies quantitative support that is not present in the manuscript. Kling-Omni is released as a technical report describing system design, data curation, and training infrastructure rather than a full empirical study; quantitative benchmarks and ablations remain internal due to proprietary data and model constraints. The referenced 'evaluations' consist of qualitative case studies and internal user assessments. We will revise the abstract to state that Kling-Omni 'demonstrates strong qualitative performance' in the listed tasks, remove the word 'exceptional', and add a dedicated Limitations section that explicitly notes the absence of public quantitative metrics and baselines. This change will be reflected in the next version. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The Kling-Omni Technical Report is a high-level descriptive document outlining a multimodal video generation system, its data construction, pre-training strategies, and claimed capabilities from evaluations. No mathematical derivations, equations, fitted parameters, predictions, or first-principles results are present that could reduce to inputs by construction. No self-citations of theorems, uniqueness claims, or ansatzes appear in a load-bearing role. The central assertions rest on external evaluations rather than any internal chain that loops back to the paper's own definitions or fits, making the report self-contained with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The report relies on standard deep-learning assumptions for generative models and the sufficiency of its proprietary data system; beyond the single domain assumption recorded below, no explicit free parameters, axioms, or invented entities are detailed in the abstract.

axioms (1)
  • domain assumption: Standard assumptions in large-scale deep learning for generative models hold for multimodal video synthesis.
    Implicit foundation for any end-to-end video generation framework.

pith-pipeline@v0.9.0 · 5746 in / 1084 out tokens · 23615 ms · 2026-05-15T20:56:43.030444+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MiVE: Multiscale Vision-language features for reference-guided video Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer, achieving top human preference in reference-guided video editing.

  2. CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

    cs.CV 2026-05 unverdicted novelty 7.0

    CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...

  3. Do Joint Audio-Video Generation Models Understand Physics?

    cs.SD 2026-05 unverdicted novelty 7.0

    Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.

  4. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos using a production knowledge taxonomy, dual-channel conditioning, style-motion curriculum, and deformation-aware preference optimization, outperforming baselines in animator evaluation...

  5. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional ani...

  6. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos using a structured taxonomy of artistic production variables, dual-channel conditioning, a style-motion curriculum, and deformation-aware optimization to prioritize art over physics.

  7. Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Delta Forcing uses latent trajectory deltas to adaptively limit unreliable teacher guidance while enforcing monotonic continuity, improving temporal consistency in interactive autoregressive video generation.

  8. OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

    cs.CV 2026-05 unverdicted novelty 6.0

    OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.

  9. From Priors to Perception: Grounding Video-LLMs in Physical Reality

    cs.CV 2026-05 unverdicted novelty 6.0

    Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard L...

  10. SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 55+ Sign Languages

    cs.CV 2026-05 unverdicted novelty 6.0

    SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.

  11. ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control

    cs.RO 2026-04 unverdicted novelty 6.0

    ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.

  12. How Far Are Video Models from True Multimodal Reasoning?

    cs.CV 2026-04 unverdicted novelty 6.0

    Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

  13. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  14. OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.

  15. InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation

    cs.CV 2026-04 unverdicted novelty 6.0

    InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.

  16. ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

    cs.CV 2026-04 unverdicted novelty 6.0

    ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.

  17. A Systematic Post-Train Framework for Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.

  18. On Semiotic-Grounded Interpretive Evaluation of Generative Art

    cs.CV 2026-04 unverdicted novelty 5.0

    SemJudge uses a Hierarchical Semiosis Graph based on Peircean theory to evaluate deeper artistic meaning in generative art and aligns better with human judgments than prior metrics.

  19. Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    cs.CV 2026-04 unverdicted novelty 4.0

    Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...

  20. OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization

    cs.CV 2026-02 unverdicted novelty 4.0

    OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.

  21. LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

    cs.CV 2026-04 unverdicted novelty 3.0

    This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 19 Pith papers · 14 internal anchors

  1. [1]

    Video generation models as world simulators. OpenAI, 2024

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    From structure to detail: Hierarchical distillation for efficient diffusion model.arXiv preprint arXiv:2511.08930, 2025

    Hanbo Cheng, Peng Wang, Kaixiang Lei, Qi Li, Zhen Zou, Pengfei Hu, and Jun Du. From structure to detail: Hierarchical distillation for efficient diffusion model.arXiv preprint arXiv:2511.08930, 2025

  4. [4]

    https://deepmind.google/models/gemini-image/pro/

    Google Deepmind. https://deepmind.google/models/gemini-image/pro/

  5. [5]

    Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution.Advances in Neural Information Processing Systems, 36:2252–2274, 2023

    Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution.Advances in Neural Information Processing Systems, 36:2252–2274, 2023

  6. [6]

    Dapple: A pipelined data parallel approach for training large models

    Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, et al. Dapple: A pipelined data parallel approach for training large models. InProceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 431–445, 2021

  7. [7]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113, 2025

  8. [8]

    https://aistudio.google.com/models/veo-3

    Google. https://aistudio.google.com/models/veo-3

  9. [9]

    PipeDream: Fast and Efficient Pipeline Parallel DNN Training

    Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. Pipedream: Fast and efficient pipeline parallel dnn training.arXiv preprint arXiv:1806.03377, 2018

  10. [10]

    Re-evaluating the memory-balanced pipeline parallelism: Bpipe.arXiv preprint arXiv:2401.02088, 2024

    Mincong Huang, Chao Wang, Chi Ma, Yineng Zhang, Peng Zhang, and Lei Yu. Re-evaluating the memory-balanced pipeline parallelism: Bpipe.arXiv preprint arXiv:2401.02088, 2024

  11. [11]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

  12. [12]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  13. [14]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models, 2023. https://arxiv.org/abs/2309.14509

  14. [15]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  15. [16]

    Ring Attention with Blockwise Transformers for Near-Infinite Context

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889, 2023

  16. [17]

    Learning few-step diffusion models by trajectory distribution matching.arXiv preprint arXiv:2503.06674, 2025

    Yihong Luo, Tianyang Hu, Jiacheng Sun, Yujun Cai, and Jing Tang. Learning few-step diffusion models by trajectory distribution matching.arXiv preprint arXiv:2503.06674, 2025

  17. [18]

    Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

    Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7739–7751, 2025

  18. [19]

    Memory-efficient pipeline-parallel dnn training

    Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. Memory-efficient pipeline-parallel dnn training. InInternational Conference on Machine Learning, pages 7937–7947. PMLR, 2021

  19. [20]

    Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of...

  20. [21]

    Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

  21. [22]

    Hyper-sd: Trajectory segmented consistency model for efficient image synthesis.Advances in Neural Information Processing Systems, 37:117340–117362, 2024

    Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, and Xuefeng Xiao. Hyper-sd: Trajectory segmented consistency model for efficient image synthesis.Advances in Neural Information Processing Systems, 37:117340–117362, 2024

  22. [23]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  23. [24]

    https://app.runwayml.com/

    Runway. https://app.runwayml.com/

  24. [25]

    Flashattention-3: Fast and accurate attention with asynchrony and low-precision.Advances in Neural Information Processing Systems, 37:68658–68685, 2024

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision.Advances in Neural Information Processing Systems, 37:68658–68685, 2024

  25. [26]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300, 2(3):5, 2024

  26. [27]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020. https://arxiv.org/abs/1909.08053

  27. [28]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  28. [29]

    Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

    Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, et al. Thinking with video: Video generation as a promising multimodal reasoning paradigm.arXiv preprint arXiv:2511.04570, 2025

  29. [30]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  30. [31]

    Phased consistency models.Advances in neural information processing systems, 37:83951–84009, 2024

    Fu-Yun Wang, Zhaoyang Huang, Alexander Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun, Weikang Bian, Guanglu Song, Yu Liu, et al. Phased consistency models.Advances in neural information processing systems, 37:83951–84009, 2024

  31. [32]

    Flexsp: Accelerating large language model training via flexible sequence parallelism

    Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, and Bin Cui. Flexsp: Accelerating large language model training via flexible sequence parallelism. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS ’2...

  32. [33]

    Video models are zero-shot learners and reasoners

    Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025

  33. [34]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

  34. [35]

    Improved distribution matching distillation for fast image synthesis.Advances in neural information processing systems, 37:47455–47487, 2024

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis.Advances in neural information processing systems, 37:47455–47487, 2024

  35. [36]

    Accelerating the training of large language models using efficient activation rematerialization and optimal hybrid parallelism

    Tailing Yuan, Yuliang Liu, Xucheng Ye, Shenglong Zhang, Jianchao Tan, Bin Chen, Chengru Song, and Di Zhang. Accelerating the training of large language models using efficient activation rematerialization and optimal hybrid parallelism. In2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 545–561, Santa Clara, CA, July 2024. USENIX Association....

  36. [37]

    Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation

    Mingyuan Zhou, Huangjie Zheng, Zhendong Wang, Mingzhang Yin, and Hai Huang. Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation. In Forty-first International Conference on Machine Learning, 2024.