{"paper":{"title":"Kling-Omni Technical Report","license":"http://creativecommons.org/licenses/by-nc-nd/4.0/","headline":"Kling-Omni unifies video generation, editing, and reasoning into a single end-to-end framework that accepts text, images, and video inputs to produce high-fidelity cinematic content.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Borui Liao, Boyuan Jiang, Chao Wang, Chenyu Wang, Da Xie, Fangyuan Kong, Feng Han, Guohao Wu, Guosheng Zhu, Hang Li, Hangyu Mao, Haodong Ouyang, Haozhi Sun, Jiajun Liang, Jie Li, Jingbin He, Kang He, Kling Team: Jialu Chen, Kun Gai, Lianghao Su, Meng Wang, Min Wei, Peiqin Sun, Pengfei Wan, Qingyu Li, Qiulin Wang, Quande Liu, Ruiliang Zhou, Runqi Wang, Sainan Guo, Shenglong Zhang, Shen Li, Shuaiyu Zhang, Shun Lu, Sile Yang, Tiancheng Wen, Wanqi Shi, Weicai Ye, Weihong Lin, Wenyu Qin, Wenzheng Zhao, Xiangyu Du, Xiaohan Li, Xiao Hu, Xiaohua Hu, Xiaokun Liu, Xiaoshi Wu, Xiaoyu Shi, Xintao Wang, Xuebo Wang, Yan Li, Yan Zhou, Yilun Liu, Yingtong Xiong, Yiqiao Liao, Yongjie Zhu, Yuanxing Zhang, Yuanzheng Ci, Yufan Zhang, Yuliang Liu, Yulong Xu, Yunyao Mao, Zekun Wang, Zhenhua Wu, Zikang Yang, Zipeng Feng, Ziyang Yuan","submitted_at":"2025-12-18T17:08:12Z","abstract_excerpt":"We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that the constructed comprehensive data system and large-scale pre-training strategies are sufficient to deliver the claimed integration of generation, editing, and reasoning without hidden performance trade-offs or evaluation biases.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Kling-Omni is a unified multimodal generative system that produces cinematic videos from diverse inputs by integrating generation, editing, and intelligent reasoning in a single end-to-end model.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Kling-Omni unifies video generation, editing, and reasoning into a single end-to-end framework that accepts text, images, and video inputs to produce high-fidelity cinematic content.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"908165395ecd4109a240059e2c32a65b3bcc9041e2f21eb5da45bf7367940b2f"},"source":{"id":"2512.16776","kind":"arxiv","version":1},"verdict":{"id":"8a208e22-eac1-4e0c-ac07-1f636e3779e3","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T20:56:43.030444Z","strongest_claim":"Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following.","one_line_summary":"Kling-Omni is a unified multimodal generative system that produces cinematic videos from diverse inputs by integrating generation, editing, and intelligent reasoning in a single end-to-end model.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that the constructed comprehensive data system and large-scale pre-training strategies are sufficient to deliver the claimed integration of generation, editing, and reasoning without hidden performance trade-offs or evaluation biases.","pith_extraction_headline":"Kling-Omni unifies video generation, editing, and reasoning into a single end-to-end framework that accepts text, images, and video inputs to produce high-fidelity cinematic content."},"references":{"count":36,"sample":[{"doi":"","year":2024,"title":"Video generation models as world simulators.OpenAI, 2024","work_id":"8e788c52-86c4-45b7-8c02-9a9933a4812d","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":2,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2025,"title":"From structure to detail: Hierarchical distillation for efficient diffusion model.arXiv preprint arXiv:2511.08930, 2025","work_id":"bdb1f4d7-aac5-4797-a6bd-2553e464cb3d","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"https://deepmind.google/models/gemini-image/pro/","work_id":"578365a4-5faf-4412-8f6e-dc48fca5e015","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution.Advances in Neural Information Processing Systems, 36:2252–2274","work_id":"af2aa358-10a6-409d-83c1-c2ec6427e78c","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":36,"snapshot_sha256":"09a3525342ac5d61e91b2e44ccd9c63e6f6a18e396bacab633024d996809dad9","internal_anchors":14},"formal_canon":{"evidence_count":2,"snapshot_sha256":"e4277bb682c40bd523969353a781b5ad2b13fb20ffb83be03f158916bf6f5693"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}