Kling-Omni Technical Report

Kling Team: Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo · 2025 · cs.CV · arXiv 2512.16776

Canonical reference. 93% of citing Pith papers cite this work as background.

43 Pith papers citing it

abstract

We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly-intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating and interacting with the dynamic and complex worlds.

citation-role summary

background 13 · baseline 1

citation-polarity summary

background 13 · baseline 1

representative citing papers

GroundShot: Visually Consistent Multi-Shot Long Video Generation via Entity-Grounded Shot Scheduling

cs.CV · 2026-06-18 · unverdicted · novelty 7.0

GroundShot introduces entity-grounded shot scheduling with online visual memory to improve consistency in multi-shot video generation and presents GroundBench for entity-level evaluation.

Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

Dream.exe evaluates 8 video generation models on 101 manipulation tasks by converting generated videos into executable robot trajectories in a simulator, finding measurable success rates that visual metrics do not predict.

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

The PaSBench-Video benchmark shows that no tested MLLM exceeds 20% on strict proactive safety metrics, and that recall has a 0.64 correlation with the false-positive rate on safe clips.

Explainable Forensics of Manipulated Segments in Untrimmed Long Videos

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

Introduces TASLE benchmark and MSLoc baseline for temporal localization and explanation of manipulated segments in long videos.

Future Forcing: Future-aware Training-free KV Cache Policy for Autoregressive Video Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Future Forcing constructs a future query proxy from historical pre-RoPE statistics to score and merge KV tokens, improving subject consistency by up to 1.49 on VBench-Long for 60s AR video generation.

What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

What-If World is a new paired-prompt benchmark showing that nine state-of-the-art video generation models achieve at most 52% on causal intervention tests and cluster near 28% for open-source systems.

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

LongAV-Compass is a new benchmark and evaluation framework for minute-scale audio-visual generation across T2AV, I2AV, and V2AV with multi-dimensional assessment.

Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration

cs.CV · 2026-05-17 · unverdicted · novelty 7.0

Soap2Soap uses a multi-agent system with dual-bridge consistency via JSON screenplays and visual anchors plus batch keyframe generation to achieve better long-term consistency in cinematic video remaking than commercial APIs.

StreamingEffect: Real-Time Human-Centric Video Effect Generation

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.

Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

cs.CV · 2026-05-15 · unverdicted · novelty 7.0

Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.

Do Joint Audio-Video Generation Models Understand Physics?

cs.SD · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

AV-Phys Bench shows that current joint audio-video models lack robust physical commonsense, with major drops on transitions and deliberate anti-physics prompts.

AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

cs.CV · 2026-05-05 · unverdicted · novelty 7.0 · 3 refs

AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.

Geometry-Instructed Video Editing

cs.CV · 2026-06-23 · unverdicted · novelty 6.0

GIVE uses paired depth-box and orientation-box geometry streams plus a graphics pipeline to train models for consistent object-level geometric edits like translate, rotate, and scale in videos.

MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer

cs.CV · 2026-06-18 · unverdicted · novelty 6.0

MakeupMirror reports 60% better facial similarity and 50% less skin tone change than Stable-Makeup using geometry, region, and tone controls in diffusion models.

Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering

cs.CV · 2026-06-15 · unverdicted · novelty 6.0

REINS uses supervised PCA on safety-labeled activations to find a linear direction that, when added to hidden states at roughly 50% depth in video diffusion transformers, redirects generations from unsafe to safe content across multiple models.

SpecLoR: Spectral Lookahead Rectification for Motion-Coherent Text-to-Video Generation

cs.CV · 2026-06-10 · unverdicted · novelty 6.0

SpecLoR rectifies the amplitude spectrum of lookahead-estimated clean latents to natural-video priors during early ODE sampling steps, cutting physical artifacts with only four extra NFEs.

Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning

cs.RO · 2026-06-01 · unverdicted · novelty 6.0

Dexterity-BEV creates 3D vertex-based inputs and BEV-aligned outputs to reduce spatial-temporal misalignments in end-to-end robot policies trained on diverse datasets and embodiments.

Refining Multidimensional Video Reward Models via Disentangled Influence Functions

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

Introduces dimension-disentangled influence estimation to prune or reweight training samples for MVRMs, outperforming global scalar filtering in alignment with ground truth.

SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

SmartDirector generates cinematic videos via Director-Gen for low-res keyframe-conditioned output followed by Director-SR refinement using high-res keyframes, trained on curated movie sequences.

VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

VISTA introduces a new synthetic triplet dataset and diffusion-transformer framework with style adapter that jointly models style, content, and motion to achieve state-of-the-art video style transfer.

MiVE: Multiscale Vision-language features for reference-guided video Editing

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer for reference-guided video editing, claiming top human preference scores over prior methods.

OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.

From Priors to Perception: Grounding Video-LLMs in Physical Reality

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard LoRA fine-tuning.
