Kling-Omni Technical Report
Canonical reference. 93% of citing Pith papers cite this work as background.
abstract
We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly-intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating and interacting with the dynamic and complex worlds.
years
2026 34
representative citing papers
Future Forcing constructs a future query proxy from historical pre-RoPE statistics to score and merge KV tokens, improving subject consistency by up to 1.49 on VBench-Long for 60s AR video generation.
What-If World is a new paired-prompt benchmark showing that nine state-of-the-art video generation models achieve at most 52% on causal intervention tests, with open-source systems clustering near 28%.
LongAV-Compass is a new benchmark and evaluation framework for minute-scale audio-visual generation across T2AV, I2AV, and V2AV with multi-dimensional assessment.
Soap2Soap uses a multi-agent system with dual-bridge consistency via JSON screenplays and visual anchors plus batch keyframe generation to achieve better long-term consistency in cinematic video remaking than commercial APIs.
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.
CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.
AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.
Introduces dimension-disentangled influence estimation to prune or reweight training samples for MVRMs, outperforming global scalar filtering in alignment with ground truth.
SmartDirector generates cinematic videos via Director-Gen for low-res keyframe-conditioned output followed by Director-SR refinement using high-res keyframes, trained on curated movie sequences.
VISTA introduces a new synthetic triplet dataset and diffusion-transformer framework with style adapter that jointly models style, content, and motion to achieve state-of-the-art video style transfer.
MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer for reference-guided video editing, claiming top human preference scores over prior methods.
OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.
Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard LoRA fine-tuning.
SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.
ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.
OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.
ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.
Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.
EasyVFX decouples VFX generation via frequency-aware Mixture-of-Experts and test-time training to achieve realistic effects with limited resources.
citing papers explorer
- PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning
PaSBench-Video benchmark shows that no tested MLLM exceeds 20% on strict proactive safety metrics, and that recall has a 0.64 correlation with the false-positive rate on safe clips.