Dream.exe evaluates 8 video generation models on 101 manipulation tasks by converting generated videos into executable robot trajectories in a simulator, finding measurable success rates that visual metrics do not predict.
hub Canonical reference
Kling-Omni Technical Report
Canonical reference. 93% of citing Pith papers cite this work as background.
abstract
We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly-intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating and interacting with the dynamic and complex worlds.
hub tools
citation-role summary
citation-polarity summary
years
2026 37verdicts
UNVERDICTED 37representative citing papers
PaSBench-Video benchmark shows no tested MLLM exceeds 20% on strict proactive safety metrics, with recall correlated 0.64 to false-positive rate on safe clips.
Introduces TASLE benchmark and MSLoc baseline for temporal localization and explanation of manipulated segments in long videos.
Future Forcing constructs a future query proxy from historical pre-RoPE statistics to score and merge KV tokens, improving subject consistency by up to 1.49 on VBench-Long for 60s AR video generation.
What-If World is a new paired-prompt benchmark showing that nine state-of-the-art video generation models achieve at most 52% on causal intervention tests and cluster near 28% for open-source systems.
LongAV-Compass is a new benchmark and evaluation framework for minute-scale audio-visual generation across T2AV, I2AV, and V2AV with multi-dimensional assessment.
Soap2Soap uses a multi-agent system with dual-bridge consistency via JSON screenplays and visual anchors plus batch keyframe generation to achieve better long-term consistency in cinematic video remaking than commercial APIs.
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.
CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.
AV-Phys Bench shows that current joint audio-video models lack robust physical commonsense, with major drops on transitions and deliberate anti-physics prompts.
AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.
Dexterity-BEV creates 3D vertex-based inputs and BEV-aligned outputs to reduce spatial-temporal misalignments in end-to-end robot policies trained on diverse datasets and embodiments.
Introduces dimension-disentangled influence estimation to prune or reweight training samples for MVRMs, outperforming global scalar filtering in alignment with ground truth.
SmartDirector generates cinematic videos via Director-Gen for low-res keyframe-conditioned output followed by Director-SR refinement using high-res keyframes, trained on curated movie sequences.
VISTA introduces a new synthetic triplet dataset and diffusion-transformer framework with style adapter that jointly models style, content, and motion to achieve state-of-the-art video style transfer.
MiVE repurposes VLMs as multiscale feature extractors integrated into a unified self-attention Diffusion Transformer for reference-guided video editing, claiming top human preference scores over prior methods.
OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.
Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard LoRA fine-tuning.
SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.
ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.
OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
citing papers explorer
-
Refining Multidimensional Video Reward Models via Disentangled Influence Functions
Introduces dimension-disentangled influence estimation to prune or reweight training samples for MVRMs, outperforming global scalar filtering in alignment with ground truth.