Omniweaving: Towards unified video generation with free-form composition and reasoning

Kaihang Pan, Qi Tian, Jianwei Zhang, Weijie Kong, Jiangfeng Xiong, Yanxin Long, Shixue Zhang, Haiyi Qiu, Tan Wang, Zheqi Lv, Yue Wu, Liefeng Bo, Siliang Tang, Zhao Zhong · 2026 · arXiv 2603.24458

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

baseline 2 background 1

citation-polarity summary

baseline 2 background 1

representative citing papers

What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

VLM-to-DiT alignment in video editing models acts as a semantic bottleneck that degrades fine-grained structural semantics, demonstrated via a new diagnostic dataset and protocol on relation-based edits.

StreamingEffect: Real-Time Human-Centric Video Effect Generation

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.

PhyGround: Benchmarking Physical Reasoning in Generative World Models

cs.CV · 2026-05-11 · accept · novelty 7.0

PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.

SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.

PhyWorld: Physics-Faithful World Model for Video Generation

cs.CV · 2026-05-19 · unverdicted · novelty 5.0

PhyWorld improves temporal consistency and physical plausibility in video world models via flow matching fine-tuning followed by DPO on physics preference pairs, with reported gains on VBench and a custom physical-faithfulness benchmark.

Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

cs.CV · 2026-05-04 · unverdicted · novelty 4.0

Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.

citing papers explorer

Showing 6 of 6 citing papers.

What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing cs.CV · 2026-05-20 · unverdicted · none · ref 14
VLM-to-DiT alignment in video editing models acts as a semantic bottleneck that degrades fine-grained structural semantics, demonstrated via a new diagnostic dataset and protocol on relation-based edits.
StreamingEffect: Real-Time Human-Centric Video Effect Generation cs.CV · 2026-05-16 · unverdicted · none · ref 50
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
PhyGround: Benchmarking Physical Reasoning in Generative World Models cs.CV · 2026-05-11 · accept · none · ref 30
PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness cs.CV · 2026-04-29 · unverdicted · none · ref 34
SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
PhyWorld: Physics-Faithful World Model for Video Generation cs.CV · 2026-05-19 · unverdicted · none · ref 63
PhyWorld improves temporal consistency and physical plausibility in video world models via flow matching fine-tuning followed by DPO on physics preference pairs, with reported gains on VBench and a custom physical-faithfulness benchmark.
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE cs.CV · 2026-05-04 · unverdicted · none · ref 58
Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.

Omniweaving: Towards unified video generation with free-form composition and reasoning

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer