pith. sign in

hub Canonical reference

Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

Canonical reference. 100% of citing Pith papers cite this work as background.

21 Pith papers citing it
Background 100% of classified citations
abstract

Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.

hub tools

citation-role summary

background 14

citation-polarity summary

years

2026 21

polarities

background 14

representative citing papers

AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

cs.CV · 2026-05-05 · unverdicted · novelty 7.0 · 3 refs

AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.

Efficient Video Diffusion Models: Advancements and Challenges

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

Tracking High-order Evolutions via Cascading Low-rank Fitting

cs.LG · 2026-04-13 · unverdicted · novelty 7.0

Cascading low-rank fitting approximates successive high-order derivatives in diffusion models via a shared base function with sequentially added low-rank components, accompanied by theorems proving monotonic non-increasing ranks under linear decomposability and the possibility of arbitrary rank perm

ViPO: Visual Preference Optimization at Scale

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

Poly-DPO improves robustness to noisy preference data in visual models, and the new ViPO dataset enables superior performance, with the method reducing to standard DPO on high-quality data.

How Far Are Video Models from True Multimodal Reasoning?

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

Continuous Adversarial Flow Models

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.

Motif-Video 2B: Technical Report

cs.CV · 2026-04-14 · unverdicted · novelty 4.0 · 2 refs

Motif-Video 2B reaches 83.76% on VBench, outperforming a 14B-parameter model with 7x fewer parameters and far less training data through shared cross-attention and a three-part backbone.

Advancing Open-source World Models

cs.CV · 2026-01-28 · unverdicted · novelty 4.0

LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity under one second latency.

citing papers explorer

Showing 21 of 21 citing papers.