6 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 6
citing papers explorer
- HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos
  HL-OutPaint enables high-resolution outpainting of long video sequences via a coarse-to-fine pipeline that first builds Global Coarse Guidance through global-local frame swapping then synthesizes details.
- Diffusion Domain Expansion: Learning to Coordinate Pre-trained Diffusion Models
  DDE introduces a compact coordinator network that combines denoised outputs from pre-trained diffusion models to enable generation in larger domains and complex conditioning settings.
- Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers
  Text embeddings in MM-DiTs encode a detectable omission signal for missing concepts; amplifying it via OSI reduces concept omission in text-to-image outputs on FLUX.1-Dev and SD3.5-Medium.
- Spherical Flows for Sampling Categorical Data
  Spherical flows on S^{d-1} with vMF noise reduce the continuity equation to a scalar ODE in cosine similarity, yielding posterior-weighted marginal velocity and score that enable ODE and predictor-corrector sampling for categorical sequences, with the posterior trained by cross-entropy and empirical
- Scaling Categorical Flow Maps
  Categorical flow matching models scale to 1.7B parameters on 2.1T tokens, enabling 4-step text generation with competitive quality and benchmark performance.
- Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
  Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.