LARP: Tokenizing videos with a learned autoregressive generative prior
3 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CV (3)
Verdicts: UNVERDICTED (3)
Roles: baseline (1)
Polarities: baseline (1)
Citing papers
- Diffusing in the Right Space: A Systematic Study of Latent Diffusability
  A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.
- Autoregressive Visual Generation Needs a Prologue
  Prologue adds a small set of learnable tokens trained exclusively with AR cross-entropy loss to decouple generation from reconstruction in autoregressive visual models, yielding lower gFID on ImageNet 256x256.
- KeyframeFace: Language-Driven Facial Animation via Semantic Keyframes
  KeyframeFace uses LLM priors and semantic keyframe supervision in ARKit space to produce language-driven facial animations with improved fidelity and interpretability over continuous regression methods.