A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.
LARP: Tokenizing videos with a learned autoregressive generative prior
4 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV 4
verdicts: UNVERDICTED 4
roles: baseline 1
polarities: baseline 1
representative citing papers
Prologue adds a small set of learnable tokens trained exclusively with AR cross-entropy loss to decouple generation from reconstruction in autoregressive visual models, yielding lower gFID on ImageNet 256x256.
KeyframeFace uses LLM priors and semantic keyframe supervision in ARKit space to produce language-driven facial animations with improved fidelity and interpretability over continuous regression methods.
A parameter-free approach drops redundant video tokens via temporal L1 differences in frozen latent space and reconstructs them with LIT, yielding 31x speedup over ElasticTok-CV on TokenBench and DAVIS.
citing papers explorer
- Autoregressive Visual Generation Needs a Prologue