{"paper":{"title":"Autoregressive Video Generation without Vector Quantization","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Video generation can be done autoregressively without vector quantization by predicting frames sequentially in time and sets spatially within each frame.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Haiwen Diao, Haoge Deng, Huchuan Lu, Shiguang Shan, Ting Pan, Xinlong Wang, Yonggang Qi, Yufeng Cui, Zhengxiong Luo","submitted_at":"2024-12-18T18:59:53Z","abstract_excerpt":"This paper presents a novel approach that enables autoregressive video generation with high efficiency. We propose to reformulate the video generation problem as a non-quantized autoregressive modeling of temporal frame-by-frame prediction and spatial set-by-set prediction. Unlike raster-scan prediction in prior autoregressive models or joint distribution modeling of fixed-length tokens in diffusion models, our approach maintains the causal property of GPT-style models for flexible in-context capabilities, while leveraging bidirectional modeling within individual frames for efficiency. With th"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity, i.e., 0.6B parameters. NOVA also outperforms state-of-the-art image diffusion models in text-to-image generation tasks, with a significantly lower training cost.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That non-quantized autoregressive modeling via temporal frame-by-frame prediction and spatial set-by-set prediction can preserve sufficient visual information and coherence without the discretization step of vector quantization.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"NOVA reformulates video generation as non-quantized autoregressive frame-by-frame temporal prediction combined with set-by-set spatial prediction, outperforming prior AR video models and some diffusion models in efficiency and quality.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Video generation can be done autoregressively without vector quantization by predicting frames sequentially in time and sets spatially within each frame.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"6a28aa8ea4d1ca335decc9c745608a854e465cdbcbdd327cb0a7ee77b3ee2a9e"},"source":{"id":"2412.14169","kind":"arxiv","version":2},"verdict":{"id":"786057d5-cf12-45f8-86eb-b80190d51198","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T15:02:54.272971Z","strongest_claim":"NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity, i.e., 0.6B parameters. NOVA also outperforms state-of-the-art image diffusion models in text-to-image generation tasks, with a significantly lower training cost.","one_line_summary":"NOVA reformulates video generation as non-quantized autoregressive frame-by-frame temporal prediction combined with set-by-set spatial prediction, outperforming prior AR video models and some diffusion models in efficiency and quality.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That non-quantized autoregressive modeling via temporal frame-by-frame prediction and spatial set-by-set prediction can preserve sufficient visual information and coherence without the discretization step of vector quantization.","pith_extraction_headline":"Video generation can be done autoregressively without vector quantization by predicting frames sequentially in time and sets spatially within each frame."},"references":{"count":36,"sample":[{"doi":"","year":null,"title":"PaLM 2 Technical Report","work_id":"905ee9a7-ea61-4a94-bd62-2600cbe3e315","ref_index":1,"cited_arxiv_id":"2305.10403","is_internal_anchor":true},{"doi":"","year":null,"title":"Imagen 3.arXiv preprint arXiv:2408.07009, 2024","work_id":"a1dd317f-8300-4a79-a1d0-92ddd93fa983","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets","work_id":"4f68eada-27e3-437a-a2fe-6e4ca524d0d3","ref_index":3,"cited_arxiv_id":"2311.15127","is_internal_anchor":true},{"doi":"","year":null,"title":"Chameleon: Mixed-Modal Early-Fusion Foundation Models","work_id":"2661b9a6-25cc-41a1-8100-612d2b801289","ref_index":4,"cited_arxiv_id":"2405.09818","is_internal_anchor":true},{"doi":"","year":null,"title":"Muse: Text-to-image generation via masked generative transformers","work_id":"ad8925f8-72d8-4ac4-88b8-027e08b46103","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":36,"snapshot_sha256":"af959daab140b38a113ec4657b3b0246e069b569d8217348bd3e76a38c0b96ee","internal_anchors":21},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}