{"paper":{"title":"Long-Context Autoregressive Video Modeling with Next-Frame Prediction","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"Asymmetric patchify kernels enable efficient long-context autoregressive video modeling by exploiting context redundancy.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Mike Zheng Shou, Weijia Mao, Yuchao Gu","submitted_at":"2025-03-25T03:38:06Z","abstract_excerpt":"Long-context video modeling is essential for enabling generative models to function as world simulators, as they must maintain temporal coherence over extended time spans. However, most existing models are trained on short clips, limiting their ability to capture long-range dependencies, even with test-time extrapolation. While training directly on long videos is a natural solution, the rapid growth of vision tokens makes it computationally prohibitive. To support exploring efficient long-context video modeling, we first establish a strong autoregressive baseline called Frame AutoRegressive (F"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our method achieves state-of-the-art results on both short and long video generation, providing an effective baseline for long-context autoregressive video modeling.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that video autoregression exhibits exploitable context redundancy where distant frames can safely use large asymmetric patchify kernels without losing critical temporal information needed for coherence.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"FAR baseline plus asymmetric kernels for long short-term context modeling achieves SOTA short and long video generation in autoregressive setups.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Asymmetric patchify kernels enable efficient long-context autoregressive video modeling by exploiting context redundancy.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"65a74ae5bf4ee906459b1583a6ecc373b11f2aa0103b2aa0f9a3707363d1e5c6"},"source":{"id":"2503.19325","kind":"arxiv","version":3},"verdict":{"id":"fdbbb3af-5f6b-4b67-9121-70a825cda7fb","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T23:01:10.067285Z","strongest_claim":"Our method achieves state-of-the-art results on both short and long video generation, providing an effective baseline for long-context autoregressive video modeling.","one_line_summary":"FAR baseline plus asymmetric kernels for long short-term context modeling achieves SOTA short and long video generation in autoregressive setups.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that video autoregression exhibits exploitable context redundancy where distant frames can safely use large asymmetric patchify kernels without losing critical temporal information needed for coherence.","pith_extraction_headline":"Asymmetric patchify kernels enable efficient long-context autoregressive video modeling by exploiting context redundancy."},"references":{"count":60,"sample":[{"doi":"","year":2024,"title":"Video generation models as world simulators,","work_id":"36411502-be32-4aca-bb2e-6e69ad8e9542","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","ref_index":2,"cited_arxiv_id":"2503.20314","is_internal_anchor":true},{"doi":"","year":2025,"title":"Cosmos World Foundation Model Platform for Physical AI","work_id":"a2dba24c-318d-476a-8b21-4289c265810c","ref_index":3,"cited_arxiv_id":"2501.03575","is_internal_anchor":true},{"doi":"","year":2024,"title":"Freelong: Training-free long video generation with spectralblend temporal attention,","work_id":"2fe5a21b-1a98-4980-90ac-dcfa54e0e135","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Riflex: A free lunch for length extrapolation in video diffusion transformers","work_id":"027bc19e-1d61-407f-ae74-3f5cd543fa53","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":60,"snapshot_sha256":"d75ac10567aebd9e8e9460fe879d34a28cc0faca46b528215f2667e2fceaa7c1","internal_anchors":20},"formal_canon":{"evidence_count":2,"snapshot_sha256":"b53828b0889de6d19908a9a70984139ee92bb15b8bcfaf408317a489bebf1e61"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}