{"total":17,"items":[{"citing_arxiv_id":"2606.27964","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Directing the World: Fast Autoregressive Video Generation with Compositional Human-Camera Control","primary_cat":"cs.CV","submitted_at":"2026-06-26T11:08:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A decoupled-control autoregressive video model using Fast-Slow Memory training, dynamic projection, and staged camera control to produce stable long-horizon outputs with human and viewpoint guidance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09803","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Echo-Memory: A Controlled Study of Memory in Action World Models","primary_cat":"cs.CV","submitted_at":"2026-06-08T17:54:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A controlled study finds that block-wise state-space recurrence outperforms other memory designs for open-domain scene return in action-conditioned video models, and that standard replay metrics do not adequately measure memory quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04527","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation","primary_cat":"cs.MM","submitted_at":"2026-06-03T07:09:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Echo-Infinity replaces handcrafted KV-cache schedules with end-to-end optimized Memory Queries and a Unified Relative RoPE recipe to support real-time infinite video generation in diffusion 
transformers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02553","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation","primary_cat":"cs.CV","submitted_at":"2026-06-01T17:50:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LongLive-RAG formulates long video generation as retrieval-augmented generation by treating self-generated latents as a dynamic searchable history and adding a Window Temporal Delta Loss for better retrieval.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02436","ref_index":68,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Geometry-Aware Implicit Memory for Video World Models","primary_cat":"cs.CV","submitted_at":"2026-06-01T16:08:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GIM-World adds a camera-queryable geometry distillation head and pruning rule to implicit memory in video world models, claiming better long-horizon geometric consistency on the MIND benchmark than explicit and implicit baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30519","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-28T19:56:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OmniMem enables scalable long video generation via adaptive sparse KV retrieval that addresses local bias and union explosion while preserving 
explicit historical access.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30351","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:59:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"VideoMLA applies multi-head latent attention with 3D-RoPE decoupling to autoregressive video diffusion, delivering 92.7% KV memory reduction while matching short-horizon baselines and leading long-horizon VBench scores.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30349","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AdaState: Self-Evolving Anchors for Streaming Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:59:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AdaState replaces the static first-frame KV anchor with an evolving hidden latent that the model denoises alongside content, treating time as relative to enable recurrence and richer dynamics in streaming video generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30083","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Future Forcing: Future-aware Training-free KV Cache Policy for Autoregressive Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-28T15:30:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Future Forcing constructs a future query proxy from historical pre-RoPE statistics to score 
and merge KV tokens, improving subject consistency by up to 1.49 on VBench-Long for 60s AR video generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21028","ref_index":19,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-20T11:01:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DySink maintains a memory bank and retrieves relevant historical frames as dynamic sinks while using an anomaly gate to suppress collapse, yielding higher temporal quality and dynamic degree on minute-long videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18739","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:57:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18733","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:54:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"IAMFlow is a 
training-free identity-aware memory system that tracks entities via LLM global ID assignment and VLM frame verification to reduce identity drift in narrative long video generation from shifting prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15190","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:59:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAVEN aligns training and inference for causal autoregressive video diffusion via interleaved rollout repacking and introduces CM-GRPO for direct RL on consistency-model kernels, claiming better quality than recent baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14487","ref_index":66,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity","primary_cat":"cs.CV","submitted_at":"2026-05-14T07:27:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15911","ref_index":169,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Efficient Video Diffusion Models: Advancements and 
Challenges","primary_cat":"cs.CV","submitted_at":"2026-04-17T10:11:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"10958 https://arxiv.org/abs/2411.10958 [168] Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, and Jianfei Chen. 2025. SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention. arXiv:2509.24006 https://arxiv.org/abs/2509.24006 [169] Jintao Zhang, Jia Wei, Haofeng Huang, Pengle Zhang, Jun Zhu, and Jianfei Chen. 2025. SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration. arXiv:2410.02367 https://arxiv.org/abs/2410.02367 [170] Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jianfei Chen, and Jun Zhu. 2026. 
SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training."},{"citing_arxiv_id":"2603.28489","ref_index":100,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms","primary_cat":"eess.IV","submitted_at":"2026-03-30T14:23:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"MemFlow [98] also dynamically updates the memory bank by retrieving the most relevant historical frames with the text prompt of the current chunk. VideoSSM [99] introduces a global memory to absorb tokens evicted from the local window and relies on a state space model to recurrently compress them into a compact, fixed-size state. Other works such as Context as Memory [100], LoViC [101] (Figure 4), and Mixture of Contexts [102] refine how contexts are retrieved. Compared to spatial maps, compressed contexts are more flexible, but may struggle with precise geometric grounding. 
4) Implicit Model Memory: Implicit Model Memory embeds historical contexts directly into the model's weights via online updates (test-time training, TTT)."},{"citing_arxiv_id":"2602.07775","ref_index":106,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion","primary_cat":"cs.CV","submitted_at":"2026-02-08T02:16:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"adopt bidirectional attentions [71] and denoise all frames simultaneously. Therefore, though impressive, the generated videos are generally limited to short clips. In contrast, AR models [1,10,75,87-89] can in principle, infinitely predict next-state conditioned on prior ones. To marry the best of both paradigms, a rapidly growing number of AR video diffusion models [11,12,16,18,25-27,37,38,42,48,55,62,63,66,72,74,77,84,93-95,97,98,101,102,106,107,109,111,114] have emerged. Earlier methods, e.g., NOVA [17], SkyReels-V2 [13], and MAGI-1 [86] still rely on inefficient multi-step denoising in each AR generation step. Recently, Pyramid Flow [45] and CausVid [103-105] adopt few-step generation, making AR video generation temporally efficient. However, as the cached history grows longer, the demand of computational resources grows dramatically, which significantly con-"}],"limit":50,"offset":0}