OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation

Lin Zhao; Pu Zhao; Yanzhi Wang; Yifan Gong; Yushu Wu

OmniMem performs sparse retrieval over the full historical KV cache to generate longer videos without the detail loss from truncation or compression.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-06-29 07:54 UTC pith:47CORIUG

Load-bearing objection: OmniMem packages three practical sparse KV heuristics for chunked AR video that target local bias and union explosion, but the single headline number leaves the actual gains hard to assess.

arXiv 2605.30519 v1 · pith:47CORIUG · submitted 2026-05-28 · cs.CV

classification cs.CV

keywords long video generation, autoregressive video models, KV cache retrieval, sparse attention, memory efficiency, video synthesis, dynamic degree, temporal consistency

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive video generation builds videos chunk by chunk but must repeatedly consult an expanding record of past computations stored in a KV cache. Existing solutions either discard older entries or fold them into a compressed form, both of which remove explicit access to details that may matter later. OmniMem keeps the entire cache and instead selects only a sparse subset of relevant past entries for each new chunk. Three targeted mechanisms counter the tendency of sparse selection to favor recent blocks and to produce overly large memory buffers. On long-video benchmarks the approach raises measured dynamic degree by 52.3 percent relative to strong baselines while holding consistency and memory footprint comparable.
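The contrast between truncation and full-range sparse retrieval can be illustrated with a minimal Python sketch. It is not the paper's implementation: the per-block relevance scores, the function names `truncate_window` and `sparse_retrieve`, and the block counts are all hypothetical, chosen only to show how a relevant old block survives retrieval but not truncation.

```python
def truncate_window(num_blocks, window):
    """Sliding-window truncation: keep only the most recent `window` block indices."""
    start = max(0, num_blocks - window)
    return list(range(start, num_blocks))


def sparse_retrieve(scores, budget):
    """Sparse retrieval: keep the `budget` highest-scoring blocks from the full history."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical relevance of 8 historical blocks to the current chunk (oldest first).
# Block 1 is an old block that happens to matter for the current query.
scores = [0.1, 0.9, 0.2, 0.1, 0.3, 0.2, 0.6, 0.7]

kept_by_truncation = truncate_window(len(scores), window=3)
kept_by_retrieval = sparse_retrieve(scores, budget=3)
```

Both strategies keep three blocks here, so the memory budget is the same; only the retrieval variant can still reach block 1.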

Core claim

OmniMem is an explicit full-range memory retrieval framework that performs sparse KV retrieval over the historical cache. Adaptive Window Exclusion removes local-window blocks from selection candidates once sufficient long-range history exists. Query-Shared KV Selection reduces cross-query diversity. Per-Head Scattered KV Access lets each attention head retrieve non-contiguous KV blocks according to its own pattern, avoiding union explosion in the selected buffer.
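A rough sketch of how Adaptive Window Exclusion might behave is given below. The abstract does not define what counts as "sufficient long-range history", so the rule used here (at least `budget` blocks outside the local window) is an assumption, as are the function name and the example scores.

```python
def adaptive_window_exclusion(scores, window, budget):
    """Pick `budget` block indices by score, skipping the local window when possible.

    scores: per-block relevance for the current chunk, oldest block first.
    window: number of most recent blocks already covered by local attention.
    budget: number of blocks the sparse selection may return.
    """
    n = len(scores)
    long_range = list(range(max(0, n - window)))
    # Assumption: "sufficient long-range history" means at least `budget`
    # blocks exist outside the local window. The paper may use another rule.
    if len(long_range) >= budget:
        candidates = long_range
    else:
        candidates = list(range(n))
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical scores with a strong recency bias: the last three blocks score highest.
scores = [0.2, 0.5, 0.1, 0.4, 0.3, 0.8, 0.9, 0.95]
selected = adaptive_window_exclusion(scores, window=3, budget=2)
short_history = adaptive_window_exclusion(scores[-4:], window=3, budget=2)
```

With enough history, the two selected blocks come from outside the local window even though recent blocks score higher. With only four blocks of history, the exclusion is skipped and plain top-k applies.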

What carries the argument

Sparse KV retrieval over the full historical cache, implemented through Adaptive Window Exclusion, Query-Shared KV Selection, and Per-Head Scattered KV Access.

Load-bearing premise

The three sparse-selection techniques can reliably locate and fetch the query-relevant historical details that truncation or compression would otherwise discard.

What would settle it

A controlled video sequence in which an early event required for later consistency is never selected by the retrieval mechanism, producing measurable drops in temporal coherence or dynamic degree.
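One way such a test could be instrumented is to log which historical blocks each new chunk retrieves and check whether the block holding the early event ever appears. The sketch below assumes a hypothetical log format (one set of block indices per generated chunk); nothing in the abstract says the authors record retrieval this way.

```python
def retrieval_hits(required_block, selections_per_chunk):
    """Return the chunk indices whose retrieved set contains `required_block`."""
    return [
        chunk_idx
        for chunk_idx, selected in enumerate(selections_per_chunk)
        if required_block in selected
    ]


# Hypothetical retrieval log: historical block indices selected for each new chunk.
retrieval_log = [{1, 2}, {2, 3}, {3, 4}]

never_selected = retrieval_hits(0, retrieval_log)
selected_twice = retrieval_hits(3, retrieval_log)
```

An empty result for the block that contains the early event is the condition the falsifier describes; it would then need to be paired with a measured drop in coherence or dynamic degree.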

If this is right

Longer video sequences become feasible at fixed memory budget because the full explicit history remains available.
Dynamic degree improves by 52.3 percent while consistency metrics stay strong.
Memory usage remains comparable to truncation or compression baselines.
Each attention head can follow its own non-contiguous retrieval pattern without expanding the selected buffer size.
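The last point can be made concrete with a small conceptual sketch. It compares the size of a merged buffer that holds every head's picks against per-head gathering, where each head reads only its own blocks. The kernel-level mechanism is not described in the abstract, so the function names, the string stand-ins for KV blocks, and the selections are hypothetical.

```python
def union_buffer_size(per_head_selection):
    """Blocks a shared buffer must hold if every head's picks are merged into one set."""
    merged = set()
    for blocks in per_head_selection:
        merged.update(blocks)
    return len(merged)


def per_head_gather(kv_blocks, per_head_selection):
    """Each head reads only its own non-contiguous blocks; no merged buffer is built."""
    return [[kv_blocks[i] for i in blocks] for blocks in per_head_selection]


# Hypothetical cache of 12 blocks and 3 heads, each with a budget of 2 blocks.
kv_blocks = [f"block{i}" for i in range(12)]
per_head_selection = [[0, 7], [3, 9], [5, 11]]

merged_size = union_buffer_size(per_head_selection)
gathered = per_head_gather(kv_blocks, per_head_selection)
```

When heads disagree completely, the merged buffer grows to heads times budget (6 blocks here), while each head still touches only its own 2 blocks under scattered access.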

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sparse-retrieval pattern could be tested on autoregressive models for long text or audio to check whether explicit memory access outperforms compression there as well.
If retrieval accuracy holds, training runs could avoid the need to lengthen context windows solely to capture distant dependencies.
Per-head scattered access suggests that hardware kernels optimized for irregular sparse loads may become performance-critical for scaling this style of generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

OmniMem packages three practical sparse KV heuristics for chunked AR video that target local bias and union explosion, but the single headline number leaves the actual gains hard to assess.

The main takeaway is that this paper identifies two concrete problems in sparse retrieval for autoregressive video—local bias in selection and union explosion across heads—and offers three targeted fixes: Adaptive Window Exclusion, Query-Shared KV Selection, and Per-Head Scattered KV Access. These keep explicit long-range KV access instead of falling back to truncation or compression.

What stands out is the framing. The authors spell out why standard sparse methods fall short once you move to chunk-based generation with growing history, and the three mechanisms are direct responses to those issues. Adaptive Window Exclusion looks like a sensible way to protect the budget for distant frames once enough history exists. The other two reduce diversity and buffer size without forcing every head to share the same blocks. That combination is new enough relative to prior truncation and compression work.

The reported 52.3% Dynamic Degree improvement with comparable memory is the empirical claim. If the full experiments back it with proper baselines, ablations, and variance, the techniques could be worth trying in other long-context video setups.

The soft spot is the lack of visible experimental grounding in the abstract. A single percentage figure without dataset details, run count, or component ablations makes it difficult to know whether the gain is robust or tied to a particular setup. The stress-test note says the argument is internally consistent, which is fair, but consistency alone does not replace seeing whether the selected KV actually carries the right information or whether the methods introduce new artifacts in consistency or quality.

This is for people already working on efficient long video generation who need ideas for managing KV growth without losing explicit access. A reader looking for reusable sparse attention patterns in chunked AR models could get value from the three heuristics.

It deserves a serious referee. The problem is real, the proposed solution is specific, and the paper is coherent on its own terms even if the evidence needs more scrutiny in review.

Referee Report

2 major / 0 minor

Summary. The paper proposes OmniMem, an explicit full-range memory retrieval framework for autoregressive chunk-based long video generation. It introduces three sparse KV retrieval mechanisms—Adaptive Window Exclusion (to counter local bias when long-range history is available), Query-Shared KV Selection (to reduce cross-query diversity), and Per-Head Scattered KV Access (to avoid union explosion by allowing per-head non-contiguous block selection)—that together enable query-relevant historical KV access without truncation or compression. Experiments report a 52.3% gain in Dynamic Degree over strong baselines while preserving consistency and comparable memory usage.

Significance. If the empirical results and the claim that the three mechanisms recover relevant historical details without information loss hold under scrutiny, the work would meaningfully advance scalable AR video generation by retaining explicit long-range access at manageable cost. This addresses a core scaling bottleneck and could influence subsequent memory-efficient video and multimodal generation systems.

major comments (2)

[Abstract] Abstract: the central claim of a 52.3% Dynamic Degree improvement is presented without any description of the baseline methods, dataset, number of videos or frames evaluated, variance across runs, or statistical significance; this single quantitative result is load-bearing for the paper's contribution and cannot be assessed from the given information.
[Abstract / Method] The manuscript's core assumption—that Adaptive Window Exclusion, Query-Shared KV Selection, and Per-Head Scattered KV Access together surface query-relevant historical KV without the loss incurred by truncation or compression—requires explicit supporting evidence (e.g., ablation tables isolating each component, attention visualizations, or retrieval-precision metrics) to be load-bearing; the abstract alone does not supply this verification.
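As an illustration of the reporting the first comment asks for, the sketch below computes a relative gain together with per-method mean and sample standard deviation across runs. The scores are placeholders invented for the example and are not results from the paper; the helper names are likewise hypothetical.

```python
from statistics import mean, stdev


def relative_gain(baseline, method):
    """Relative improvement of `method` over `baseline`, in percent."""
    return 100.0 * (method - baseline) / baseline


def summarize(values):
    """Mean and sample standard deviation of a metric across runs."""
    return mean(values), stdev(values)


# Placeholder Dynamic Degree scores for three seeds; NOT results from the paper.
baseline_runs = [40.0, 42.0, 38.0]
method_runs = [60.0, 62.0, 58.0]

base_mean, base_std = summarize(baseline_runs)
meth_mean, meth_std = summarize(method_runs)
gain = relative_gain(base_mean, meth_mean)
```

Reporting the spread alongside the headline percentage lets a reader judge whether a relative gain is large compared with run-to-run variation.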

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the need for stronger verification of our core claims. We address each major comment below and will revise the manuscript to improve self-containment and evidence presentation.

Referee: [Abstract] Abstract: the central claim of a 52.3% Dynamic Degree improvement is presented without any description of the baseline methods, dataset, number of videos or frames evaluated, variance across runs, or statistical significance; this single quantitative result is load-bearing for the paper's contribution and cannot be assessed from the given information.

Authors: We agree the abstract should be more self-contained for the load-bearing quantitative claim. The experimental section details the baselines (strong AR video generation methods with KV cache management), the long-video dataset, evaluation scale (multiple videos with extended frame counts), and reports averaged results with variance. In revision we will expand the abstract to concisely include these elements (baselines, dataset, scale, and note on averaging/variance) while preserving length limits.

revision: yes

Referee: [Abstract / Method] The manuscript's core assumption—that Adaptive Window Exclusion, Query-Shared KV Selection, and Per-Head Scattered KV Access together surface query-relevant historical KV without the loss incurred by truncation or compression—requires explicit supporting evidence (e.g., ablation tables isolating each component, attention visualizations, or retrieval-precision metrics) to be load-bearing; the abstract alone does not supply this verification.

Authors: The overall experimental results (52.3% Dynamic Degree gain with preserved consistency and comparable memory) provide empirical support for the combined mechanisms recovering relevant history without truncation/compression loss. Component-wise ablations, attention visualizations, and retrieval analysis appear in the experiments and supplementary sections. To make this verification more explicit and tied to the abstract claim, we will add or highlight ablation tables isolating each mechanism, plus attention/retrieval-precision figures in the main paper during revision.

revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The manuscript describes an explicit sparse-retrieval framework (Adaptive Window Exclusion, Query-Shared KV Selection, Per-Head Scattered KV Access) whose design choices are stated directly and whose performance is reported solely as empirical outcomes on long-video generation benchmarks. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the supplied text. The central claim therefore rests on observable experimental deltas rather than any definitional or self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities identifiable. No equations or modeling assumptions stated beyond the high-level problem description.

pith-pipeline@v0.9.1-grok · 5738 in / 1005 out tokens · 25834 ms · 2026-06-29T07:54:59.348268+00:00 · methodology

Original abstract

Autoregressive (AR) video generation extends videos by producing latent chunks sequentially, but scaling to long videos requires repeated access to a growing historical KV cache. Existing methods reduce this cost by truncating the KV cache or compressing it into implicit memory, but both lose explicit access to query-relevant historical details. We propose OmniMem, an explicit full-range memory retrieval framework that performs sparse KV retrieval over the historical cache. To make this practical for chunk-based AR video generation, OmniMem addresses two issues: (i) local bias in sparse KV selection and (ii) Union Explosion in memory access. Adaptive Window Exclusion removes local-window blocks from the selection candidates when sufficient long-range history is available, preserving the sparse budget for informative long-range retrieval. Query-Shared KV Selection reduces cross-query diversity, while Per-Head Scattered KV Access avoids expanding head-specific selections into a large selected KV buffer. This allows each attention head to retrieve non-contiguous KV blocks according to its own selection pattern. Experiments on long-video generation show that OmniMem improves Dynamic Degree by 52.3% and preserves strong consistency over strong baselines, while maintaining comparable memory usage.
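The idea behind Query-Shared KV Selection can be sketched as follows. The abstract does not say how scores are pooled across queries, so averaging within a group is an assumption made for illustration, and the function names and example scores are hypothetical.

```python
def query_shared_topk(scores_per_query, budget):
    """Select one shared set of blocks for a group of queries.

    scores_per_query: list of per-query score lists, all of the same length.
    Assumption: scores are pooled by averaging across the group; the paper
    may pool differently.
    """
    num_blocks = len(scores_per_query[0])
    pooled = [
        sum(q[b] for q in scores_per_query) / len(scores_per_query)
        for b in range(num_blocks)
    ]
    ranked = sorted(range(num_blocks), key=lambda b: pooled[b], reverse=True)
    return sorted(ranked[:budget])


def independent_topk_union(scores_per_query, budget):
    """Baseline: each query picks its own top blocks; return the union."""
    union = set()
    for q in scores_per_query:
        ranked = sorted(range(len(q)), key=lambda b: q[b], reverse=True)
        union.update(ranked[:budget])
    return sorted(union)


# Hypothetical scores for 3 queries over 6 historical blocks.
group = [
    [0.9, 0.1, 0.5, 0.0, 0.2, 0.1],
    [0.1, 0.8, 0.5, 0.0, 0.2, 0.1],
    [0.1, 0.1, 0.6, 0.7, 0.2, 0.1],
]
shared = query_shared_topk(group, budget=2)
separate = independent_topk_union(group, budget=2)
```

In this toy case independent selection touches four distinct blocks for a budget of two, while the shared selection stays at two, which is the reduction in cross-query diversity the abstract refers to.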

Figures

Figures reproduced from arXiv: 2605.30519 by Lin Zhao, Pu Zhao, Yanzhi Wang, Yifan Gong, Yushu Wu.

**Figure 1.** OmniMem preserves object identity while maintaining rich motion in long video generation. SWA shows object drift, and Sink-SWA produces repetitive motion.

**Figure 2.** Overview of OmniMem. Current-chunk queries attend to recent KV blocks, pooled historical KV blocks, and retrieved full-resolution KV blocks through sliding-window, compression, and selection attention, respectively. The right panels summarize the key retrieval and access designs: filtering near-window candidates before Top-K selection, sharing Top-K selection within query groups, and accessing per-head blo…

**Figure 3.** Local bias and Union Explosion in selection attention. (a) Top-K selection focuses near the current chunk without AWE, and shifts to long-range blocks with AWE. (b) Different query chunks in one head select different blocks. (c) Different heads also select different regions. (b) and (c) together cause Union Explosion. Note that each chunk contains a number of tokens (e.g., 4-5K).

**Figure 4.** Qualitative comparison on long-video generation. Red boxes highlight repetitive frames where LongLive [18] collapses back to early content. Full videos and additional results are provided in the supplementary material.

**Figure 5.** Memory access scalability. Naive Sparse reduces latency but loses the memory benefit due to Union Explosion. OmniMem maintains memory usage nearly constant while remaining efficient.

… common KV selection. Sharing the selection across all 12 heads significantly degrades all metrics, and even a moderate size of Gh = 3 still leaves a clear gap to per-head selection. This indicates that different attention head…

Reference graph

Works this paper leans on

42 extracted references · 28 canonical work pages · 17 internal anchors

[1] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
[2] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-Sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.
[3] Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-Sora Plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024.
[4] Xuan Shen, Weize Ma, Yufa Zhou, Enhao Tang, Yanyue Xie, Zhengang Li, Yifan Gong, Quanyi Wang, Henghui Ding, Yiwei Wang, Pu Zhao, Jun Lin, and Jiuxiang Gu. Fastcar: Cache attentive replay for fast auto-regressive video generation on the edge. In The Fourteenth International Conference on Learning Representations, 2026.
[5] Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, and Jiuxiang Gu. Draftattention: Fast video diffusion via low-resolution attention guidance. arXiv preprint arXiv:2505.14708, 2025.
[6] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
[7] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, et al. Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.
[8] Lin Zhao, Yushu Wu, Aleksei Lebedev, Dishani Lahiri, Meng Dong, Arpit Sahni, Michael Vasilkovsky, Hao Chen, Ju Hu, Aliaksandr Siarohin, et al. S2dit: Sandwich diffusion transformer for mobile streaming video generation. arXiv preprint arXiv:2601.12719, 2026.
[9] Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, and Xihui Liu. Loong: Generating minute-level long videos with autoregressive language models. arXiv preprint arXiv:2410.02757, 2024.
[11] Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024.
[12] Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22963–22974, 2025.
[13] Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, and Hao Chen. Gamegen-x: Interactive open-world game video generation. In International Conference on Learning Representations, 2025.
[14] Chaorui Deng, Deyao Zhu, Kunchang Li, Shi Guang, and Haoqi Fan. Causal diffusion transformers for generative modeling. arXiv preprint arXiv:2412.12095, 2024.
[15] Yushu Wu, Yanyu Li, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ke Ma, Arpit Sahni, Ju Hu, Aliaksandr Siarohin, Dhritiman Sagar, et al. Taming diffusion transformer for efficient mobile video generation in seconds. arXiv preprint arXiv:2507.13343, 2025.
[16] Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, and Yinghao Xu. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026.
[17] Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Si... Genie: Generative interactive environments. Forty-first International Conference on Machine Learning, 2024.
[18] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, and Yukang Chen. LongLive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025.
[19] Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling Forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161, 2025.
[20] Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan, and Hengshuang Zhao. Memflow: Flowing adaptive memory for consistent and efficient long video narratives. arXiv preprint arXiv:2512.14699, 2025.
[21] Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, et al. Videossm: Autoregressive long video generation with hybrid state-space memory. arXiv preprint arXiv:2512.04519, 2025.
[22] Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, and Maneesh Agrawala. Pretraining frame preservation in autoregressive video memory compression. arXiv preprint arXiv:2512.23851, 2025. (Pith work page title: TinyHistory: Lightweight Video History Embeddings via Two-Stage Context Learning.)
[23] Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Yuxing Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23078–23097, 2025.
[24] Sand.ai, Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W. Q. Zhang, Weifeng Luo, Xiaoyang Kang, Yuchen Sun, Yue Cao, Yunpeng Huang, Yutong Lin, Yuxin Fang, Zewei Tao, Zheng Zhang, Zhongshu Wang, Zixun Liu, Dai Shi, Guoli Su, Hanwen Sun, Hong Pan, Jie Wang, Jiexin Sheng, Min Cui, Min Hu, Ming Yan, Shuchen... 2025.
[25] Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, and Yahui Zhou. Skyreels-v2: Infinite-length film generative model, 2025.
[26] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self Forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.
[27] Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-Forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025.
[28] Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang, Nanye Ma, Hansheng Chen, Maneesh Agrawala, Leonidas Guibas, Gordon Wetzstein, and Arash Vahdat. Mode seeking meets mean seeking for fast long video generation. In arXiv, 2026.
[29] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023.
[30] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024.
[31] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. Advances in Neural Information Processing Systems, 37, 2024.
[32] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.
[33] Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, et al. Quant VideoGen: Auto-regressive long video generation via 2-bit KV-cache quantization. arXiv preprint arXiv:2602.02958, 2026.
[34] Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang, Fan Wang, and Bohan Zhuang. Blockvid: Block diffusion for high-quality and consistent minute-long video generation. arXiv preprint arXiv:2511.22973, 2025. (Pith work page title: BIFE: Better Interaction, Fewer Errors for Minute-Long Video Generation.)
[35] Hang Guo, Zhaoyang Jia, Jiahao Li, Bin Li, Yuanhao Cai, Jiangshan Wang, Yawei Li, and Yan Lu. Efficient autoregressive video diffusion with dummy head. arXiv preprint arXiv:2601.20499, 2026.
[36] Xu Yang et al. Past- and future-informed KV cache policy with salience estimation in autoregressive video diffusion. arXiv preprint arXiv:2601.21896, 2026.
[37] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.
[38] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T... Wan: Open and advanced large-scale video generative models, 2025.
[39] Wenhao Wang and Yi Yang. Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. Advances in Neural Information Processing Systems, 37:65618–65642, 2024.
[40] Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[41] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.
[42] Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. arXiv preprint arXiv:2412.14169, 2024.
[43] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

Limitation and Broader Impact. OmniMem is evaluated on a single open-sourced DiT backbone, Wan2.1-T2V-1.3B, aligned with recent works. This controlled setting helps isolate the effect ...

[1] [1]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

Open-Sora: Democratizing Efficient Video Production for All

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all.arXiv preprint arXiv:2412.20404, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-Sora Plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024.

[4] Xuan Shen, Weize Ma, Yufa Zhou, Enhao Tang, Yanyue Xie, Zhengang Li, Yifan Gong, Quanyi Wang, Henghui Ding, Yiwei Wang, Pu Zhao, Jun Lin, and Jiuxiang Gu. Fastcar: Cache attentive replay for fast auto-regressive video generation on the edge. In The Fourteenth International Conference on Learning Representations, 2026.

[5] Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, and Jiuxiang Gu. Draftattention: Fast video diffusion via low-resolution attention guidance. arXiv preprint arXiv:2505.14708, 2025.

[6] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

[7] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, et al. Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.

[8] Lin Zhao, Yushu Wu, Aleksei Lebedev, Dishani Lahiri, Meng Dong, Arpit Sahni, Michael Vasilkovsky, Hao Chen, Ju Hu, Aliaksandr Siarohin, et al. S2dit: Sandwich diffusion transformer for mobile streaming video generation. arXiv preprint arXiv:2601.12719, 2026.

[9] Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, and Xihui Liu. Loong: Generating minute-level long videos with autoregressive language models. arXiv preprint arXiv:2410.02757, 2024.

[11] Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024.

[12] Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22963–22974, 2025.

[13] Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, and Hao Chen. Gamegen-x: Interactive open-world game video generation. In International Conference on Learning Representations, 2025.

[14] Chaorui Deng, Deyao Zhu, Kunchang Li, Shi Guang, and Haoqi Fan. Causal diffusion transformers for generative modeling. arXiv preprint arXiv:2412.12095, 2024.

[15] Yushu Wu, Yanyu Li, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ke Ma, Arpit Sahni, Ju Hu, Aliaksandr Siarohin, Dhritiman Sagar, et al. Taming diffusion transformer for efficient mobile video generation in seconds. arXiv preprint arXiv:2507.13343, 2025.

[16] Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, and Yinghao Xu. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026.

[17] Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Si... Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024.

[18] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, and Yukang Chen. LongLive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025.

[19] Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161, 2025.

[20] Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan, and Hengshuang Zhao. Memflow: Flowing adaptive memory for consistent and efficient long video narratives. arXiv preprint arXiv:2512.14699, 2025.

[21] Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, et al. Videossm: Autoregressive long video generation with hybrid state-space memory. arXiv preprint arXiv:2512.04519, 2025.

[22] Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, and Maneesh Agrawala. Pretraining frame preservation in autoregressive video memory compression. arXiv preprint arXiv:2512.23851, 2025. (Work page title: "TinyHistory: Lightweight Video History Embeddings via Two-Stage Context Learning".)

[23] Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Yuxing Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23078–23097, 2025.

[24] Sand.ai, Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W. Q. Zhang, Weifeng Luo, Xiaoyang Kang, Yuchen Sun, Yue Cao, Yunpeng Huang, Yutong Lin, Yuxin Fang, Zewei Tao, Zheng Zhang, Zhongshu Wang, Zixun Liu, Dai Shi, Guoli Su, Hanwen Sun, Hong Pan, Jie Wang, Jiexin Sheng, Min Cui, Min Hu, Ming Yan, Shuchen... 2025

[25] Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, and Yahui Zhou. Skyreels-v2: Infinite-length film generative model, 2025.

[26] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.

[27] Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025.

[28] Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang, Nanye Ma, Hansheng Chen, Maneesh Agrawala, Leonidas Guibas, Gordon Wetzstein, and Arash Vahdat. Mode seeking meets mean seeking for fast long video generation. In arXiv, 2026.

[29] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023.

[30] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024.

[31] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems, 37, 2024.

[32] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.

[33] Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, et al. Quant VideoGen: Auto-regressive long video generation via 2-bit KV-cache quantization. arXiv preprint arXiv:2602.02958, 2026.

[34] Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang, Fan Wang, and Bohan Zhuang. Blockvid: Block diffusion for high-quality and consistent minute-long video generation. arXiv preprint arXiv:2511.22973, 2025. (Work page title: "BIFE: Better Interaction, Fewer Errors for Minute-Long Video Generation".)

[35] Hang Guo, Zhaoyang Jia, Jiahao Li, Bin Li, Yuanhao Cai, Jiangshan Wang, Yawei Li, and Yan Lu. Efficient autoregressive video diffusion with dummy head. arXiv preprint arXiv:2601.20499, 2026.

[36] Xu Yang et al. Past- and future-informed KV cache policy with salience estimation in autoregressive video diffusion. arXiv preprint arXiv:2601.21896, 2026.

[37] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.

[38] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T... Wan: Open and advanced large-scale video generative models, 2025.

[39] Wenhao Wang and Yi Yang. Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. Advances in Neural Information Processing Systems, 37:65618–65642, 2024.

[40] Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

[41] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.

[42] Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. arXiv preprint arXiv:2412.14169, 2024.

[43] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

Limitation and Broader Impact

OmniMem is evaluated on a single open-sourced DiT backbone, Wan2.1-T2V-1.3B, aligned with recent works. This controlled setting helps isolate the effect ...

[Figure text: "A dynamic over-the-shoulder perspective of a chef meticulously plating a dish in a bustling kitchen"]