PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

Bingjie Gao; Dahua Lin; Jiaqi Wang; Shuai Yang; Tong Wu; Ziwei Liu

arxiv: 2606.16449 · v2 · pith:MLAYEAJGnew · submitted 2026-06-15 · 💻 cs.CV

Shuai Yang, Bingjie Gao, Ziwei Liu, Jiaqi Wang, Dahua Lin, Tong Wu

Pith reviewed 2026-06-27 03:20 UTC · model grok-4.3

classification 💻 cs.CV

keywords consistent video generation · context memory · disentangled representation · edit-aware update · RGB and depth memory · long-term consistency · video editing · multi-modal fusion

The pith

A memory system disentangles video context into appearance and geometry to preserve consistency after edits

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a framework for generating consistent videos when scenes are edited over time. It separates the memory of a scene into one part that tracks appearance and color, and another that tracks only the 3D structure. An update strategy that accounts for edits ensures the memories stay current with new observations. A generation model then uses both memories to create new frames that match previous ones in both looks and layout. The goal is to fix the problem where edits make later video frames drift out of alignment, which current approaches cannot handle well for long sequences.

Core claim

The central contribution is a multi-modal context memory that disentangles spatial context into semantic appearance, held in an RGB context memory, and geometric structure, held in a depth context memory. An edit-aware memory update and retrieval strategy keeps memory evolution aligned with subsequent observations. On top of this, a memory-guided video generation model performs multi-modal feature fusion over mixed-modality contexts. The paper claims that this design maintains long-term semantic and structural consistency after edits and outperforms existing methods.

What carries the argument

Disentangled multi-modal context memory with separate RGB appearance bank and depth structure bank, plus edit-aware update and retrieval strategy
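
To make the two-bank idea concrete, here is a minimal sketch of a memory that stores an RGB entry and a depth entry per observed camera pose and retrieves the nearest entry from each bank for a target pose. Everything in it is an illustrative assumption rather than the paper's implementation: the `Entry` and `ContextMemory` names, the use of a 3D position as the pose, the string payloads standing in for frames or features, and nearest-pose retrieval by Euclidean distance.

```python
from dataclasses import dataclass, field
import math


@dataclass
class Entry:
    pose: tuple    # hypothetical camera position (x, y, z)
    payload: str   # stand-in for a stored frame or feature map
    step: int      # generation step at which the entry was written


@dataclass
class ContextMemory:
    rgb: list = field(default_factory=list)    # appearance (geometry only implicit)
    depth: list = field(default_factory=list)  # geometry-only structure

    def write(self, pose, rgb_payload, depth_payload, step):
        """Store one observation in both banks under the same pose."""
        self.rgb.append(Entry(pose, rgb_payload, step))
        self.depth.append(Entry(pose, depth_payload, step))

    def nearest(self, bank, pose, k=1):
        """Return the k entries of the named bank closest to the target pose."""
        entries = getattr(self, bank)
        return sorted(entries, key=lambda e: math.dist(e.pose, pose))[:k]


mem = ContextMemory()
mem.write((0.0, 0.0, 0.0), "rgb@0", "depth@0", step=0)
mem.write((5.0, 0.0, 0.0), "rgb@1", "depth@1", step=1)

target = (4.0, 0.0, 0.0)
refs = mem.nearest("rgb", target) + mem.nearest("depth", target)
print([e.payload for e in refs])  # ['rgb@1', 'depth@1']
```

Keeping the banks as separate lists means an edit can touch one modality without rewriting the other, which is the property the disentanglement is meant to provide.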

If this is right

Long-term semantic and structural consistency is maintained after edits to the scene
Subsequent video generations remain coherent across time and viewpoints following modifications
The method significantly outperforms state-of-the-art approaches in consistency metrics
Memory-guided generation enables multi-modal feature fusion under reference conditions from the memory banks

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Separating geometry from appearance could allow more robust handling of viewpoint changes in video synthesis
The approach might apply to other domains like image editing or 3D model generation where persistent context is needed
It implies that edit-aware memory management could reduce the accumulation of errors in autoregressive generation models

Load-bearing premise

Disentangling spatial context into separate semantic appearance and geometric structure memories with an edit-aware strategy will keep the memory aligned with observations without creating new inconsistencies or using outdated information

What would settle it

A test sequence involving multiple successive edits to a video scene followed by generation of many subsequent frames, checking if consistency holds or breaks in semantic content or geometric structure
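
One way such a test could be scored is sketched below: compare each revisited frame against the post-edit reference for the same viewpoint and report the first revisit whose PSNR drops below a threshold. This is a hedged illustration only. The `first_break` helper, the flat lists of pixel values in [0, 1], and the 30 dB threshold are assumptions chosen for the example, and a pixel metric like PSNR would cover structural fidelity only, not semantic consistency.

```python
import math


def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


def first_break(reference_frames, revisit_frames, threshold_db=30.0):
    """Index of the first revisit that falls below the threshold, else None."""
    for i, (ref, rev) in enumerate(zip(reference_frames, revisit_frames)):
        if psnr(ref, rev) < threshold_db:
            return i
    return None


# Three revisits of the same post-edit view (toy 4-pixel frames).
reference = [[0.5, 0.5, 0.5, 0.5]] * 3
revisits = [
    [0.5, 0.5, 0.5, 0.5],    # identical to the reference
    [0.51, 0.5, 0.5, 0.5],   # small deviation, about 46 dB
    [0.9, 0.5, 0.5, 0.5],    # large deviation, about 14 dB
]
print(first_break(reference, revisits))  # 2
```

A run in which `first_break` returns `None` over many revisits after several successive edits would support the claim. An early break would count against it.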

Figures

Figures reproduced from arXiv: 2606.16449 by Bingjie Gao, Dahua Lin, Jiaqi Wang, Shuai Yang, Tong Wu, Ziwei Liu.

**Figure 1.** We propose PermaVid, a framework for consistent video generation across edits. For global edits (e.g., style transformation), PermaVid propagates updated semantics consistently across time and viewpoints while maintaining stable geometry. For local edits (e.g., object-level editing), the model reliably recalls the post-edit content during revisiting, preserving both structural integrity and updated local s…

**Figure 2.** Overview of PermaVid. PermaVid maintains a disentangled multi-modal context memory with an RGB bank for semantic appearance and a depth bank for geometric structure. Given target camera poses and editing operations, it updates and retrieves memory in an edit-aware manner, then fuses mixed-modality references to guide consistent video generation across time, viewpoints, and edits.

**Figure 3.** Qualitative comparison under global edits. Under a global edit (e.g., style transformation), our method maintains stable geometric structure while consistently propagating the edited semantic appearance across time and viewpoints. Global edits alter the overall semantic appearance, while the underlying geometric structure should remain stable.

**Figure 4.** Qualitative comparison under local edits. Under a local edit, our method consistently recalls the edited region during revisiting while preserving the surrounding geometric structure. … the best performance in PSNR, SSIM, and LPIPS, indicating strong preservation of geometric structure under global semantic edits. It also significantly outperforms all baselines in semantic consistency (CLIP-Vid), reflecting …

**Figure 5.** Memory overhead profiling during long-duration generation. Left: component time ratios …

**Figure 6.** Ablation study on disentangled context memory. With disentangled context memory, the model consistently propagates updated global semantics after the edit while preserving stable geometry, whereas entangled RGB contexts reuse outdated semantics, leading to degraded global semantic consistency over time. … of each component and the absolute retrieval time throughout a long generation sequence with a largeloo…

**Figure 7.** Additional results under local edits, showing localized semantic updates with preserved …

**Figure 8.** Additional results under global edits, showing coherent propagation of global semantic …

Abstract

Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The disentangled RGB/depth memory plus edit-aware update is a distinct design choice for post-edit video consistency, but the abstract supplies zero metrics or experiment details so the performance claim cannot be assessed.

The new piece here is the explicit split into an RGB memory bank that keeps appearance while implicitly encoding geometry, paired with a separate depth bank for structure only, plus an edit-aware update/retrieval rule meant to drop or refresh outdated entries. Most prior memory modules in video generation are not described with that separation.

The approach targets a genuine pain point: once an edit changes layout or appearance, later frames often drift or contradict the change. The design tries to solve it by keeping the two modalities from contaminating each other and by conditioning generation on mixed-modality references.

The soft spot is obvious from the abstract alone. It asserts "significantly outperforming state-of-the-art methods" on long-term consistency but gives no numbers, no baselines, no ablations, and no description of the update equations or retrieval logic. Without those, the central assumption—that disentanglement plus edit awareness will actually prevent retention of pre-edit geometry or appearance drift—remains untested. The stress-test concern about partial invalidation or fusion creating new inconsistencies is therefore still live.

This is for people already working on memory-augmented video models or editing pipelines. A reader who wants to see whether the memory banks deliver measurable gains on standard consistency metrics would get value from the full paper; anyone looking for immediate usable results will not.

I would send it to peer review. The problem is practical, the architecture is different enough from cited memory designs to deserve referee input, and the authors appear to be engaging the literature honestly even if the current write-up is thin on evidence.

Referee Report

2 major / 1 minor

Summary. The paper proposes PermaVid, a framework for consistent video generation under editing operations. It introduces a multi-modal context memory that disentangles spatial context into an RGB memory bank (capturing semantic appearance while implicitly encoding geometry) and a depth memory bank (preserving geometry-only structure). An edit-aware memory update and retrieval strategy is claimed to keep memory evolution aligned with subsequent observations. A memory-guided video generation model performs multi-modal feature fusion under reference conditions from the mixed-modality memories. The central claim is that this design maintains strong long-term semantic and structural consistency after edits and significantly outperforms state-of-the-art methods.

Significance. If the empirical claims hold, the work would address a practical limitation in memory-based video generation and editing pipelines, where stored contexts become outdated after appearance or layout changes. The disentangled RGB/depth design and edit-aware strategy represent a targeted architectural contribution to long-term consistency, which could influence downstream applications in video synthesis if supported by rigorous validation.

major comments (2)

[Abstract] The central claim that 'Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods' is asserted without any quantitative metrics, baselines, ablation results, or experiment details. This is load-bearing for the primary contribution, as the soundness of the consistency claim cannot be evaluated from the provided description alone.
[Method] Method description (disentangled memory banks and edit-aware update): The update and retrieval strategy is described at a high level as keeping 'memory evolution aligned with subsequent observations,' but no equations, pseudocode, or explicit rules are supplied for how pre-edit geometry is invalidated in the depth bank or how appearance drift is prevented in the RGB bank after layout edits. Without these details, it is not possible to verify that the design avoids the exact inconsistencies the paper aims to solve.

minor comments (1)

[Abstract] The abstract refers to 'multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts' without clarifying the fusion mechanism or reference conditioning implementation.
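
For readers unsure what a fusion over mixed-modality references could look like, the sketch below shows one common pattern: attention-style weighting, where each reference feature is scored against a query and the references are averaged with softmax weights. The abstract does not say PermaVid uses this mechanism. The `fuse` function, the dot-product scoring, and the per-modality `modality_bias` term are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(query, references, modality_bias):
    """Attention-weighted average of reference features.

    references: list of (modality, feature_vector) pairs, where modality is
    a key of modality_bias (hypothetically "rgb" or "depth").
    """
    scores = [
        sum(q * f for q, f in zip(query, feat)) + modality_bias[mod]
        for mod, feat in references
    ]
    weights = softmax(scores)
    fused = [
        sum(w * feat[i] for w, (_, feat) in zip(weights, references))
        for i in range(len(query))
    ]
    return fused, weights


query = [1.0, 0.0]
references = [("rgb", [1.0, 0.0]), ("depth", [0.0, 1.0])]
fused, weights = fuse(query, references, {"rgb": 0.0, "depth": 0.0})
print([round(w, 3) for w in weights])  # [0.731, 0.269]
```

Raising the bias for one modality shifts weight toward it, which is one hypothetical way a model could lean on depth references when RGB references are out of date.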

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating where revisions will be made to strengthen the presentation.

Referee: [Abstract] The central claim that 'Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods' is asserted without any quantitative metrics, baselines, ablation results, or experiment details. This is load-bearing for the primary contribution, as the soundness of the consistency claim cannot be evaluated from the provided description alone.

Authors: Abstracts conventionally summarize key outcomes at a high level without embedding full metrics or experimental details, which are instead reported in the dedicated Experiments section. The full manuscript contains quantitative evaluations, baseline comparisons, and ablations that support the claim. We can expand the abstract slightly to reference the primary metrics if the editor deems it necessary. Revision: partial.

Referee: [Method] Method description (disentangled memory banks and edit-aware update): The update and retrieval strategy is described at a high level as keeping 'memory evolution aligned with subsequent observations,' but no equations, pseudocode, or explicit rules are supplied for how pre-edit geometry is invalidated in the depth bank or how appearance drift is prevented in the RGB bank after layout edits. Without these details, it is not possible to verify that the design avoids the exact inconsistencies the paper aims to solve.

Authors: We agree that the method description requires greater specificity. The revised manuscript will incorporate explicit equations and pseudocode for the edit-aware update and retrieval procedures, detailing the invalidation of pre-edit geometry in the depth bank and the mechanisms to prevent appearance drift in the RGB bank. Revision: yes.
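
As an illustration of the kind of explicit rule the referee asks for, the sketch below marks memory entries stale after an edit. It is a guess at one plausible policy, not the authors' procedure. The assumed rules are that a global edit invalidates every RGB entry, that a local edit invalidates only entries whose footprint overlaps the edited region, and that depth entries are touched only when the edit changes layout. The `Entry` class, the 1D interval footprints, and the `changes_layout` flag are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    region: tuple       # hypothetical (start, end) extent of the scene observed
    valid: bool = True  # False once an edit has made the entry stale


def overlaps(a, b):
    """True when two half-open intervals (start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def apply_edit(rgb_bank, depth_bank, kind, region=None, changes_layout=False):
    """Mark stale entries in place and return (stale_rgb, stale_depth) counts."""
    def hit(entry):
        return entry.valid and (kind == "global" or overlaps(entry.region, region))

    def invalidate(bank):
        count = 0
        for entry in bank:
            if hit(entry):
                entry.valid = False
                count += 1
        return count

    stale_rgb = invalidate(rgb_bank)  # appearance changed by the edit
    stale_depth = invalidate(depth_bank) if changes_layout else 0
    return stale_rgb, stale_depth


def make_bank():
    return [Entry((0, 10)), Entry((10, 20)), Entry((20, 30))]


# Global style edit: appearance is stale everywhere, geometry is kept.
global_result = apply_edit(make_bank(), make_bank(), kind="global")
# Local object removal over (12, 18): one overlapping entry per bank is stale.
local_result = apply_edit(make_bank(), make_bank(), kind="local",
                          region=(12, 18), changes_layout=True)
print(global_result, local_result)  # (3, 0) (1, 1)
```

A rule of this shape would let a reviewer check the two failure modes named in the report: pre-edit geometry surviving a layout change, and pre-edit appearance surviving a style change.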

Circularity Check

0 steps flagged

No circularity: novel design proposal with no derivation chain

The paper presents PermaVid as a new architectural framework consisting of disentangled RGB/depth memory banks plus an edit-aware update/retrieval strategy. No equations, fitted parameters, predictions, or derivation steps appear in the abstract or description. The central claim rests on the proposed design and experimental outcomes rather than any reduction to prior fitted quantities, self-citations, or self-definitional constructs. This is a standard case of an independent design contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the memory banks and update strategy are introduced as new design elements but lack implementation details for auditing.



arXiv 2024

[9] [9]

Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richard- son, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

Pith/arXiv arXiv 2024

[10] [10]

Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024

Pith/arXiv arXiv 2024

[11] [11]

Gen3c: 3d-informed world- consistent video generation with precise camera control

Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world- consistent video generation with precise camera control. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025

[12] [12]

Layerpano3d: Layered 3d panorama for hyper-immersive scene generation

Shuai Yang, Jing Tan, Mengchen Zhang, Tong Wu, Gordon Wetzstein, Ziwei Liu, and Dahua Lin. Layerpano3d: Layered 3d panorama for hyper-immersive scene generation. InProceedings of the special interest group on computer graphics and interactive techniques conference conference papers, pages 1–10, 2025

2025

[13] [13]

Context as memory: Scene-consistent interactive long video generation with memory retrieval.arXiv preprint arXiv:2506.03141, 2025

Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval.arXiv preprint arXiv:2506.03141, 2025

arXiv 2025

[14] [14]

Worldmem: Long-term consistent world simulation with memory.arXiv preprint arXiv:2504.12369, 2025

Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xin- gang Pan. Worldmem: Long-term consistent world simulation with memory.arXiv preprint arXiv:2504.12369, 2025

arXiv 2025

[15] [15]

Vmem: Consistent interactive video scene generation with surfel-indexed view memory.arXiv preprint arXiv:2506.18903, 2025

Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interactive video scene generation with surfel-indexed view memory.arXiv preprint arXiv:2506.18903, 2025

arXiv 2025

[16] [16]

Pretraining frame preservation in autoregressive video memory compression.arXiv preprint arXiv:2512.23851, 2025

Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, and Maneesh Agrawala. Pretraining frame preservation in autoregressive video memory compression.arXiv preprint arXiv:2512.23851, 2025. 10

Pith/arXiv arXiv 2025

[17] [17]

Frame context packing and drift prevention in next-frame-prediction video diffusion models

Lvmin Zhang, Shengqu Cai, Muyang Li, Gordon Wetzstein, and Maneesh Agrawala. Frame context packing and drift prevention in next-frame-prediction video diffusion models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[18] [18]

Corgi: Cached memory guided video generation

Xindi Wu, Uriel Singer, Zhaojiang Lin, Andrea Madotto, Xide Xia, Yifan Xu, Paul Crook, Xin Luna Dong, and Seungwhan Moon. Corgi: Cached memory guided video generation. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 4585–4594. IEEE, 2025

2025

[19] [19]

Scaling instruction-based video editing with a high-quality synthetic dataset.arXiv preprint arXiv:2510.15742, 2025

Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, et al. Scaling instruction-based video editing with a high-quality synthetic dataset.arXiv preprint arXiv:2510.15742, 2025

arXiv 2025

[20] [20]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

2023

[21] [21]

Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022

Pith/arXiv arXiv 2022

[22] [22]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

2022

[23] [23]

Make-a-video: Text-to-video generation without text-video data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. InInternational Conference on Learning Representations (ICLR), 2023

2023

[24] [24]

Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023

Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023

arXiv 2023

[25] [25]

Enhancing multi-text long video generation consistency without tuning: Time-frequency analysis, prompt alignment, and theory.arXiv preprint arXiv:2412.17254, 2024

Xingyao Li, Fengzhuo Zhang, Jiachun Pan, Yunlong Hou, Vincent YF Tan, and Zhuoran Yang. Enhancing multi-text long video generation consistency without tuning: Time-frequency analysis, prompt alignment, and theory.arXiv preprint arXiv:2412.17254, 2024

arXiv 2024

[26] [26]

The devil is in the prompts: Retrieval-augmented prompt optimization for text-to-video generation

Bingjie Gao, Xinyu Gao, Xiaoxue Wu, Yujie Zhou, Yu Qiao, Li Niu, Xinyuan Chen, and Yaohui Wang. The devil is in the prompts: Retrieval-augmented prompt optimization for text-to-video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3173–3183, 2025

2025

[27] [27]

Vista: A test-time self-improving video generation agent.arXiv preprint arXiv:2510.15831, 2025

Do Xuan Long, Xingchen Wan, Hootan Nakhost, Chen-Yu Lee, Tomas Pfister, and Ser- can Ö Arık. Vista: A test-time self-improving video generation agent.arXiv preprint arXiv:2510.15831, 2025

arXiv 2025

[28] [28]

Modelscope text-to-video technical report.arXiv preprint arXiv:2308.06571, 2023

Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report.arXiv preprint arXiv:2308.06571, 2023

Pith/arXiv arXiv 2023

[29] [29]

Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

Pith/arXiv arXiv 2023

[30] [30]

Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024

Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024

Pith/arXiv arXiv 2024

[31] [31]

Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

2024

[32] [32]

Streamingt2v: Consistent, dynamic, and extendable long video generation from text

Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tade- vosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2568–2577, 2025. 11

2025

[33] [33]

Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition.arXiv preprint arXiv:2506.17201, 2025

Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, and Qinglin Lu. Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition.arXiv preprint arXiv:2506.17201, 2025

arXiv 2025

[34] [34]

Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744, 2025

Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744, 2025

arXiv 2025

[35] [35]

Cambrian-s: Towards spatial supersensing in video

Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, et al. Cambrian-s: Towards spatial supersensing in video. arXiv preprint arXiv:2511.04670, 2025

Pith/arXiv arXiv 2025

[36] [36]

Vace: All-in-one video creation and editing.arXiv preprint arXiv:2503.07598, 2025

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing.arXiv preprint arXiv:2503.07598, 2025

Pith/arXiv arXiv 2025

[37] [37]

Sekai: A video dataset towards world exploration

Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, et al. Sekai: A video dataset towards world exploration. arXiv preprint arXiv:2506.15675, 2025

arXiv 2025

[38] [38]

Spatialvid: A large-scale video dataset with spatial annotations.arXiv preprint arXiv:2509.09676, 2025

Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, et al. Spatialvid: A large-scale video dataset with spatial annotations.arXiv preprint arXiv:2509.09676, 2025

arXiv 2025

[39] [39]

Dec 2017

Nigel Spivey.Epic Games, page 250–263. Dec 2017

2017

[40] [40]

Unrealzoo: Enriching photo-realistic virtual worlds for embodied ai

Fangwei Zhong, Kui Wu, Churan Wang, Hao Chen, Hai Ci, Zhoujun Li, and Yizhou Wang. Unrealzoo: Enriching photo-realistic virtual worlds for embodied ai. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5769–5779, 2025

2025

[41] [41]

Qwen3-vl technical report, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

2025

[42] [42]

Video depth anything: Consistent depth estimation for super-long videos

Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22831–22840, 2025

2025

[43] [43]

Matrix-game 2.0: An open-source real-time and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025

Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source real-time and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025

Pith/arXiv arXiv 2025

[44] [44]

Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

Pith/arXiv arXiv 2025

[45] [45]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

2024

[46] [46]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 12 Local Edits Edit Prompt: Remove fallen leaves from ...

2021