Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
Pith reviewed 2026-05-10 18:28 UTC · model grok-4.3
The pith
Grounded Forcing uses three interlocking mechanisms to maintain semantic anchors and suppress drift in autoregressive video synthesis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Grounded Forcing bridges time-independent semantics and proximal dynamics through three mechanisms: a Dual Memory KV Cache that separates local dynamics from global semantic anchors; Dual-Reference RoPE Injection, which keeps positional embeddings inside the training manifold while making global semantics time-invariant; and Asymmetric Proximity Recache, which enables smooth semantic inheritance across prompt transitions. Together, these components tether the generative process to stable semantic cores while accommodating flexible local dynamics.
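A minimal sketch of how such a split cache could look, assuming per-frame key/value tensors of shape (heads, tokens, dim) and a simple "promote the earliest frames to permanent anchors" rule; the class name and these details are illustrative assumptions, not the paper's specification.

```python
# Sketch of a dual-memory KV cache: a small set of frozen semantic anchors
# plus a bounded rolling window of recent frames. Anchor selection here is
# an assumption for illustration.
from collections import deque

import torch


class DualMemoryKVCache:
    def __init__(self, local_window: int, num_anchors: int):
        self.anchors: list[tuple[torch.Tensor, torch.Tensor]] = []  # global anchors, never evicted
        self.local = deque(maxlen=local_window)                     # rolling cache of recent frames
        self.num_anchors = num_anchors

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # Early frames become permanent anchors; later frames only pass through
        # the bounded local window, so the two memories stay decoupled.
        if len(self.anchors) < self.num_anchors:
            self.anchors.append((k, v))
        else:
            self.local.append((k, v))

    def context(self) -> tuple[torch.Tensor, torch.Tensor]:
        # Attention context for the next frame: anchors first, then local dynamics.
        entries = self.anchors + list(self.local)
        keys = torch.cat([k for k, _ in entries], dim=1)
        values = torch.cat([v for _, v in entries], dim=1)
        return keys, values
```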
What carries the argument
Grounded Forcing itself: a framework of three interlocking mechanisms that decouples local temporal dynamics from global semantic anchors, confines positional embeddings to the training manifold, and weights cache updates by temporal proximity.
If this is right
- Long-range consistency improves because semantic anchors remain accessible across extended contexts.
- Visual drift is suppressed as positional embeddings stay within the training manifold.
- Controllability is preserved during interactive prompt changes via proximity-weighted cache updates (sketched after this list).
- The method provides a foundation for interactive long-form video synthesis without resetting the generation state.
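One way to picture proximity-weighted cache updates at a prompt switch, assuming an exponential proximity weight and a linear blend between the inherited cache and entries re-encoded under the new prompt; the paper only states that updates are weighted by proximity to the transition, so the specific form here is a guess.

```python
# Sketch of proximity-weighted recaching: frames near the prompt switch adopt
# the new prompt's encoding almost fully, distant frames keep their inherited
# semantics. Decay rate and linear blend are illustrative assumptions.
import math


def proximity_recache(old_kv, new_kv, switch_index: int, decay: float = 0.5):
    """Blend the inherited cache with entries re-encoded under the new prompt.

    old_kv / new_kv: lists of (key, value) tensors, one pair per cached frame.
    """
    blended = []
    for i, ((ko, vo), (kn, vn)) in enumerate(zip(old_kv, new_kv)):
        w = math.exp(-decay * abs(i - switch_index))  # proximity weight in (0, 1]
        blended.append((w * kn + (1.0 - w) * ko, w * vn + (1.0 - w) * vo))
    return blended
```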
Where Pith is reading between the lines
- The same decoupling of anchors from dynamics could reduce error accumulation in other autoregressive domains such as audio or text.
- If the mechanisms prove additive, hybrid models might combine them with existing diffusion or flow-based video methods for further gains.
- Practical deployment would benefit from testing on real-time interaction loops where users issue repeated instructions.
Load-bearing premise
The three mechanisms operate together without creating new artifacts, efficiency costs, or failure modes during implementation.
What would settle it
Generate long video sequences with and without the full set of three mechanisms and measure whether identity consistency and visual stability degrade faster in the baseline after several hundred frames or prompt switches.
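A sketch of that check, using hypothetical generate() and embed() helpers and cosine similarity to an early reference frame as one simple proxy for identity consistency over a long horizon.

```python
# Sketch of a long-horizon identity-drift measurement. embed() is a hypothetical
# frame feature extractor (e.g. a CLIP-style encoder) passed in by the caller.
import numpy as np


def identity_drift_curve(frames, embed, reference_index: int = 0) -> np.ndarray:
    """Per-frame cosine similarity to a reference frame's embedding."""
    feats = np.stack([embed(f) for f in frames])                    # (T, D)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return feats @ feats[reference_index]                           # (T,) similarities


# Hypothetical usage: compare the full method against a baseline without the
# three mechanisms, over several hundred frames and repeated prompt switches.
# for name, model in {"grounded_forcing": full, "baseline": ablated}.items():
#     frames = generate(model, num_frames=600, prompt_switch_every=100)
#     print(name, identity_drift_curve(frames, embed)[-100:].mean())
```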
Original abstract
Autoregressive video synthesis offers a promising pathway for infinite-horizon generation but is fundamentally hindered by three intertwined challenges: semantic forgetting from context limitations, visual drift due to positional extrapolation, and controllability loss during interactive instruction switching. Current methods often tackle these issues in isolation, limiting long-term coherence. We introduce Grounded Forcing, a novel framework that bridges time-independent semantics and proximal dynamics through three interlocking mechanisms. First, to address semantic forgetting, we propose a Dual Memory KV Cache that decouples local temporal dynamics from global semantic anchors, ensuring long-term semantic coherence and identity stability. Second, to suppress visual drift, we design Dual-Reference RoPE Injection, which confines positional embeddings within the training manifold while rendering global semantics time-invariant. Third, to resolve controllability issues, we develop Asymmetric Proximity Recache, which facilitates smooth semantic inheritance during prompt transitions via proximity-weighted cache updates. These components operate synergistically to tether the generative process to stable semantic cores while accommodating flexible local dynamics. Extensive experiments demonstrate that Grounded Forcing significantly enhances long-range consistency and visual stability, establishing a robust foundation for interactive long-form video synthesis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Grounded Forcing, a framework for autoregressive video synthesis that addresses semantic forgetting, visual drift, and controllability loss via three interlocking mechanisms: Dual Memory KV Cache (decoupling local temporal dynamics from global semantic anchors), Dual-Reference RoPE Injection (confining positional embeddings within the training manifold), and Asymmetric Proximity Recache (enabling smooth semantic inheritance during prompt transitions). The central claim is that these components operate synergistically to maintain long-term semantic coherence and identity stability while supporting flexible local dynamics, with extensive experiments demonstrating significant gains in long-range consistency and visual stability for interactive long-form video synthesis.
Significance. If the claims hold after addressing the noted gaps, the work could offer a practical approach to improving coherence in infinite-horizon autoregressive video models, a persistent challenge in the field. The emphasis on bridging time-independent semantics with proximal dynamics might provide a useful template for future interactive video systems, though its impact would depend on clear isolation of each mechanism's contribution.
major comments (1)
- [§4.2 and §4.3] The experiments describe each of the three mechanisms separately and report only aggregate metrics on long-range consistency and visual stability. No ablation studies are presented that disable one mechanism at a time while holding the others fixed. This is load-bearing for the central claim, which asserts synergistic operation without new artifacts or failure modes (e.g., cache staleness or RoPE manifold violations); without such controls, it remains possible that gains derive from one or two components alone.
minor comments (1)
- [Abstract] The claim of 'extensive experiments' is stated without any summary of key quantitative results, baselines, or specific metrics; adding a concise statement of the strongest empirical findings would improve clarity for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We agree that isolating the individual contributions of each mechanism through targeted ablations is essential to substantiate the claim of synergistic operation. We address this below and will revise the paper accordingly.
Point-by-point responses
- Referee: [§4.2 and §4.3] The experiments describe each of the three mechanisms separately and report only aggregate metrics on long-range consistency and visual stability. No ablation studies are presented that disable one mechanism at a time while holding the others fixed. This is load-bearing for the central claim, which asserts synergistic operation without new artifacts or failure modes (e.g., cache staleness or RoPE manifold violations); without such controls, it remains possible that gains derive from one or two components alone.
  Authors: We acknowledge that the current experiments in Sections 4.2 and 4.3 describe the mechanisms individually but rely on aggregate metrics, without full ablations that disable one component while holding the others fixed. This leaves open the possibility that the observed gains stem primarily from a subset of the mechanisms. In the revised manuscript, we will add dedicated ablation studies that systematically disable the Dual Memory KV Cache, Dual-Reference RoPE Injection, and Asymmetric Proximity Recache in turn. These will include quantitative metrics on long-range consistency and visual stability, as well as qualitative checks for introduced artifacts such as cache staleness or RoPE manifold violations. The results will be presented alongside the existing aggregate results to clarify the synergistic effects.
  Revision: yes
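The leave-one-out design the rebuttal commits to can be written as a small configuration grid; build_model() and evaluate() below are hypothetical stand-ins, and the flag names are illustrative.

```python
# Sketch of a leave-one-out ablation grid: each configuration disables exactly
# one mechanism while holding the other two fixed.
MECHANISMS = ("dual_memory_kv_cache", "dual_reference_rope", "proximity_recache")


def ablation_configs():
    full = {m: True for m in MECHANISMS}
    yield "full", dict(full)
    for m in MECHANISMS:
        cfg = dict(full)
        cfg[m] = False                     # disable exactly one mechanism
        yield f"without_{m}", cfg


# Hypothetical usage with stand-in constructor and metric functions:
# for name, flags in ablation_configs():
#     model = build_model(**flags)
#     print(name, evaluate(model, metrics=("long_range_consistency",
#                                          "visual_stability")))
```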
Circularity Check
No derivation chain or self-referential reductions present; framework is purely descriptive.
Full rationale
The manuscript introduces Grounded Forcing as a high-level framework with three named mechanisms (Dual Memory KV Cache, Dual-Reference RoPE Injection, Asymmetric Proximity Recache) whose synergistic operation is asserted without equations, fitted parameters, predictions, or derivations. No step reduces a claimed result to its own inputs by construction, no self-citation is invoked as a uniqueness theorem, and no ansatz or renaming of known results occurs. The experimental claims rest on aggregate metrics rather than any closed loop, rendering the presentation self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Decoupling local temporal dynamics from global semantic anchors via separate KV caches preserves long-term coherence.
- domain assumption: Reference-based injection of positional embeddings can confine them to the training manifold (one possible reading is sketched below).
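One plausible reading of the second assumption: anchor tokens always receive a fixed reference position while local tokens are indexed relative to the current window and clamped to the trained context length, so no rotary index ever leaves the training range. The function and its arguments are illustrative, not the paper's formulation.

```python
# Sketch of dual-reference rotary positions: anchors at a fixed reference,
# local tokens window-relative and clamped to the trained context length.
import torch


def dual_reference_positions(anchor_len: int, local_len: int,
                             max_train_pos: int) -> torch.Tensor:
    anchor_pos = torch.zeros(anchor_len, dtype=torch.long)              # time-invariant anchor reference
    local_pos = anchor_len + torch.arange(local_len, dtype=torch.long)  # window-relative local reference
    local_pos = local_pos.clamp(max=max_train_pos - 1)                  # never extrapolate past training
    return torch.cat([anchor_pos, local_pos])
```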
invented entities (3)
- Dual Memory KV Cache: no independent evidence
- Dual-Reference RoPE Injection: no independent evidence
- Asymmetric Proximity Recache: no independent evidence
Forward citations
Cited by 1 Pith paper
- Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation. Delta Forcing uses latent trajectory deltas to adaptively limit unreliable teacher guidance while enforcing monotonic continuity, improving temporal consistency in interactive autoregressive video generation.
Reference graph
Works this paper leans on
- [1] Bai, C., Chen, J., Bai, X., Chen, Y., She, Q., Lu, M., Zhang, S.: Uniedit-i: Training-free image editing for unified VLM via iterative understanding, editing and verifying. arXiv preprint arXiv:2508.03142 (2025)
- [2] Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems 37, 24081–24125 (2024)
- [3] Chen, G., Lin, D., Yang, J., Lin, C., Zhu, J., Fan, M., Zhang, H., Chen, S., Chen, Z., Ma, C., et al.: SkyReels-V2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074 (2025)
- [4] Chen, J., Hao, A., Chen, X., Bai, C., Chen, C., Li, Y., Wu, J., Chu, X., Zhang, S.: Conceptweaver: Weaving disentangled concepts with flow. arXiv preprint arXiv:2603.28493 (2026)
- [5] Chen, S., Wei, C., Sun, S., Nie, P., Zhou, K., Zhang, G., Yang, M.H., Chen, W.: Context forcing: Consistent autoregressive video generation with long context. arXiv preprint arXiv:2602.06028 (2026)
- [6] Cui, J., Wu, J., Li, M., Yang, T., Li, X., Wang, R., Bai, A., Ban, Y., Hsieh, C.J.: Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283 (2025)
- [7] Guo, Y., Yang, C., He, H., Zhao, Y., Wei, M., Yang, Z., Huang, W., Lin, D.: End-to-end training for autoregressive video diffusion via self-resampling. arXiv preprint arXiv:2512.15702 (2025)
- [8] HaCohen, Y., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., Levin, E., Shiran, G., Zabari, N., Gordon, O., et al.: LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103 (2024)
- [9] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)
- [10] Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self Forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)
- [11] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)
- [12] Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)
- [13] Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)
- [14] Liu, K., Hu, W., Xu, J., Shan, Y., Lu, S.: Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161 (2025)
- [15] Lu, Y., Liang, Y., Zhu, L., Yang, Y.: Freelong: Training-free long video generation with spectralblend temporal attention. Advances in Neural Information Processing Systems 37, 131434–131455 (2024)
- [16] Lu, Y., Zeng, Y., Li, H., Ouyang, H., Wang, Q., Cheng, K.L., Zhu, J., Cao, H., Zhang, Z., Zhu, X., et al.: Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation. arXiv preprint arXiv:2512.04678 (2025)
- [17] Mao, F., Hao, A., Chen, J., Liu, D., Feng, X., Zhu, J., Wu, M., Chen, C., Wu, J., Chu, X.: Omni-effects: Unified and spatially-controllable visual effects generation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 40, pp. 7927–7935 (2026)
- [18] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4195–4205 (2023)
- [19] Polyak, A., Zohar, A., Brown, A., Tjandra, A., Sinha, A., Lee, A., Vyas, A., Shi, B., Ma, C.Y., Chuang, C.Y., et al.: Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720 (2024)
- [20] Seawead, T., Yang, C., Lin, Z., Zhao, Y., Lin, S., Ma, Z., Guo, H., Chen, H., Qi, L., Wang, S., et al.: Seaweed-7B: Cost-effective training of video generation foundation model. arXiv preprint arXiv:2504.08685 (2025)
- [21] Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024)
- [22] Teng, H., Jia, H., Sun, L., Li, L., Li, M., Tang, M., Han, S., Zhang, T., Zhang, W., Luo, W., et al.: MAGI-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211 (2025)
- [23] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
- [24] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
- [25] Yang, S., Huang, W., Chu, R., Xiao, Y., Zhao, Y., Wang, X., Li, M., Xie, E., Chen, Y., Lu, Y., et al.: LongLive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622 (2025)
- [26] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)
- [27] Yesiltepe, H., Meral, T.H.S., Akan, A.K., Oktay, K., Yanardag, P.: Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self-rollout. arXiv preprint arXiv:2511.20649 (2025)
- [28] Yin, T., Gharbi, M., Park, T., Zhang, R., Shechtman, E., Durand, F., Freeman, B.: Improved distribution matching distillation for fast image synthesis. Advances in Neural Information Processing Systems 37, 47455–47487 (2024)
- [29] Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T., Park, T.: One-step diffusion with distribution matching distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6613–6623 (2024)
- [30] Yin, T., Zhang, Q., Zhang, R., Freeman, W.T., Durand, F., Shechtman, E., Huang, X.: From slow bidirectional to fast autoregressive video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22963–22974 (2025)
- [31] Zhang, L., Agrawala, M.: Packing input frame context in next-frame prediction models for video generation. arXiv e-prints, arXiv:2504 (2025)
- [32] Zhang, L., Cai, S., Li, M., Zeng, C., Lu, B., Rao, A., Han, S., Wetzstein, G., Agrawala, M.: Pretraining frame preservation in autoregressive video memory compression. arXiv preprint arXiv:2512.23851 (2025)
- [33] Zhao, M., He, G., Chen, Y., Zhu, H., Li, C., Zhu, J.: Riflex: A free lunch for length extrapolation in video diffusion transformers. arXiv preprint arXiv:2502.15894 (2025)
- [34] Zheng, D., Huang, Z., Liu, H., Zou, K., He, Y., Zhang, F., Gu, L., Zhang, Y., He, J., Zheng, W.S., et al.: Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755 (2025)