Pretraining frame preservation in autoregressive video memory compression. arXiv preprint arXiv:2512.23851

Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, Maneesh Agrawala · 2025 · cs.CV · arXiv 2512.23851

16 Pith papers cite this work. Polarity classification is still indexing.

abstract

History context is central to autoregressive video generation, driving consistency and storytelling for both commercial models and personal use cases. For example, personal users, offline workflows, and individual-scale finetuning need to encode longer video histories under tight compute and memory budgets. We observe that content and identity consistency is an essential requirement, and that complete, uninterrupted history coverage together with content query and interpretation capabilities is broadly desired. We present TinyHistory, a lightweight history embedding learned through two-stage context learning. In the first stage, we pretrain the encoder on large-scale video data with a randomized frame query objective; in the second stage, we repurpose the pretrained encoder within an autoregressive video diffusion model to learn content-level consistency. As a result, we show that the learned lightweight embeddings achieve consistency comparable to heavier alternatives (as measured by VLM, VBench, ELO, etc.), while reducing training overhead and extending the encodable history length within a given memory budget. We conduct ablation studies to analyze the influence and trade-offs of each component.

citation-role summary

background: 2 · other: 1

citation-polarity summary

background: 2 · unclear: 1

representative citing papers

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

LongLive-RAG formulates long video generation as retrieval-augmented generation by treating self-generated latents as a dynamic searchable history and adding a Window Temporal Delta Loss for better retrieval.

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

cs.CV · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

DySink maintains a memory bank and retrieves relevant historical frames as dynamic sinks while using an anomaly gate to suppress collapse, yielding higher temporal quality and dynamic degree on minute-long videos.

ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space

cs.LG · 2026-04-30 · unverdicted · novelty 7.0

ABC enables any-subset autoregressive generation of continuous stochastic processes via non-Markovian diffusion bridges that track physical time and allow path-dependent conditioning.

Efficient Video Diffusion Models: Advancements and Challenges

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressive video synthesis.

Compression and Retrieval: Implicit Memory Retrieval for Video World Models

cs.CV · 2026-06-22 · unverdicted · novelty 6.0

CaR uses attention with viewpoint positional encoding and context compression for flexible memory retrieval in video world models, backed by a new SceneFly dataset, and reports SOTA results with open-domain generalization.

InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars

cs.CV · 2026-06-22 · unverdicted · novelty 6.0 · 2 refs

InteractiveAvatar is a real-time infinite-streaming avatar video generation system using autoregressive distillation, Long-Short Visual Memory for consistency, and a Reasoning-Reaction Module for intent-aware interactions.

PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

cs.CV · 2026-06-15 · unverdicted · novelty 6.0

PermaVid disentangles spatial context into semantic appearance and geometric structure via multi-modal memory banks and edit-aware updates to maintain long-term consistency in video generation after edits.

Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation

cs.MM · 2026-06-03 · unverdicted · novelty 6.0

Echo-Infinity replaces handcrafted KV-cache schedules with end-to-end optimized Memory Queries and a Unified Relative RoPE recipe to support real-time infinite video generation in diffusion transformers.

OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

OmniMem enables scalable long video generation via adaptive sparse KV retrieval that addresses local bias and union explosion while preserving explicit historical access.

FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

FlowLong generates videos several times longer than native model windows by blending adjacent predictions with Tweedie matching to enforce manifold and temporal consistency while using stochastic noise injection early and deterministic sampling later.

Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

IAMFlow is a training-free identity-aware memory system that tracks entities via LLM global ID assignment and VLM frame verification to reduce identity drift in narrative long video generation from shifting prompts.

RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

RAVEN aligns training and inference for causal autoregressive video diffusion via interleaved rollout repacking and introduces CM-GRPO for direct RL on consistency-model kernels, claiming better quality than recent baselines.

EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

EverAnimate restores drifted latent flow trajectories in chunked video generation via persistent latent propagation and restorative flow matching, achieving measurable gains in PSNR, SSIM, LPIPS, and FID over prior long-animation methods with only LoRA tuning.

AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization

cs.CV · 2026-06-05 · unverdicted · novelty 5.0

AnchorWorld proposes a simulation framework that adds exogenous viewpoint supervision for full-body grounding and anchor-view text customization for dynamic world evolution in egocentric settings.

AlayaWorld: Long-Horizon and Playable Video World Generation

cs.CV · 2026-07-07 · conditional · novelty 4.0

AlayaWorld is a full-stack open-source framework for interactive video world generation, combining 3D spatial caching, error-bank training, and few-step distillation for real-time playable worlds.

citing papers explorer

Showing 16 of 16 citing papers.

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation
cs.CV · 2026-06-01 · unverdicted · none · ref 63 · internal anchor
DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation
cs.CV · 2026-05-20 · unverdicted · none · ref 20 · 2 links · internal anchor
ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space
cs.LG · 2026-04-30 · unverdicted · none · ref 70 · internal anchor
Efficient Video Diffusion Models: Advancements and Challenges
cs.CV · 2026-04-17 · unverdicted · none · ref 183 · internal anchor
Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
cs.CV · 2026-04-08 · unverdicted · none · ref 32 · internal anchor
Compression and Retrieval: Implicit Memory Retrieval for Video World Models
cs.CV · 2026-06-22 · unverdicted · none · ref 22 · internal anchor
InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars
cs.CV · 2026-06-22 · unverdicted · none · ref 44 · 2 links · internal anchor
PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory
cs.CV · 2026-06-15 · unverdicted · none · ref 16 · internal anchor
Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation
cs.MM · 2026-06-03 · unverdicted · none · ref 47 · internal anchor
OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation
cs.CV · 2026-05-28 · unverdicted · none · ref 22 · internal anchor
FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching
cs.CV · 2026-05-20 · unverdicted · none · ref 37 · internal anchor
Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory
cs.CV · 2026-05-18 · unverdicted · none · ref 52 · internal anchor
RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO
cs.CV · 2026-05-14 · unverdicted · none · ref 30 · internal anchor
EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration
cs.CV · 2026-05-14 · unverdicted · none · ref 54 · internal anchor
AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization
cs.CV · 2026-06-05 · unverdicted · none · ref 61 · internal anchor
AlayaWorld: Long-Horizon and Playable Video World Generation
cs.CV · 2026-07-07 · conditional · none · ref 92 · internal anchor
