
arxiv: 2511.22973 · v2 · pith:EO4ULENEnew · submitted 2025-11-28 · 💻 cs.CV

BIFE: Better Interaction, Fewer Errors for Minute-Long Video Generation

classification 💻 cs.CV
keywords: bife, generation, video, cache, long-range, minute-long, alibaba-damo-academy, consistency

Long video generation is a critical step toward building realistic world models, requiring both high visual fidelity and long-range interaction consistency. Recent autoregressive diffusion models enable long-horizon generation through KV cache reuse, yet suffer from two fundamental challenges: failure to preserve long-range interactions due to sliding-window KV cache and error accumulation that progressively degrades generation quality over time. To address these issues, we propose BIFE, a framework that introduces a semantic sparse KV cache for retrieval-based long-range conditioning and a Block Forcing training strategy to enforce cross-block consistency. Together, these designs preserve historical interactions while mitigating drift, enabling stable and coherent minute-long video generation. We also introduce InterVBench, a minute-long video benchmark with fine-grained block-level annotations and Video Drift Error metrics. Extensive experiments on InterVBench and VBench-Long demonstrate that BIFE achieves state-of-the-art performance, including a 22.2% improvement on VDE-Subject and a 19.4% improvement on VDE-Clarity over baselines. Website: https://alibaba-damo-academy.github.io/BIFE. Code: https://github.com/alibaba-damo-academy/BIFE.
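The contrast the abstract draws between a sliding-window KV cache and retrieval-based conditioning can be illustrated with a small sketch. This is not the BIFE implementation: the per-block summary vectors, the cosine-similarity scoring, and the function names `sliding_window_blocks` and `retrieved_blocks` are all assumptions made for illustration, since the abstract does not specify how the semantic sparse KV cache selects history.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sliding_window_blocks(num_past, window):
    """Indices of the most recent `window` blocks; everything older is dropped."""
    return list(range(max(0, num_past - window), num_past))


def retrieved_blocks(query, block_summaries, window, top_k):
    """Hypothetical retrieval rule: keep the recent window, plus the `top_k`
    older blocks whose summary vectors are most similar to the query."""
    num_past = len(block_summaries)
    recent = sliding_window_blocks(num_past, window)
    older = range(0, num_past - len(recent))
    ranked = sorted(
        older,
        key=lambda i: cosine(query, block_summaries[i]),
        reverse=True,
    )
    return sorted(ranked[:top_k]) + recent


# Five past blocks, each with a toy 2-D summary vector (assumed representation).
summaries = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 1.0], [0.0, 1.0]]
query = [1.0, 0.0]

print(sliding_window_blocks(len(summaries), window=2))       # [3, 4]
print(retrieved_blocks(query, summaries, window=2, top_k=1))  # [0, 3, 4]
```

In this toy case the sliding window forgets block 0 even though it is the closest match to the current query, while the retrieval rule keeps it alongside the recent blocks. The cache size stays bounded at `window + top_k` blocks in both the short and the long-history case.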

This paper has not been read by Pith yet.


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Latent Block-Diffusion Temporal Point Processes: A Semi-Autoregressive Framework for Asynchronous Event Sequence Generation

    cs.LG · 2026-06 · unverdicted · novelty 6.0

    LBDTPP generates high-quality variable-length event sequences by autoregressing over latent blocks and diffusing within blocks, with Wasserstein bounds claiming reduced error accumulation under local approximation and...

  2. OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    OmniMem enables scalable long video generation via adaptive sparse KV retrieval that addresses local bias and union explosion while preserving explicit historical access.

  3. Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    IAMFlow is a training-free identity-aware memory system that tracks entities via LLM global ID assignment and VLM frame verification to reduce identity drift in narrative long video generation from shifting prompts.

  4. WorldOlympiad: Can Your World Model Survive a Triathlon?

    cs.CV · 2026-06 · unverdicted · novelty 5.0

    WorldOlympiad is a new benchmark decomposing world-model evaluation into physical, geometry, and interaction tracks using segmentation, MLLM judges, Gaussian splatting, and action prompts on diverse scenarios.