Pith Number

pith:LJJFWT47

pith:2025:LJJFWT474TT3TFHKS5YCQNSXOX
not attested · not anchored · not stored · refs resolved

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Andrew Bai, Cho-Jui Hsieh, Jie Wu, Justin Cui, Ming Li, Rui Wang, Tao Yang, Xiaojie Li, Yuanhao Ban

Segments sampled from a video model's own long rollouts guide it toward coherent four-minute clips, without long-video teachers or retraining on long-video datasets.

arxiv:2510.02283 v1 · 2025-10-02 · cs.CV · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{LJJFWT474TT3TFHKS5YCQNSXOX}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50x longer than that of our baseline model
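
The figures in this claim can be cross-checked with simple arithmetic. The sketch below is only a reading of the quoted numbers, not a result from the paper: it assumes "99.9%" and "50x" are rounded, so the implied maximum span and the bound on the baseline length are approximate. All variable names are illustrative.

```python
# Arithmetic implied by the quoted claim (values are rounded in the source).
claimed_seconds = 4 * 60 + 15          # 4 minutes 15 seconds
fraction_of_max_span = 0.999           # "99.9% of the maximum span"
min_speedup_over_baseline = 50         # "more than 50x longer"

# Approximate maximum span supported by the position embedding.
implied_max_span_seconds = claimed_seconds / fraction_of_max_span

# "More than 50x" means the baseline clip is shorter than this bound.
baseline_upper_bound_seconds = claimed_seconds / min_speedup_over_baseline

print(claimed_seconds)                         # 255
print(round(implied_max_span_seconds, 2))      # 255.26
print(round(baseline_upper_bound_seconds, 2))  # 5.1
```

Under these assumptions the baseline generates clips of roughly five seconds or less, and the generated video nearly exhausts the position-embedding range.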

C2 · weakest assumption

That segments sampled from the model's own long self-generated videos supply reliable, non-degrading guidance equivalent to teacher supervision without introducing new compounding errors in the latent space.

C3 · one-line summary

Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.

References

72 extracted · 72 resolved · 28 Pith anchors

[1] Diffusion for world modeling: Visual details matter in Atari. Advances in Neural Information Processing Systems, 37:58757–58791, 2024
[2] Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. 2023 · arXiv:2311.15127
[3] Genie: Generative interactive environments. 2024
[4] Videojam: Joint appearance-motion representations for enhanced motion generation in video models. 2025
[5] Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024

Cited by

36 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:49.903873Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05)
Schema pith-number/v1.0

Canonical hash

5a525b4f9fe4e7b994ea977028365775f96458ee822d0503435ba3c6bcfd0005
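
For this record, the Pith Number appears to be the first 26 characters of the Base32 encoding (RFC 4648 alphabet) of the canonical hash. This is an observation from the values printed on this page, not a documented rule of the scheme; the function name `pith_number_from_hash` is hypothetical.

```python
import base64

# Canonical hash as printed on this page (hex-encoded SHA-256 digest).
CANONICAL_HASH = "5a525b4f9fe4e7b994ea977028365775f96458ee822d0503435ba3c6bcfd0005"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest and keep the leading `length` characters.

    Assumption: inferred from this record only; other records may differ.
    """
    raw = bytes.fromhex(hex_digest)
    encoded = base64.b32encode(raw).decode("ascii")
    return encoded[:length]


print(pith_number_from_hash(CANONICAL_HASH))  # LJJFWT474TT3TFHKS5YCQNSXOX
```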

Aliases

arxiv: 2510.02283 · arxiv_version: 2510.02283v1 · doi: 10.48550/arxiv.2510.02283 · pith_short_12: LJJFWT474TT3 · pith_short_16: LJJFWT474TT3TFHK · pith_short_8: LJJFWT47
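
The three `pith_short_*` aliases listed above are prefixes of the full Pith Number at lengths 8, 12, and 16. A minimal sketch of that relationship, assuming prefix truncation is all that is involved (the helper name `short_aliases` is hypothetical):

```python
# Full Pith Number as printed on this page.
PITH_NUMBER = "LJJFWT474TT3TFHKS5YCQNSXOX"


def short_aliases(pith_number: str, lengths=(8, 12, 16)) -> dict:
    """Return the short aliases as leading substrings of the full number."""
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}


aliases = short_aliases(PITH_NUMBER)
for name, value in aliases.items():
    print(name, value)
```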
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/LJJFWT474TT3TFHKS5YCQNSXOX \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 5a525b4f9fe4e7b994ea977028365775f96458ee822d0503435ba3c6bcfd0005
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "17ea339a40f1eab15b7d0eb4d0daa64de4838ce80bff720fb091434a659311e7",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-10-02T17:55:42Z",
    "title_canon_sha256": "a0b6f0aa34f4263b2dac7cc1fe967c50048cdbe39ec9d2d0743700e505f43d47"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2510.02283",
    "kind": "arxiv",
    "version": 1
  }
}
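
The same canonicalization used in the curl one-liner can be run offline against the record printed above. The sketch below mirrors that one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes, SHA-256). Whether the digest equals the canonical hash shown on this page depends on the printed record matching what the API returns, which is not verified here; the function name `canonical_sha256` is hypothetical.

```python
import hashlib
import json

# Canonical record copied from this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "17ea339a40f1eab15b7d0eb4d0daa64de4838ce80bff720fb091434a659311e7",
    "cross_cats_sorted": ["cs.AI"],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-10-02T17:55:42Z",
    "title_canon_sha256": "a0b6f0aa34f4263b2dac7cc1fe967c50048cdbe39ec9d2d0743700e505f43d47"
  },
  "schema_version": "1.0",
  "source": {"id": "2510.02283", "kind": "arxiv", "version": 1}
}
"""


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and no extra whitespace, then hash."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)  # compare by eye with the canonical hash on this page
```

Because keys are sorted before hashing, the digest does not depend on the key order or whitespace of the input JSON, which is what lets a mirror recompute the same value.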