{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:MHPVHFN3475547EDVCFGBT3TIT","short_pith_number":"pith:MHPVHFN3","schema_version":"1.0","canonical_sha256":"61df5395bbe7fbde7c83a88a60cf7344e9d43c097bf2dc4f759fc9c2b09394a6","source":{"kind":"arxiv","id":"2505.21996","version":4},"attestation_state":"computed","paper":{"title":"VRAG: Learning World Models for Interactive Video Generation","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Video retrieval augmented generation with explicit global state conditioning reduces compounding errors and improves consistency in interactive video world models.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Chi Jin, Taiye Chen, Xun Hu, Zihan Ding","submitted_at":"2025-05-28T05:55:44Z","abstract_excerpt":"Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additional action conditioning and autoregressive framework, and reveal that compounding error is inherently irreducible in autoregressive video generation, while insufficient memory mechanism leads to inco"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":true},"canonical_record":{"source":{"id":"2505.21996","kind":"arxiv","version":4},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2025-05-28T05:55:44Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"906058bf64415bafb9626af18e1843c78960ae8b4cb9b8e2b873a54cccce5b00","abstract_canon_sha256":"f778a6e2a243324a96e9ed6472c7167356b11fecaa6a25643ab0c00663372736"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-29T01:04:53.476880Z","signature_b64":"6L6rYWpvl0+v6tg3m0v3MHBMFTQRunXHspr8d2qknj3uPs6aiOWIbvwhgBEwEGwGpjxr5U7gjcAAAWd+67VCAg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"61df5395bbe7fbde7c83a88a60cf7344e9d43c097bf2dc4f759fc9c2b09394a6","last_reissued_at":"2026-05-29T01:04:53.476410Z","signature_status":"signed_v1","first_computed_at":"2026-05-29T01:04:53.476410Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"VRAG: Learning World Models for Interactive Video Generation","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Video retrieval augmented generation with explicit global state conditioning reduces compounding errors and improves consistency in interactive video world models.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Chi Jin, Taiye Chen, Xun Hu, Zihan Ding","submitted_at":"2025-05-28T05:55:44Z","abstract_excerpt":"Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additional action conditioning and autoregressive framework, and reveal that compounding error is inherently irreducible in autoregressive video generation, while insufficient memory mechanism leads to inco"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors and increases spatiotemporal consistency of world models.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The paper assumes that insufficient memory mechanisms are the primary cause of incoherence in current video world models and that retrieval of past clips plus explicit global state can overcome this without introducing new inconsistencies or requiring full retraining.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"The work introduces video retrieval augmented generation (VRAG) with explicit global state conditioning to reduce compounding errors and improve spatiotemporal consistency in interactive video world models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Video retrieval augmented generation with explicit global state conditioning reduces compounding errors and improves consistency in interactive video world models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"1473b209437479d5fb70d665cc9ec27459e724c35fdaddea600f53fbdccf50e6"},"source":{"id":"2505.21996","kind":"arxiv","version":4},"verdict":{"id":"cdb650e8-e501-4f62-9c01-ebeae64a7887","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T12:29:22.649684Z","strongest_claim":"We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors and increases spatiotemporal consistency of world models.","one_line_summary":"The work introduces video retrieval augmented generation (VRAG) with explicit global state conditioning to reduce compounding errors and improve spatiotemporal consistency in interactive video world models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The paper assumes that insufficient memory mechanisms are the primary cause of incoherence in current video world models and that retrieval of past clips plus explicit global state can overcome this without introducing new inconsistencies or requiring full retraining.","pith_extraction_headline":"Video retrieval augmented generation with explicit global state conditioning reduces compounding errors and improves consistency in interactive video world models."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2505.21996/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":3,"snapshot_sha256":"9a474ba13c68d0d07488698cbfa1f62de717b85c06bd86c7ac5326a58331fe44"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2505.21996","created_at":"2026-05-29T01:04:53.476465+00:00"},{"alias_kind":"arxiv_version","alias_value":"2505.21996v4","created_at":"2026-05-29T01:04:53.476465+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2505.21996","created_at":"2026-05-29T01:04:53.476465+00:00"},{"alias_kind":"pith_short_12","alias_value":"MHPVHFN34755","created_at":"2026-05-29T01:04:53.476465+00:00"},{"alias_kind":"pith_short_16","alias_value":"MHPVHFN3475547ED","created_at":"2026-05-29T01:04:53.476465+00:00"},{"alias_kind":"pith_short_8","alias_value":"MHPVHFN3","created_at":"2026-05-29T01:04:53.476465+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":0,"internal_anchor_count":0,"sample":[]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/MHPVHFN3475547EDVCFGBT3TIT","json":"https://pith.science/pith/MHPVHFN3475547EDVCFGBT3TIT.json","graph_json":"https://pith.science/api/pith-number/MHPVHFN3475547EDVCFGBT3TIT/graph.json","events_json":"https://pith.science/api/pith-number/MHPVHFN3475547EDVCFGBT3TIT/events.json","paper":"https://pith.science/paper/MHPVHFN3"},"agent_actions":{"view_html":"https://pith.science/pith/MHPVHFN3475547EDVCFGBT3TIT","download_json":"https://pith.science/pith/MHPVHFN3475547EDVCFGBT3TIT.json","view_paper":"https://pith.science/paper/MHPVHFN3","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2505.21996&json=true","fetch_graph":"https://pith.science/api/pith-number/MHPVHFN3475547EDVCFGBT3TIT/graph.json","fetch_events":"https://pith.science/api/pith-number/MHPVHFN3475547EDVCFGBT3TIT/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/MHPVHFN3475547EDVCFGBT3TIT/action/timestamp_anchor","attest_storage":"https://pith.science/pith/MHPVHFN3475547EDVCFGBT3TIT/action/storage_attestation","attest_author":"https://pith.science/pith/MHPVHFN3475547EDVCFGBT3TIT/action/author_attestation","sign_citation":"https://pith.science/pith/MHPVHFN3475547EDVCFGBT3TIT/action/citation_signature","submit_replication":"https://pith.science/pith/MHPVHFN3475547EDVCFGBT3TIT/action/replication_record"}},"created_at":"2026-05-29T01:04:53.476465+00:00","updated_at":"2026-05-29T01:04:53.476465+00:00"}