{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:A7RYB4AHIF4I7H7TK5J7BWWNAW","short_pith_number":"pith:A7RYB4AH","schema_version":"1.0","canonical_sha256":"07e380f00741788f9ff35753f0dacd05b2667982ed5178dd97a067593ae6a0fe","source":{"kind":"arxiv","id":"2605.15458","version":1},"attestation_state":"computed","paper":{"title":"Video Models Can Reason with Verifiable Rewards","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Reinforcement learning with rule-based rewards lets video diffusion models generate trajectories that satisfy explicit spatial and logical constraints.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Hoifung Poon, James Y. Huang, Muhao Chen, Selena Song, Sheng Zhang, Tinghui Zhu, Xiaofei Wen, Yuankai Li","submitted_at":"2026-05-14T22:40:56Z","abstract_excerpt":"Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the gener"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2605.15458","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2026-05-14T22:40:56Z","cross_cats_sorted":[],"title_canon_sha256":"d5f25b7fc6ebb50f9cc96843e67cbb4368fae5c8f8b4b71d742ea333cadbf23a","abstract_canon_sha256":"e9ef74b397d1039da55069449d3b8bdf7f1ad11d912fc3e14d77890bebb21968"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-20T00:00:59.630924Z","signature_b64":"aRwYy0n6TOjIchy/6rj+WQlEhDrPR9Pq3hBqVPM9aMixAhCZW0WNdA/RfEB5zgOLlsBEPlbTcyfzPS+Qou9kAQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"07e380f00741788f9ff35753f0dacd05b2667982ed5178dd97a067593ae6a0fe","last_reissued_at":"2026-05-20T00:00:59.630083Z","signature_status":"signed_v1","first_computed_at":"2026-05-20T00:00:59.630083Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Video Models Can Reason with Verifiable Rewards","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Reinforcement learning with rule-based rewards lets video diffusion models generate trajectories that satisfy explicit spatial and logical constraints.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Hoifung Poon, James Y. Huang, Muhao Chen, Selena Song, Sheng Zhang, Tinghui Zhu, Xiaofei Wen, Yuankai Li","submitted_at":"2026-05-14T22:40:56Z","abstract_excerpt":"Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the gener"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That success on three procedurally generated domains with objective success criteria (Maze, FlowFree, Sokoban) demonstrates reliable rule-consistent visual reasoning that generalizes beyond these specific environments and reward formulations.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Maze, FlowFree, and Sokoban.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Reinforcement learning with rule-based rewards lets video diffusion models generate trajectories that satisfy explicit spatial and logical constraints.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"7fd1cd011b313fac42b950221481e1166d58ef141380ad3e49e5e640e0e0af5f"},"source":{"id":"2605.15458","kind":"arxiv","version":1},"verdict":{"id":"e12dcc70-f022-4e76-80a0-cfffe0045d8f","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T14:56:32.221238Z","strongest_claim":"Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks.","one_line_summary":"VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Maze, FlowFree, and Sokoban.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That success on three procedurally generated domains with objective success criteria (Maze, FlowFree, Sokoban) demonstrates reliable rule-consistent visual reasoning that generalizes beyond these specific environments and reward formulations.","pith_extraction_headline":"Reinforcement learning with rule-based rewards lets video diffusion models generate trajectories that satisfy explicit spatial and logical constraints."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.15458/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"citation_quote_validity","ran_at":"2026-05-19T15:49:49.130610Z","status":"completed","version":"0.1.0","findings_count":0},{"name":"doi_title_agreement","ran_at":"2026-05-19T15:31:17.743736Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"cited_work_retraction","ran_at":"2026-05-19T15:23:29.112647Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T15:10:44.027712Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"claim_evidence","ran_at":"2026-05-19T14:21:54.100938Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-19T13:33:22.672465Z","status":"skipped","version":"1.0.0","findings_count":0}],"snapshot_sha256":"7b8f44e1c9e6bc749ca2928d00dfcd409f7280aa6b3e541ced25de522f560536"},"references":{"count":49,"sample":[{"doi":"","year":2026,"title":"Onestory: Coherent multi-shot video generation with adaptive memory.CVPR, 2026a","work_id":"ee10761b-dd39-441e-9b68-159d0bbbf0c0","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Training Diffusion Models with Reinforcement Learning","work_id":"67684dda-3930-452a-b91a-36cbb8e2e219","ref_index":2,"cited_arxiv_id":"2305.13301","is_internal_anchor":true},{"doi":"","year":2024,"title":"Video generation models as world simulators","work_id":"6d25dc8f-02d4-4aab-84e4-1d87bf3567a6","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"MMGR: Multi-modal generative reasoning","work_id":"a0c3f635-dfea-4dbc-8c13-bf10cab2777c","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Dgpo: discovering multiple strategies with diversity-guided policy optimization","work_id":"ca9e3dad-bd1e-49af-be7a-bb4fe74da2f7","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":49,"snapshot_sha256":"30fb2993fda52b0ff9f8ed094b985600e4948e6e012c86b6c69c07d74987af8f","internal_anchors":17},"formal_canon":{"evidence_count":2,"snapshot_sha256":"f76a9dcc143345158d915eda60068934e736bb5fca5a0ff8c4b17c09fb17993d"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2605.15458","created_at":"2026-05-20T00:00:59.630202+00:00"},{"alias_kind":"arxiv_version","alias_value":"2605.15458v1","created_at":"2026-05-20T00:00:59.630202+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2605.15458","created_at":"2026-05-20T00:00:59.630202+00:00"},{"alias_kind":"pith_short_12","alias_value":"A7RYB4AHIF4I","created_at":"2026-05-20T00:00:59.630202+00:00"},{"alias_kind":"pith_short_16","alias_value":"A7RYB4AHIF4I7H7T","created_at":"2026-05-20T00:00:59.630202+00:00"},{"alias_kind":"pith_short_8","alias_value":"A7RYB4AH","created_at":"2026-05-20T00:00:59.630202+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":0,"internal_anchor_count":0,"sample":[]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/A7RYB4AHIF4I7H7TK5J7BWWNAW","json":"https://pith.science/pith/A7RYB4AHIF4I7H7TK5J7BWWNAW.json","graph_json":"https://pith.science/api/pith-number/A7RYB4AHIF4I7H7TK5J7BWWNAW/graph.json","events_json":"https://pith.science/api/pith-number/A7RYB4AHIF4I7H7TK5J7BWWNAW/events.json","paper":"https://pith.science/paper/A7RYB4AH"},"agent_actions":{"view_html":"https://pith.science/pith/A7RYB4AHIF4I7H7TK5J7BWWNAW","download_json":"https://pith.science/pith/A7RYB4AHIF4I7H7TK5J7BWWNAW.json","view_paper":"https://pith.science/paper/A7RYB4AH","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2605.15458&json=true","fetch_graph":"https://pith.science/api/pith-number/A7RYB4AHIF4I7H7TK5J7BWWNAW/graph.json","fetch_events":"https://pith.science/api/pith-number/A7RYB4AHIF4I7H7TK5J7BWWNAW/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/A7RYB4AHIF4I7H7TK5J7BWWNAW/action/timestamp_anchor","attest_storage":"https://pith.science/pith/A7RYB4AHIF4I7H7TK5J7BWWNAW/action/storage_attestation","attest_author":"https://pith.science/pith/A7RYB4AHIF4I7H7TK5J7BWWNAW/action/author_attestation","sign_citation":"https://pith.science/pith/A7RYB4AHIF4I7H7TK5J7BWWNAW/action/citation_signature","submit_replication":"https://pith.science/pith/A7RYB4AHIF4I7H7TK5J7BWWNAW/action/replication_record"}},"created_at":"2026-05-20T00:00:59.630202+00:00","updated_at":"2026-05-20T00:00:59.630202+00:00"}