Pith Number

pith:A7RYB4AH

pith:2026:A7RYB4AHIF4I7H7TK5J7BWWNAW
not attested · not anchored · not stored · refs resolved

Video Models Can Reason with Verifiable Rewards

Hoifung Poon, James Y. Huang, Muhao Chen, Selena Song, Sheng Zhang, Tinghui Zhu, Xiaofei Wen, Yuankai Li

Reinforcement learning with rule-based rewards lets video diffusion models generate trajectories that satisfy explicit spatial and logical constraints.

arxiv:2605.15458 v1 · 2026-05-14 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{A7RYB4AHIF4I7H7TK5J7BWWNAW}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks.

C2 · weakest assumption

That success on three procedurally generated domains with objective success criteria (Maze, FlowFree, Sokoban) demonstrates reliable rule-consistent visual reasoning that generalizes beyond these specific environments and reward formulations.

C3 · one-line summary

VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Maze, FlowFree, and Sokoban.
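
To make the terms in the summary above more concrete, here is a minimal Python sketch of what a dense decomposed rule-based reward and a GRPO-style group-normalized advantage could look like for a grid maze. Nothing in it is taken from the paper: the reward components, their weights, the `maze_reward` and `group_advantages` names, and the representation of a generated trajectory as a list of grid cells are all illustrative assumptions. SDE sampling and Early-Step Focus are not modeled.

```python
from statistics import mean, pstdev

Cell = tuple[int, int]


def manhattan(a: Cell, b: Cell) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def maze_reward(grid: list[str], path: list[Cell], goal: Cell) -> float:
    """Hypothetical dense reward: validity + progress + success (assumed weights)."""
    if not path:
        return 0.0
    moves = list(zip(path, path[1:]))
    legal = 0
    for prev, cur in moves:
        r, c = cur
        inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
        # A move is legal if it stays on the grid, steps to an adjacent cell,
        # and does not enter a wall ('#').
        if inside and manhattan(prev, cur) == 1 and grid[r][c] != "#":
            legal += 1
    validity = legal / len(moves) if moves else 0.0
    start_dist = manhattan(path[0], goal)
    progress = 1.0 - manhattan(path[-1], goal) / start_dist if start_dist else 1.0
    progress = max(0.0, progress)
    success = 1.0 if moves and legal == len(moves) and path[-1] == goal else 0.0
    # Partial credit keeps the signal non-zero even when no rollout fully succeeds.
    return 0.3 * validity + 0.3 * progress + 0.4 * success


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its sampling group (mean 0, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


grid = ["..#", "..."]          # '#' is a wall, '.' is free
goal = (1, 2)
good = [(0, 0), (1, 0), (1, 1), (1, 2)]   # legal path that reaches the goal
bad = [(0, 0), (0, 1), (0, 2)]            # second move walks into the wall
rewards = [maze_reward(grid, p, goal) for p in (good, bad)]
advantages = group_advantages(rewards)
```

In this toy group the failing rollout still earns partial credit for its one legal move and for ending closer to the goal, which is the kind of signal a sparse success-only reward would not provide.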

References

49 extracted · 49 resolved · 17 Pith anchors

[1] Onestory: Coherent multi-shot video generation with adaptive memory. CVPR, 2026
[2] Training Diffusion Models with Reinforcement Learning 2023 · arXiv:2305.13301
[3] Video generation models as world simulators 2024
[4] MMGR: Multi-modal generative reasoning 2025
[5] Dgpo: discovering multiple strategies with diversity-guided policy optimization 2024

Formal links

2 machine-checked theorem links

Receipt and verification
First computed 2026-05-20T00:00:59.630083Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

07e380f00741788f9ff35753f0dacd05b2667982ed5178dd97a067593ae6a0fe

Aliases

arxiv: 2605.15458 · arxiv_version: 2605.15458v1 · doi: 10.48550/arxiv.2605.15458 · pith_short_12: A7RYB4AHIF4I · pith_short_16: A7RYB4AHIF4I7H7T · pith_short_8: A7RYB4AH
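
The short aliases listed above are prefixes of the full Pith Number, and for this record the full number also coincides with the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. The Python sketch below checks both relationships. Treat the base32 relationship as an observation about this one record, not a documented derivation rule; the function names `short_aliases` and `base32_prefix` are hypothetical.

```python
import base64

CANONICAL_HASH = "07e380f00741788f9ff35753f0dacd05b2667982ed5178dd97a067593ae6a0fe"
PITH_NUMBER = "A7RYB4AHIF4I7H7TK5J7BWWNAW"


def short_aliases(pith_number: str, lengths=(8, 12, 16)) -> dict[str, str]:
    """Build pith_short_N aliases by truncating the full number (observed pattern)."""
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}


def base32_prefix(hex_digest: str, length: int = 26) -> str:
    """First `length` characters of the base32 encoding of a hex digest."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


aliases = short_aliases(PITH_NUMBER)
matches_hash = base32_prefix(CANONICAL_HASH) == PITH_NUMBER
print(aliases)
print("number matches base32 prefix of hash:", matches_hash)
```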
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/A7RYB4AHIF4I7H7TK5J7BWWNAW \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 07e380f00741788f9ff35753f0dacd05b2667982ed5178dd97a067593ae6a0fe
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "e9ef74b397d1039da55069449d3b8bdf7f1ad11d912fc3e14d77890bebb21968",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-14T22:40:56Z",
    "title_canon_sha256": "d5f25b7fc6ebb50f9cc96843e67cbb4368fae5c8f8b4b71d742ea333cadbf23a"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.15458",
    "kind": "arxiv",
    "version": 1
  }
}
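
The verification one-liner above can also be run offline against a saved copy of the canonical record. The following Python sketch mirrors the same canonicalization (sorted keys, compact separators, UTF-8, no ASCII escaping) and compares the result with the published hash. It assumes the JSON shown on this page is byte-for-byte what the API returns under `canonical_record`, which has not been verified here; `canonical_sha256` is a hypothetical helper name.

```python
import hashlib
import json

EXPECTED = "07e380f00741788f9ff35753f0dacd05b2667982ed5178dd97a067593ae6a0fe"

# Record copied from the page; assumed to match the API's canonical_record.
record = {
    "metadata": {
        "abstract_canon_sha256": "e9ef74b397d1039da55069449d3b8bdf7f1ad11d912fc3e14d77890bebb21968",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2026-05-14T22:40:56Z",
        "title_canon_sha256": "d5f25b7fc6ebb50f9cc96843e67cbb4368fae5c8f8b4b71d742ea333cadbf23a",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.15458", "kind": "arxiv", "version": 1},
}


def canonical_sha256(obj: dict) -> str:
    """SHA-256 over the compact, key-sorted JSON serialization of obj."""
    payload = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which the fields appear in the saved file. A mismatch would suggest either that the record changed or that the copy differs from what the API serves.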