Video Models Can Reason with Verifiable Rewards

Hoifung Poon; James Y. Huang; Muhao Chen; Selena Song; Sheng Zhang; Tinghui Zhu; Xiaofei Wen; Yuankai Li

arxiv: 2605.15458 · v1 · pith:A7RYB4AHnew · submitted 2026-05-14 · 💻 cs.CV

Video Models Can Reason with Verifiable Rewards

Tinghui Zhu , Sheng Zhang , James Y. Huang , Selena Song , Xiaofei Wen , Yuankai Li , Hoifung Poon , Muhao Chen This is my paper

Pith reviewed 2026-05-19 14:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords video diffusion modelsreinforcement learningverifiable rewardsvisual reasoningSDE-GRPOdense rewardsvideo generation

0 comments

The pith

Reinforcement learning with rule-based rewards lets video diffusion models generate trajectories that satisfy explicit spatial and logical constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that video diffusion models, which currently prioritize plausible appearance over rule adherence, can be optimized using reinforcement learning driven by verifiable success signals. It introduces VideoRLVR, which treats video generation as the production of visual trajectories and applies an SDE-GRPO optimizer together with dense decomposed rewards and an Early-Step Focus strategy that limits updates to the first part of denoising. On three procedurally generated environments with clear win conditions, the resulting models exceed supervised fine-tuning baselines, with the largest gains appearing when success rates are low. The approach also surpasses several proprietary and open-source video generators on both in-domain and out-of-domain checks. If the method holds, video models could move from imitation of appearance to reliable satisfaction of constraints.

Core claim

VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and optimizes diffusion models with an SDE-GRPO backbone, dense decomposed rewards, and an Early-Step Focus strategy that restricts policy updates to the early denoising steps, achieving consistent gains over supervised fine-tuning on Maze, FlowFree, and Sokoban while cutting training latency by roughly 40 percent.

What carries the argument

VideoRLVR optimization recipe built on SDE-GRPO policy updates, dense decomposed rewards that break success into spatial-temporal components, and Early-Step Focus that restricts gradients to the initial denoising phase.

If this is right

VideoRLVR improves success rates over supervised fine-tuning on Maze, FlowFree, and Sokoban.
Dense decomposed rewards deliver the largest gains precisely when baseline success rates are low.
The Early-Step Focus strategy preserves performance while cutting training time by about 40 percent.
The RL-tuned model exceeds both proprietary and open-source video generators on the tested reasoning benchmarks and on out-of-domain checks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same verifiable-reward loop could be applied to other video tasks that admit objective scoring, such as physical simulation or instructional video planning.
Because the method separates reward design from the diffusion backbone, it offers a route to incorporate new constraint types without retraining the entire generator.
The latency reduction from Early-Step Focus suggests that similar partial-trajectory optimization may be useful in other diffusion-based sequence models.

Load-bearing premise

Success on three procedurally generated domains with objective success criteria shows that the model has acquired reliable rule-consistent visual reasoning that works outside those exact environments.

What would settle it

Test the trained model on a fourth procedurally generated domain with a new set of rules never encountered in training and measure whether success rate remains high without additional fine-tuning.

Figures

Figures reproduced from arXiv: 2605.15458 by Hoifung Poon, James Y. Huang, Muhao Chen, Selena Song, Sheng Zhang, Tinghui Zhu, Xiaofei Wen, Yuankai Li.

**Figure 1.** Figure 1: Evolution towards verifiable video reasoning. Top: Comparison between perceptionfocused generation and reasoning-intensive tasks. Bottom: We introduce VideoRLVR, the missing puzzle in the training paradigm for video reasoning models. analogy to reasoning-oriented LLMs where pre-training provides broad generative competence, SFT teaches the format of reasoning traces, Reinforcement Learning with Verifiable… view at source ↗

**Figure 2.** Figure 2: Success rate of different grid size for maze. n represents the number of samples falling into this range As shown in [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Scaling results with group size. domain, as shown in [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative case study across three reasoning domains. We compare generations from the [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative example of reward hacking in the absence of KL-divergence regularization. It [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

read the original abstract

Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable rule-consistent visual reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VideoRLVR applies RL with dense rewards and an early-step trick to video diffusion and gets clean gains on grid puzzles, but the reasoning generalization claim is still narrow.

read the letter

The main thing to know is that this paper shows a workable recipe for using verifiable RL to improve rule-following in video diffusion models on tasks like maze navigation and Sokoban. They adapt the RLVR approach with an SDE-GRPO backbone, dense decomposed rewards, and Early-Step Focus that limits optimization to early denoising steps and cuts training time by about 40 percent while keeping results similar. On the three procedural domains they test, the method beats supervised fine-tuning baselines, with the dense rewards mattering most in low-success settings, and it also edges out other open and proprietary video models on both in-domain and out-of-domain checks. That combination of elements for video is new relative to the language-model work they cite, and the efficiency piece is a practical addition that stands on its own. The empirical pattern looks internally consistent and the benchmarks have objective success criteria, which helps avoid circularity. The soft spot is the evaluation scope. All the core results sit on similar low-dimensional grid worlds with explicit rules and full observability, so the gains could reflect optimization to those specific reward structures rather than broader spatial-temporal logic. The abstract flags out-of-domain benchmarks, but without more detail on the size of the shift or failure cases it is hard to judge how far the rule-consistent reasoning actually travels. If the full paper has thorough ablations and statistics that address this, the concern shrinks; otherwise it remains the main limit on the stronger claims. This is useful for researchers working on controllable video generation or RL for diffusion models. A reader already thinking about verifiable feedback in generative settings would get concrete ideas to try. The work shows clear method design and honest engagement with the benchmarks, so it deserves a serious referee to check the experiments and see whether the generalization holds up under closer scrutiny. I would send it to peer review.

Referee Report

1 major / 2 minor

Summary. The paper introduces VideoRLVR, a method to optimize video diffusion models for verifiable reasoning using reinforcement learning with rule-based feedback. It features an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy that reduces training latency by ~40%. Evaluated on three procedurally generated domains (Maze, FlowFree, Sokoban) with objective success criteria, VideoRLVR shows consistent gains over supervised fine-tuning baselines (especially with dense rewards in low-success settings) and outperforms other video generation models on both in-domain and out-of-domain benchmarks, suggesting a shift from perceptual imitation to reliable rule-consistent visual reasoning.

Significance. If the reported improvements hold under scrutiny, this represents a meaningful step in extending RLVR techniques from language models to video diffusion for tasks with explicit spatial-temporal-logical constraints. The practical efficiency gain from Early-Step Focus and the emphasis on dense rewards are useful contributions. The benchmark-driven evaluation with objective criteria offers concrete, falsifiable results, though broader impact hinges on evidence of generalization beyond the tested environments.

major comments (1)

[Evaluation section] Evaluation section (and Abstract): The central claim that gains on Maze, FlowFree, and Sokoban demonstrate 'reliable rule-consistent visual reasoning' that generalizes is load-bearing but rests on three domains sharing low-dimensional, fully observable, rule-explicit properties. This risks conflating environment-specific trajectory optimization with transferable logic. The mention of out-of-domain benchmarks lacks quantitative details on domain shift, reward sensitivity, or failure-case analysis, weakening support for the broader reasoning claim.

minor comments (2)

[Introduction] Expand the SDE-GRPO acronym and provide a short equation or reference on first use in the methods or introduction.
[Methods] Add a diagram or pseudocode for the Early-Step Focus strategy to clarify how it restricts policy optimization to the early denoising phase.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the concern about the evaluation section and the strength of our generalization claims below, and we outline the revisions we will make to strengthen the presentation.

read point-by-point responses

Referee: [Evaluation section] Evaluation section (and Abstract): The central claim that gains on Maze, FlowFree, and Sokoban demonstrate 'reliable rule-consistent visual reasoning' that generalizes is load-bearing but rests on three domains sharing low-dimensional, fully observable, rule-explicit properties. This risks conflating environment-specific trajectory optimization with transferable logic. The mention of out-of-domain benchmarks lacks quantitative details on domain shift, reward sensitivity, or failure-case analysis, weakening support for the broader reasoning claim.

Authors: We appreciate the referee's observation that the three domains share structural similarities. These environments were deliberately chosen because they provide objective, rule-based success criteria that enable verifiable rewards, which is a core requirement of the RLVR framework. The tasks nevertheless differ in their core mechanics—Maze emphasizes path planning, FlowFree requires constraint satisfaction under flow rules, and Sokoban involves object manipulation and planning—thereby exercising distinct aspects of spatial-temporal reasoning. We do not claim that success on these domains proves broad transfer to arbitrary visual reasoning; rather, the results demonstrate that RL with dense decomposed rewards can improve rule adherence beyond supervised fine-tuning in settings where correctness is unambiguously measurable. Regarding out-of-domain benchmarks, we acknowledge that the current text mentions performance improvements but provides insufficient quantitative analysis of domain shift, reward sensitivity, and failure modes. We will revise the Evaluation section (and update the Abstract accordingly) to include additional metrics, a description of the domain-shift characteristics, and representative success/failure examples. This revision will better contextualize the scope of our claims without overstating generalization. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical method with external objective benchmarks

full rationale

The paper presents VideoRLVR as an empirical optimization recipe (SDE-GRPO backbone, dense decomposed rewards, Early-Step Focus) evaluated on three procedurally generated domains with objective success criteria. Performance gains are reported via direct comparison to supervised fine-tuning baselines and other video models on in-domain and out-of-domain benchmarks. No first-principles derivations, predictions, or uniqueness theorems are claimed that reduce by construction to fitted parameters or self-citations within the paper. The central results rest on external verifiable rewards and benchmark metrics rather than internal redefinitions or load-bearing self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions from diffusion model training and RL with verifiable rewards; no new free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Video diffusion models can be effectively optimized via policy gradients using rule-based verifiable rewards defined on generated trajectories.
Core premise enabling the transfer of RLVR from language to video models.

pith-pipeline@v0.9.0 · 5776 in / 1225 out tokens · 51096 ms · 2026-05-19T14:56:32.221238+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

dense decomposed rewards that break sparse task success into verifiable structural components

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 18 internal anchors

[1]

Vggrpo: Towards world-consistent video generation with 4d latent reward.arXiv preprint arXiv:2603.26599, 2026

Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, and Marta Tintore Gazulla. Vggrpo: Towards world-consistent video generation with 4d latent reward.arXiv preprint arXiv:2603.26599, 2026

work page arXiv 2026
[2]

Training Diffusion Models with Reinforcement Learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024

work page 2024
[4]

Mmgr: Multi-modal generative reasoning.arXiv preprint arXiv:2512.14691, 2025

Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Wen Xiao, Jiuxiang Gu, et al. Mmgr: Multi-modal generative reasoning.arXiv preprint arXiv:2512.14691, 2025

work page arXiv 2025
[5]

Dgpo: discovering multiple strategies with diversity-guided policy optimization

Wentse Chen, Shiyu Huang, Yuan Chiang, Tim Pearce, Wei-Wei Tu, Ting Chen, and Jun Zhu. Dgpo: discovering multiple strategies with diversity-guided policy optimization. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 11390–11398, 2024

work page 2024
[6]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Language-conditioned world modeling for visual navigation.arXiv preprint arXiv:2603.26741, 2026

Yifei Dong, Fengyi Wu, Yilong Dai, Lingdong Kong, Guangyu Chen, Xu Zhu, Qiyu Hu, Tianyu Wang, Johnalbert Garnica, Feng Liu, et al. Language-conditioned world modeling for visual navigation.arXiv preprint arXiv:2603.26741, 2026

work page arXiv 2026
[8]

Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023

work page 2023
[9]

Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

work page 2020
[10]

Gemini 3.1 Pro Model Card

Google DeepMind. Gemini 3.1 Pro Model Card. https://deepmind.google/models/ model-cards/gemini-3-1-pro/, February 2026

work page 2026
[11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Are video models ready as zero-shot reasoners? an empirical study with the mme-cof benchmark.arXiv preprint arXiv:2510.26802, 2025

Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, and Pheng-Ann Heng. Are video models ready as zero-shot reasoners? an empirical study with the mme-cof benchmark.arXiv preprint arXiv:2510.26802, 2025

work page arXiv 2025
[13]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

Puzzlefusion: Unleashing the power of diffusion models for spatial puzzle solving

Sepidehsadat Sepid Hossieni, Mohammad Amin Shabani, Saghar Irandoust, and Yasutaka Furukawa. Puzzlefusion: Unleashing the power of diffusion models for spatial puzzle solving. Advances in Neural Information Processing Systems, 36:9574–9597, 2023

work page 2023
[15]

Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022. 10

work page 2022
[16]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model.arXiv preprint arXiv:2503.24290, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Learning Adaptive Reasoning Paths for Efficient Visual Reasoning

Yixu Huang, Tinghui Zhu, and Muhao Chen. Learning adaptive reasoning paths for efficient visual reasoning.arXiv preprint arXiv:2604.14568, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[18]

Vchain: Chain-of-visual-thought for reasoning in video generation.arXiv preprint arXiv:2510.05094, 2025

Ziqi Huang, Ning Yu, Gordon Chen, Haonan Qiu, Paul Debevec, and Ziwei Liu. Vchain: Chain-of-visual-thought for reasoning in video generation.arXiv preprint arXiv:2510.05094, 2025

work page arXiv 2025
[19]

How Far is Video Generation from World Model: A Physical Law Perspective

Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective.arXiv preprint arXiv:2411.02385, 2024

work page internal anchor Pith review arXiv 2024
[20]

All you need to know about kling video 3.0

Kling AI. All you need to know about kling video 3.0. https://kling.ai/blog/ kling-video-3-0-ai-director-features-guide, February 2026

work page 2026
[21]

Growing with the generator: Self-paced grpo for video generation.arXiv preprint arXiv:2511.19356, 2025

Rui Li, Yuanzhi Liang, Ziqi Ni, Haibing Huang, Chi Zhang, and Xuelong Li. Growing with the generator: Self-paced grpo for video generation.arXiv preprint arXiv:2511.19356, 2025

work page arXiv 2025
[22]

Branchgrpo: Stable and efficient grpo with structured branching in diffusion models.arXiv preprint arXiv:2509.06040, 2025

Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, and Shanghang Zhang. Branchgrpo: Stable and efficient grpo with structured branching in diffusion models.arXiv preprint arXiv:2509.06040, 2025

work page arXiv 2025
[23]

From System 1 to System 2: A Survey of Reasoning Large Language Models

Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A survey of reasoning large language models.arXiv preprint arXiv:2502.17419, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Diff-control: A stateful diffusion-based policy for imitation learning

Xiao Liu, Yifan Zhou, Fabian Weigend, Shubham Sonawani, Shuhei Ikemoto, and Heni Ben Amor. Diff-control: A stateful diffusion-based policy for imitation learning. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7453–7460. IEEE, 2024

work page 2024
[26]

V-reasonbench: Toward unified reasoning benchmark suite for video generation models.arXiv preprint arXiv:2511.16668, 2025

Yang Luo, Xuanlei Zhao, Baijiong Lin, Lingting Zhu, Liyao Tang, Yuqi Liu, Ying-Cong Chen, Shengju Qian, Xin Wang, and Yang You. V-reasonbench: Toward unified reasoning benchmark suite for video generation models.arXiv preprint arXiv:2511.16668, 2025

work page arXiv 2025
[27]

McAllister, S

David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, and Angjoo Kanazawa. Flow matching policy gradients.arXiv preprint arXiv:2507.21053, 2025

work page arXiv 2025
[28]

Video generation models in robotics-applications, research challenges, future directions.arXiv preprint arXiv:2601.07823, 2026

Zhiting Mei, Tenny Yin, Ola Shorinwa, Apurva Badithela, Zhonghe Zheng, Joseph Bruno, Madi- son Bland, Lihan Zha, Asher Hancock, Jaime Fernández Fisac, et al. Video generation models in robotics-applications, research challenges, future directions.arXiv preprint arXiv:2601.07823, 2026

work page arXiv 2026
[29]

Do gener- ative video models understand physical principles? InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 948–958, 2026

Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do gener- ative video models understand physical principles? InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 948–958, 2026

work page 2026
[30]

Sora 2 system card

OpenAI. Sora 2 system card. https://openai.com/index/sora-2-system-card/ , September 2025

work page 2025
[31]

GPT-5.5 System Card

OpenAI. GPT-5.5 System Card. https://openai.com/index/gpt-5-5-system-card/ , April 2026

work page 2026
[32]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Learning plug-and-play memory for guiding video diffusion models.arXiv preprint arXiv:2511.19229, 2025

Selena Song, Ziming Xu, Zijun Zhang, Kun Zhou, Jiaxian Guo, Lianhui Qin, and Biwei Huang. Learning plug-and-play memory for guiding video diffusion models.arXiv preprint arXiv:2511.19229, 2025

work page arXiv 2025
[35]

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, et al. Thinking with video: Video generation as a promising multimodal reasoning paradigm.arXiv preprint arXiv:2511.04570, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Diffusion model alignment using direct preference optimization

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

work page 2024
[37]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

A very big video reasoning suite.arXiv preprint arXiv:2602.20159, 2026

Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer, Qingying Gao, Dezhi Luo, Yaoyao Qian, Lianyu Huang, Zelong Hong, et al. A very big video reasoning suite.arXiv preprint arXiv:2602.20159, 2026

work page arXiv 2026
[39]

Demystifing video reasoning.arXiv preprint arXiv:2603.16870, 2026

Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, et al. Demystifing video reasoning.arXiv preprint arXiv:2603.16870, 2026

work page arXiv 2026
[40]

Video models are zero-shot learners and reasoners

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

HunyuanVideo 1.5 Technical Report

Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al. Hunyuanvideo 1.5 technical report.arXiv preprint arXiv:2511.18870, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Arm: Adaptive reasoning model.arXiv preprint arXiv:2505.20258, 2025

Siye Wu, Jian Xie, Yikai Zhang, Aili Chen, Kai Zhang, Yu Su, and Yanghua Xiao. Arm: Adaptive reasoning model.arXiv preprint arXiv:2505.20258, 2025

work page arXiv 2025
[43]

A Systematic Post-Train Framework for Video Generation

Zeyue Xue, Siming Fu, Jie Huang, Shuai Lu, Haoran Li, Yijun Liu, Yuming Li, Xiaoxuan He, Mengzhao Chen, Haoyang Huang, et al. A systematic post-train framework for video generation.arXiv preprint arXiv:2604.25427, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[44]

DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Reasoning via video: The first evaluation of video models’ reasoning abilities through maze-solving tasks.arXiv preprint arXiv:2511.15065, 2025

Cheng Yang, Haiyuan Wan, Yiran Peng, Xin Cheng, Zhaoyang Yu, Jiayi Zhang, Junchi Yu, Xinlei Yu, Xiawu Zheng, Dongzhan Zhou, et al. Reasoning via video: The first evaluation of video models’ reasoning abilities through maze-solving tasks.arXiv preprint arXiv:2511.15065, 2025

work page arXiv 2025
[46]

Diffusion probabilistic modeling for video generation.Entropy, 25(10):1469, 2023

Ruihan Yang, Prakhar Srivastava, and Stephan Mandt. Diffusion probabilistic modeling for video generation.Entropy, 25(10):1469, 2023

work page 2023
[47]

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild.arXiv preprint arXiv:2503.18892, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments.arXiv preprint arXiv:2504.02918, 2025

Chenyu Zhang, Daniil Cherniavskii, Antonios Tragoudaras, Antonios V ozikis, Thijmen Nijdam, Derck WE Prinzhorn, Mark Bodracska, Nicu Sebe, Andrii Zadaianchuk, and Efstratios Gavves. Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments.arXiv preprint arXiv:2504.02918, 2025. 12

work page arXiv 2025
[49]

Deductive beam search: Decoding deducible rationale for chain-of-thought reasoning.arXiv preprint arXiv:2401.17686, 2024

Tinghui Zhu, Kai Zhang, Jian Xie, and Yu Su. Deductive beam search: Decoding deducible rationale for chain-of-thought reasoning.arXiv preprint arXiv:2401.17686, 2024. 13 A Dataset Generation Details We generate all training and evaluation instances using rule-based algorithms so that each sample has a known valid trajectory and task metadata for automatic...

work page arXiv 2024

[1] [1]

Vggrpo: Towards world-consistent video generation with 4d latent reward.arXiv preprint arXiv:2603.26599, 2026

Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, and Marta Tintore Gazulla. Vggrpo: Towards world-consistent video generation with 4d latent reward.arXiv preprint arXiv:2603.26599, 2026

work page arXiv 2026

[2] [2]

Training Diffusion Models with Reinforcement Learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024

work page 2024

[4] [4]

Mmgr: Multi-modal generative reasoning.arXiv preprint arXiv:2512.14691, 2025

Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Wen Xiao, Jiuxiang Gu, et al. Mmgr: Multi-modal generative reasoning.arXiv preprint arXiv:2512.14691, 2025

work page arXiv 2025

[5] [5]

Dgpo: discovering multiple strategies with diversity-guided policy optimization

Wentse Chen, Shiyu Huang, Yuan Chiang, Tim Pearce, Wei-Wei Tu, Ting Chen, and Jun Zhu. Dgpo: discovering multiple strategies with diversity-guided policy optimization. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 11390–11398, 2024

work page 2024

[6] [6]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Language-conditioned world modeling for visual navigation.arXiv preprint arXiv:2603.26741, 2026

Yifei Dong, Fengyi Wu, Yilong Dai, Lingdong Kong, Guangyu Chen, Xu Zhu, Qiyu Hu, Tianyu Wang, Johnalbert Garnica, Feng Liu, et al. Language-conditioned world modeling for visual navigation.arXiv preprint arXiv:2603.26741, 2026

work page arXiv 2026

[8] [8]

Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023

work page 2023

[9] [9]

Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

work page 2020

[10] [10]

Gemini 3.1 Pro Model Card

Google DeepMind. Gemini 3.1 Pro Model Card. https://deepmind.google/models/ model-cards/gemini-3-1-pro/, February 2026

work page 2026

[11] [11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Are video models ready as zero-shot reasoners? an empirical study with the mme-cof benchmark.arXiv preprint arXiv:2510.26802, 2025

Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, and Pheng-Ann Heng. Are video models ready as zero-shot reasoners? an empirical study with the mme-cof benchmark.arXiv preprint arXiv:2510.26802, 2025

work page arXiv 2025

[13] [13]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[14] [14]

Puzzlefusion: Unleashing the power of diffusion models for spatial puzzle solving

Sepidehsadat Sepid Hossieni, Mohammad Amin Shabani, Saghar Irandoust, and Yasutaka Furukawa. Puzzlefusion: Unleashing the power of diffusion models for spatial puzzle solving. Advances in Neural Information Processing Systems, 36:9574–9597, 2023

work page 2023

[15] [15]

Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022. 10

work page 2022

[16] [16]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model.arXiv preprint arXiv:2503.24290, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

Learning Adaptive Reasoning Paths for Efficient Visual Reasoning

Yixu Huang, Tinghui Zhu, and Muhao Chen. Learning adaptive reasoning paths for efficient visual reasoning.arXiv preprint arXiv:2604.14568, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[18] [18]

Vchain: Chain-of-visual-thought for reasoning in video generation.arXiv preprint arXiv:2510.05094, 2025

Ziqi Huang, Ning Yu, Gordon Chen, Haonan Qiu, Paul Debevec, and Ziwei Liu. Vchain: Chain-of-visual-thought for reasoning in video generation.arXiv preprint arXiv:2510.05094, 2025

work page arXiv 2025

[19] [19]

How Far is Video Generation from World Model: A Physical Law Perspective

Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective.arXiv preprint arXiv:2411.02385, 2024

work page internal anchor Pith review arXiv 2024

[20] [20]

All you need to know about kling video 3.0

Kling AI. All you need to know about kling video 3.0. https://kling.ai/blog/ kling-video-3-0-ai-director-features-guide, February 2026

work page 2026

[21] [21]

Growing with the generator: Self-paced grpo for video generation.arXiv preprint arXiv:2511.19356, 2025

Rui Li, Yuanzhi Liang, Ziqi Ni, Haibing Huang, Chi Zhang, and Xuelong Li. Growing with the generator: Self-paced grpo for video generation.arXiv preprint arXiv:2511.19356, 2025

work page arXiv 2025

[22] [22]

Branchgrpo: Stable and efficient grpo with structured branching in diffusion models.arXiv preprint arXiv:2509.06040, 2025

Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, and Shanghang Zhang. Branchgrpo: Stable and efficient grpo with structured branching in diffusion models.arXiv preprint arXiv:2509.06040, 2025

work page arXiv 2025

[23] [23]

From System 1 to System 2: A Survey of Reasoning Large Language Models

Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A survey of reasoning large language models.arXiv preprint arXiv:2502.17419, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[24] [24]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

Diff-control: A stateful diffusion-based policy for imitation learning

Xiao Liu, Yifan Zhou, Fabian Weigend, Shubham Sonawani, Shuhei Ikemoto, and Heni Ben Amor. Diff-control: A stateful diffusion-based policy for imitation learning. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7453–7460. IEEE, 2024

work page 2024

[26] [26]

V-reasonbench: Toward unified reasoning benchmark suite for video generation models.arXiv preprint arXiv:2511.16668, 2025

Yang Luo, Xuanlei Zhao, Baijiong Lin, Lingting Zhu, Liyao Tang, Yuqi Liu, Ying-Cong Chen, Shengju Qian, Xin Wang, and Yang You. V-reasonbench: Toward unified reasoning benchmark suite for video generation models.arXiv preprint arXiv:2511.16668, 2025

work page arXiv 2025

[27] [27]

McAllister, S

David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, and Angjoo Kanazawa. Flow matching policy gradients.arXiv preprint arXiv:2507.21053, 2025

work page arXiv 2025

[28] [28]

Video generation models in robotics-applications, research challenges, future directions.arXiv preprint arXiv:2601.07823, 2026

Zhiting Mei, Tenny Yin, Ola Shorinwa, Apurva Badithela, Zhonghe Zheng, Joseph Bruno, Madi- son Bland, Lihan Zha, Asher Hancock, Jaime Fernández Fisac, et al. Video generation models in robotics-applications, research challenges, future directions.arXiv preprint arXiv:2601.07823, 2026

work page arXiv 2026

[29] [29]

Do gener- ative video models understand physical principles? InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 948–958, 2026

Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do gener- ative video models understand physical principles? InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 948–958, 2026

work page 2026

[30] [30]

Sora 2 system card

OpenAI. Sora 2 system card. https://openai.com/index/sora-2-system-card/ , September 2025

work page 2025

[31] [31]

GPT-5.5 System Card

OpenAI. GPT-5.5 System Card. https://openai.com/index/gpt-5-5-system-card/ , April 2026

work page 2026

[32] [32]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024

[33] [33]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Learning plug-and-play memory for guiding video diffusion models.arXiv preprint arXiv:2511.19229, 2025

Selena Song, Ziming Xu, Zijun Zhang, Kun Zhou, Jiaxian Guo, Lianhui Qin, and Biwei Huang. Learning plug-and-play memory for guiding video diffusion models.arXiv preprint arXiv:2511.19229, 2025

work page arXiv 2025

[35] [35]

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, et al. Thinking with video: Video generation as a promising multimodal reasoning paradigm.arXiv preprint arXiv:2511.04570, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [36]

Diffusion model alignment using direct preference optimization

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

work page 2024

[37] [37]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

A very big video reasoning suite.arXiv preprint arXiv:2602.20159, 2026

Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer, Qingying Gao, Dezhi Luo, Yaoyao Qian, Lianyu Huang, Zelong Hong, et al. A very big video reasoning suite.arXiv preprint arXiv:2602.20159, 2026

work page arXiv 2026

[39] [39]

Demystifing video reasoning.arXiv preprint arXiv:2603.16870, 2026

Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, et al. Demystifing video reasoning.arXiv preprint arXiv:2603.16870, 2026

work page arXiv 2026

[40] [40]

Video models are zero-shot learners and reasoners

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

HunyuanVideo 1.5 Technical Report

Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al. Hunyuanvideo 1.5 technical report.arXiv preprint arXiv:2511.18870, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[42] [42]

Arm: Adaptive reasoning model.arXiv preprint arXiv:2505.20258, 2025

Siye Wu, Jian Xie, Yikai Zhang, Aili Chen, Kai Zhang, Yu Su, and Yanghua Xiao. Arm: Adaptive reasoning model.arXiv preprint arXiv:2505.20258, 2025

work page arXiv 2025

[43] [43]

A Systematic Post-Train Framework for Video Generation

Zeyue Xue, Siming Fu, Jie Huang, Shuai Lu, Haoran Li, Yijun Liu, Yuming Li, Xiaoxuan He, Mengzhao Chen, Haoyang Huang, et al. A systematic post-train framework for video generation.arXiv preprint arXiv:2604.25427, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[44] [44]

DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

Reasoning via video: The first evaluation of video models’ reasoning abilities through maze-solving tasks.arXiv preprint arXiv:2511.15065, 2025

Cheng Yang, Haiyuan Wan, Yiran Peng, Xin Cheng, Zhaoyang Yu, Jiayi Zhang, Junchi Yu, Xinlei Yu, Xiawu Zheng, Dongzhan Zhou, et al. Reasoning via video: The first evaluation of video models’ reasoning abilities through maze-solving tasks.arXiv preprint arXiv:2511.15065, 2025

work page arXiv 2025

[46] [46]

Diffusion probabilistic modeling for video generation.Entropy, 25(10):1469, 2023

Ruihan Yang, Prakhar Srivastava, and Stephan Mandt. Diffusion probabilistic modeling for video generation.Entropy, 25(10):1469, 2023

work page 2023

[47] [47]

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild.arXiv preprint arXiv:2503.18892, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[48] [48]

Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments.arXiv preprint arXiv:2504.02918, 2025

Chenyu Zhang, Daniil Cherniavskii, Antonios Tragoudaras, Antonios V ozikis, Thijmen Nijdam, Derck WE Prinzhorn, Mark Bodracska, Nicu Sebe, Andrii Zadaianchuk, and Efstratios Gavves. Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments.arXiv preprint arXiv:2504.02918, 2025. 12

work page arXiv 2025

[49] [49]

Deductive beam search: Decoding deducible rationale for chain-of-thought reasoning.arXiv preprint arXiv:2401.17686, 2024

Tinghui Zhu, Kai Zhang, Jian Xie, and Yu Su. Deductive beam search: Decoding deducible rationale for chain-of-thought reasoning.arXiv preprint arXiv:2401.17686, 2024. 13 A Dataset Generation Details We generate all training and evaluation instances using rule-based algorithms so that each sample has a known valid trajectory and task metadata for automatic...

work page arXiv 2024