
arxiv: 2601.18577 · v2 · pith:2PIERT2Wnew · submitted 2026-01-26 · 💻 cs.CV · cs.LG

Self-Refining Video Sampling

Pith reviewed 2026-05-21 14:32 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords video generation · self-refinement · denoising autoencoder · motion coherence · self-consistency · inference-time refinement · physical realism · uncertainty-aware sampling

The pith

A pre-trained video generator can refine its own outputs at inference time by treating itself as a denoising autoencoder and selectively updating inconsistent regions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that video generators can improve physical realism and motion coherence by repurposing their own denoising process for iterative refinement during sampling. This approach avoids external verifiers or extra training by running an inner loop where the model reprocesses its prior predictions. An uncertainty-aware step then limits updates to regions where repeated outputs disagree, which prevents new artifacts from over-refinement. A sympathetic reader would care because the method works on existing large-scale models and delivers measurable gains in human preference for more realistic video.

Core claim

By interpreting the generator as a denoising autoencoder, self-refining video sampling enables iterative inner-loop refinement at inference time without any external verifier or additional training. The method further introduces an uncertainty-aware refinement strategy that selectively refines regions based on self-consistency, which prevents artifacts caused by over-refinement. Experiments on state-of-the-art video generators show significant improvements in motion coherence and physics alignment, achieving over 70 percent human preference compared to the default sampler and guidance-based sampler.
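
The inner loop described above can be sketched on a one-dimensional toy problem. This is an illustration under stated assumptions, not the paper's implementation: it assumes a flow-matching interpolation z_t = t * z_1 + (1 - t) * eps with data at t = 1, a hypothetical closed-form `toy_denoiser` standing in for the pre-trained generator, and a perturb step that re-noises the predicted endpoint back to the same noise level. The names `toy_denoiser` and `predict_and_perturb` are invented for this example.

```python
import math
import random


def toy_denoiser(z_t, t):
    """Posterior mean E[z_1 | z_t] when data is uniform on {-1, +1}.

    Assumes z_t = t * z_1 + (1 - t) * eps with eps ~ N(0, 1), so the
    posterior mean has the closed form tanh(t * z_t / (1 - t)^2).
    """
    sigma = 1.0 - t
    return math.tanh(t * z_t / (sigma * sigma))


def predict_and_perturb(z_t, t, denoise, num_iters, rng):
    """Refine z_t at a fixed noise level t (0 < t < 1) for num_iters rounds."""
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")
    for _ in range(num_iters):
        z1_hat = denoise(z_t, t)             # predict the clean endpoint
        eps = rng.gauss(0.0, 1.0)            # draw fresh noise
        z_t = t * z1_hat + (1.0 - t) * eps   # perturb back to noise level t
    return z_t, denoise(z_t, t)


rng = random.Random(0)
refined_z_t, z1_hat = predict_and_perturb(0.05, 0.5, toy_denoiser, 3, rng)
print(refined_z_t, z1_hat)
```

The loop never changes `t`, which is the sense in which refinement happens "within the same noise level"; an outer ODE solver would then continue from the refined latent. No extra model or training signal appears anywhere in the loop, only repeated calls to the same denoiser.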

What carries the argument

Self-refining video sampling, which repurposes a pre-trained generator for iterative inner-loop refinement at inference using self-consistency to guide selective updates.

If this is right

  • Generated videos exhibit higher motion coherence without changes to the underlying model.
  • Physics alignment improves through inference-only adjustments on existing generators.
  • Over-refinement artifacts are avoided by limiting updates to inconsistent regions.
  • The method applies directly to current state-of-the-art video models.
  • Human evaluators prefer the refined outputs more than 70 percent of the time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same self-refinement idea could extend to image or audio generators that rely on denoising steps.
  • Inference-time loops might reduce reliance on expensive fine-tuning for better physical realism in generative tasks.
  • If self-consistency reliably flags problem areas, similar checks could help diagnose failures in other sampling-based models.
  • Optimizing the number of refinement iterations could enable practical use in resource-limited settings.

Load-bearing premise

That the generator's built-in denoising behavior can be reliably reused for self-improvement and that agreement across repeated predictions correctly identifies regions that need refinement without creating new problems.
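
One way to make this premise concrete is a per-location disagreement score across repeated predictions, thresholded into a binary mask. The following is a minimal sketch, assuming the uncertainty is the mean absolute difference between consecutive predictions and that only locations above a threshold `tau` are overwritten. Both choices, and the function names, are assumptions for illustration rather than the paper's exact definitions.

```python
def uncertainty_map(predictions):
    """Mean absolute difference between consecutive predictions, per location.

    `predictions` is a list of equally long lists, one per repeated prediction.
    """
    if len(predictions) < 2:
        raise ValueError("need at least two predictions to measure disagreement")
    pairs = list(zip(predictions[:-1], predictions[1:]))
    size = len(predictions[0])
    return [
        sum(abs(later[i] - earlier[i]) for earlier, later in pairs) / len(pairs)
        for i in range(size)
    ]


def masked_update(current, proposal, uncertainty, tau):
    """Accept the proposal only where uncertainty exceeds tau; keep the rest."""
    return [p if u > tau else c for c, p, u in zip(current, proposal, uncertainty)]


preds = [[0.0, 1.0, 0.5], [0.0, 0.2, 0.5], [0.0, 0.9, 0.6]]
u = uncertainty_map(preds)
refined = masked_update([1.0, 1.0, 1.0], [9.0, 9.0, 9.0], u, tau=0.25)
print(u, refined)
```

In this toy input only the middle location disagrees strongly across predictions, so it is the only one updated. A location where the predictions agree is left untouched whether or not they are correct, which is exactly the dependence the premise describes.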

What would settle it

The claim would be undermined if running the refinement loop on standard video benchmarks produced no gain, or a drop, in motion quality scores such as optical flow consistency or human-rated physics alignment.

Figures

Figures reproduced from arXiv: 2601.18577 by Jaehong Yoon, Jaehyeong Jo, Saining Xie, Sangwon Jang, Sung Ju Hwang, Taekyung Ki.

Figure 1
Figure 1: Concept of the self-refining video sampling. Within the same noise level, the video latent z_t is refined as the predicted endpoint ẑ_1 is pulled toward the data manifold. … lead to incorrect actions. Recent works attempt to address these limitations by either incorporating external models or additional training. One line of work employs external verifiers to improve physical plausibility via rejection sampli…
Figure 2
Figure 2: Sampling comparison on a 2D synthetic dataset. (a-b) P&P generates samples closer to the data manifold than the Euler solver. (c-d) With a fixed timestep, iterative P&P pulls the prediction ẑ_1 closer to the data manifold. … where u_θ is the trained vector field model, for which D_θ maps the noisy state z_t to a prediction of the clean sample ẑ_1. Moreover, the corruption of state z at timestep t corresponds to…
Figure 3
Figure 3: Visualization of uncertainty maps, showing higher values in motion-related regions. Maps are computed at t = 0.1T. Bottom row overlays the corresponding binary masks (τ = 0.25) on videos generated by Wan2.2-A14B T2V (Wang et al., 2025a). … Notably, the proposed P&P can be integrated into existing ODE solvers in a plug-and-play manner, by simply replacing z_t with the refined z*_t := z_t^(K_f) with K_f ≤ 3: z_…
Figure 4
Figure 4: Qualitative comparison on challenging motion generation. Panel text: Cosmos-2.5 2B, + Ours. Prompt: "… robotic arm descends towards the metallic bowl, gripping it … then places it down on the blue cloth at the right side..."
Figure 5
Figure 5: Qualitative comparison on I2V generation in robotics domain. Panel text: Wan2.2-A14B, + Ours. Prompt: "… two children ... As they pull apart the sand, it divides into two distinct piles, revealing the textures and colors of the earthy material..."
Figure 6
Figure 6: Qualitative comparison on physics-aligned video generation. Panel text: Wan2.2-A14B, + Ours. Prompt: "A first-person view… camera performs a quick, continuous 360-degree … vibrant neon sign glows with the word ‘SKYLINE’ …" Frame labels: 15th, 20th, 60th, 65th; 15th, 24th, 72nd, 81st.
Figure 7
Figure 7: Qualitative comparison on spatially consistent video generation.
Figure 8
Figure 8: Examples of self-refinement applied to visual reasoning tasks: (Top) graph traversal and (Bottom) maze solving from Wiedemer et al. (2025). We use Wan2.2-A14B I2V as the base model. For graph traversal, self-refinement yields a dramatic improvement in the success rate from 0.1 to 0.8. For maze solving, self-refinement does not yield meaningful gain, with success remaining near zero.
Figure 9
Figure 9: Ablation on uncertainty-aware strategy. Multiple P&P updates without uncertainty-aware strategy cause over-saturation. Red arrow indicates motion misaligned with the prompt. … Restart (Xu et al., 2023) alternates between forward noising restart steps and deterministic backward ODE integration, using stochasticity to reduce error accumulation. At a high level, it resembles our approach in that noise injection…
Figure 10
Figure 10: Full human evaluation results on Dynamic-Bench, including ties. Chart text: Physical Commonsense (Base ODE, NFEx2, CFG-Zero): 72% / 24% / 4%, 60% / 30% / 10%, 64% / 29% / 8%. Text Alignment: 42% / 46% / 12%, 42% / 45% / 13%, 37% / 45% / 18%. Legend: Ours, Tie, Baseline.
Figure 11
Figure 11: Full human evaluation results on VideoPhy2 (Bansal et al., 2025) hard subset, including ties. … 2025b), which provides sufficiently strong performance on robot-domain evaluation. We provide the grasp success rate evaluated using Gemini 3 Flash (Google, 2025a) in …
Figure 12
Figure 12: Gemini prompt for evaluating grasp success rate in Tab. 2. We treat scores of 4 or 5 as successful grasps.
Figure 14
Figure 14: Accumulated effect of iterative P&P at an early inference step. We plot the L2 distance between the intermediate refined latent ẑ_1^(k) and the final refined latent ẑ*_1 at a fixed inference step t = 0.009T. Results are obtained using Wan2.2-A14B T2V. Panel text: prompt "A cat holding … “Predict-and-Perturb: Self-Refining Video Sampling”"; (a) FLUX.1-dev (50 NFEs); (b) + Ours (50+2 NFEs).
Figure 15
Figure 15: Image generation with P&P using FLUX.1-dev. With only two additional NFEs (4%), our method effectively reduces text-related artifacts, resulting in clearer and more coherent text. … generate four samples with different random seeds using the prompt “A cat holding a sign that says ‘Predict-and-Perturb: Self-Refining Video Sampling’.” For image generation, P&P is applied only twice at the 10th inference step …
Figure 16
Figure 16: Ablation studies on the hyperparameters K_f and τ. Panel text: (a) ODE (Euler); (b) P&P: 3-4 steps (t_i < t_α = 0.03); (c) P&P: 3-6 steps (t_i < t_α = 0.05); (d) P&P: 3-10 steps (t_i < t_α = 0.1); (e) P&P: 6-10 steps (0.05 < t_i < 0.1).
Figure 17
Figure 17: Ablation studies on the hyperparameter α. Gray blocks indicate Euler method and orange blocks indicate P&P. P&P significantly improves motion coherence when applied in earlier steps (b-c), while providing only marginal gains at later steps (d-e).
Figure 18
Figure 18: Graph traversal task in Wiedemer et al. (2025). We use Wan2.2-A14B I2V with an upsampled prompt: “Starting from the blue well, blue water begins to flow slowly through the connected channel system. The water gradually fills the nearest nodes first...”. The success rate increases from 0.1 to 0.8 with P&P method. Panel text: Trial #1, Trial #2; (a) Wan2.2-A14B I2V; (b) + Ours; “The red square slides smoot…
Figure 19
Figure 19: Maze solving task in Wiedemer et al. (2025). We use Wan2.2-A14B I2V with a base prompt: “The red square slides smoothly along the white path, stopping perfectly on the green square.” Both the base model and P&P method achieve near-zero success rates. Panel text: Generated Video; Uncertainty Map U; t = 0.037T (5/40 step), t = 0.056T (7/40 step), t = 0.078T (9/40 step).
Figure 20
Figure 20: Visualization of uncertainty maps across inference timesteps. Overall uncertainty gradually decreases as inference progresses. Even at an early timestep (t = 0.0037T), higher uncertainty values are observed for objects exhibiting motion.
Figure 21
Figure 21: Toy experiment on a 2D Gaussian mixture. Repeated P&P iterations (i.e., K_f = 32) yield samples concentrated in the modes. Panel text: prompt “An animal”; Base ODE (UniPC); P&P (Excessive iterations).
Figure 22
Figure 22: Mode-seeking behavior induced by excessive P&P iterations in image generation. We use Wan2.2-A14B T2V with a single frame and apply P&P with K_f = 8, τ = 0 at steps 16-20 of the 40 step flow matching inference.
Figure 23
Figure 23: P&P is also applicable to diffusion-based video generation models (e.g., CogVideoX (Yang et al., 2025b)), where it corrects video artifacts, such as a truncated lightsaber and distortions around the teddy bear’s mouth. (Image credit: MuDI (Jang et al., 2024)) … artifacts or flickering. From this perspective, our method can be viewed as an intended temporal mode-seeking for improving output consistency.
Figure 24
Figure 25
Figure 25: Additional visual examples of physics-aligned generation using Wan2.2-A14B T2V. Our method also captures realistic physical interactions and fine-grained visual details.
Figure 26
Figure 26: A screenshot of the human evaluation questionnaires used for (left) motion-enhanced video generation on Dynamic-Bench and (right) physics-aligned video generation on the VideoPhy2 hard subset.
Figure 27
Figure 27: Qualitative comparison with commercial closed models, Veo 3.1 (Google, 2025b) and Kling 2.6 (Kuaishou, 2025). While the commercial models produce more aesthetic visual quality, our method demonstrates competitive performance on complex motion scenarios. Prompt: “A parkour athlete runs up a vertical wall, grabs the ledge, and muscles up to stand on the roof in one fluid motion.” and “A gymnast on a pommel …
Original abstract

Modern video generators still struggle with complex physical dynamics, often falling short of physical realism. Existing approaches address this using external verifiers or additional training on augmented data, which is computationally expensive and still limited in capturing fine-grained motion. In this work, we present self-refining video sampling, a simple method that uses a pre-trained video generator trained on large-scale datasets as its own self-refiner. By interpreting the generator as a denoising autoencoder, we enable iterative inner-loop refinement at inference time without any external verifier or additional training. We further introduce an uncertainty-aware refinement strategy that selectively refines regions based on self-consistency, which prevents artifacts caused by over-refinement. Experiments on state-of-the-art video generators demonstrate significant improvements in motion coherence and physics alignment, achieving over 70% human preference compared to the default sampler and guidance-based sampler.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes self-refining video sampling, which interprets a pre-trained video generator as a denoising autoencoder to enable iterative inner-loop refinement at inference time without external verifiers or additional training. It introduces an uncertainty-aware strategy that uses self-consistency across repeated forward passes to selectively refine regions, claiming improvements in motion coherence and physics alignment with over 70% human preference versus default and guidance samplers on state-of-the-art generators.

Significance. If the central claims hold under rigorous validation, the work would be significant for generative video modeling: it provides a training-free inference-time mechanism to address physical realism using only the model's own denoising behavior, avoiding the cost of external verifiers or data augmentation. The self-consistency-based uncertainty mask is a potentially reusable idea for other generative settings.

major comments (3)
  1. [Abstract] Abstract: the headline claim of >70% human preference and improvements in motion coherence/physics alignment is presented without any quantitative metrics (FVD, motion scores, etc.), exact baseline implementations, sample counts, participant details, or statistical tests; these omissions are load-bearing for the central empirical claim.
  2. [Method] Uncertainty-aware refinement strategy: the assumption that low self-consistency regions are exactly those violating physics (and that refinement corrects them without new artifacts) is not directly tested. If the training distribution contains systematic but consistent errors (e.g., incorrect friction reproduced reliably across samples), self-consistency would be high precisely where refinement is most needed, leaving those regions untouched; a concrete counterexample or controlled test is required.
  3. [Experiments] Experiments: ablation details for the self-consistency threshold, number of refinement iterations, and direct comparisons to specific guidance-based samplers are absent, preventing assessment of whether the reported gains are attributable to the proposed components rather than generic extra denoising steps.
minor comments (1)
  1. [Experiments] Clarify the exact video generators and versions used in the experiments for reproducibility.
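
The statistical test requested in major comment 1 could take the form of an exact sign test on pairwise preferences, with ties set aside. The sketch below assumes a two-sided exact binomial test against a 50% null. The counts in the usage lines are hypothetical placeholders, not numbers from the paper, and the function names are invented for this example.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Two-sided exact binomial p-value for wins vs. losses under p = 0.5.

    Ties are excluded before calling this function.
    """
    n = wins + losses
    if n == 0:
        return 1.0
    probs = [comb(n, k) / 2 ** n for k in range(n + 1)]
    observed = probs[wins]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    total = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, total)


def preference_rate(wins, losses, ties):
    """Share of all comparisons (ties included) in which the method was preferred."""
    return wins / (wins + losses + ties)


# Hypothetical counts for illustration only.
wins, losses, ties = 36, 14, 10
print(preference_rate(wins, losses, ties), sign_test_p_value(wins, losses))
```

Reporting the preference rate with ties included, alongside a p-value computed without them, makes it explicit how a headline percentage relates to the underlying counts.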

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's feedback on our work. We address each of the major comments in detail below and indicate the changes we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim of >70% human preference and improvements in motion coherence/physics alignment is presented without any quantitative metrics (FVD, motion scores, etc.), exact baseline implementations, sample counts, participant details, or statistical tests; these omissions are load-bearing for the central empirical claim.

    Authors: We agree that the abstract could benefit from additional details to support the claims. In the revised manuscript, we will include references to key quantitative results from our experiments, such as FVD scores and motion metrics, along with a brief mention of the human study setup. Due to abstract length limits, we will ensure the Experiments section provides all requested details including sample counts, participant information, and statistical tests. revision: yes

  2. Referee: [Method] Uncertainty-aware refinement strategy: the assumption that low self-consistency regions are exactly those violating physics (and that refinement corrects them without new artifacts) is not directly tested. If the training distribution contains systematic but consistent errors (e.g., incorrect friction reproduced reliably across samples), self-consistency would be high precisely where refinement is most needed, leaving those regions untouched; a concrete counterexample or controlled test is required.

    Authors: This point highlights an important aspect of our method. Our current results show that the self-consistency uncertainty effectively identifies regions with motion artifacts, leading to improved physics alignment in human evaluations. However, we recognize that a direct counterexample test for systematic errors is not present. We will add a new subsection in the Experiments or Discussion to address this by providing a controlled analysis or acknowledging the limitation if such errors exist in the data. revision: partial

  3. Referee: [Experiments] Experiments: ablation details for the self-consistency threshold, number of refinement iterations, and direct comparisons to specific guidance-based samplers are absent, preventing assessment of whether the reported gains are attributable to the proposed components rather than generic extra denoising steps.

    Authors: We thank the referee for this observation. The manuscript does include comparisons to guidance-based samplers, but we will enhance the Experiments section with detailed ablations. Specifically, we will report results for different self-consistency thresholds, varying numbers of refinement iterations, and clarify the exact implementations of the baselines to demonstrate that the improvements stem from our uncertainty-aware approach rather than additional computation alone. revision: yes
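
Separating the effect of the method from that of extra denoising steps requires counting network function evaluations (NFEs) on both sides of the comparison. The sketch below assumes each refinement iteration costs one additional NFE, which is consistent with the "50+2 NFEs" and "4%" figures quoted in the figure captions above. The helper names and the schedule format (a mapping from step index to number of refinement iterations) are hypothetical.

```python
def total_nfe(base_steps, refine_iters_per_step):
    """Base solver evaluations plus one extra evaluation per refinement iteration."""
    for step in refine_iters_per_step:
        if not 1 <= step <= base_steps:
            raise ValueError(f"step {step} is outside 1..{base_steps}")
    return base_steps + sum(refine_iters_per_step.values())


def overhead(base_steps, refine_iters_per_step):
    """Extra NFEs as a fraction of the base budget."""
    extra = total_nfe(base_steps, refine_iters_per_step) - base_steps
    return extra / base_steps


def compute_matched_steps(base_steps, refine_iters_per_step):
    """Step count for a plain sampler with the same NFE budget (assumes 1 NFE per step)."""
    return total_nfe(base_steps, refine_iters_per_step)


schedule = {10: 2}  # two refinement iterations at the 10th of 50 steps
print(total_nfe(50, schedule), overhead(50, schedule), compute_matched_steps(50, schedule))
```

A compute-matched baseline run with `compute_matched_steps` plain solver steps gives the comparison the referee asks for: any remaining gap cannot be attributed to additional evaluations alone.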

Circularity Check

0 steps flagged

No circularity: inference-time procedure uses pre-trained model behavior without reducing predictions to self-defined quantities

Full rationale

The paper describes an inference procedure that reinterprets an existing pre-trained video generator as a denoising autoencoder to perform iterative refinement at test time, plus an uncertainty mask from repeated self-consistent forward passes. No equations or central claims reduce a derived quantity to a parameter fitted from the method's own outputs, nor do they rely on self-citation chains or imported uniqueness theorems for load-bearing justification. The approach is presented as an independent application of the model's existing denoising behavior without additional training, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the domain assumption that a diffusion-style video generator can be reused as an effective self-refiner and that self-consistency serves as a valid uncertainty proxy.

axioms (1)
  • domain assumption: A pre-trained video generator can be interpreted and reused as a denoising autoencoder for iterative self-refinement.
    This interpretation is invoked to justify the inner-loop refinement without external components.

pith-pipeline@v0.9.0 · 5682 in / 1173 out tokens · 116522 ms · 2026-05-21T14:32:08.973105+00:00 · methodology


Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. $h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement

    cs.CV 2026-05 unverdicted novelty 7.0

    h-control introduces block-conditional pseudo-Gibbs refinement for training-free camera control in flow-matching video generators, achieving superior FVD scores on RealEstate10K and DAVIS benchmarks.

  2. CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

    cs.CV 2026-05 unverdicted novelty 7.0

    CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.

  3. $h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement

    cs.CV 2026-05 unverdicted novelty 6.0

    h-control augments hard-replacement guidance with block-conditional pseudo-Gibbs refinement on unobserved latent sites and adaptive 3D patch freezing to achieve superior FVD on RealEstate10K and DAVIS.

  4. On the Robustness of Distribution Support under Diffusion Guidance

    cs.LG 2026-05 unverdicted novelty 6.0

    Guided diffusion generates samples near the target distribution support under exact score access, explaining its empirical success in producing plausible outputs.

  5. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  6. On the Robustness of Distribution Support under Diffusion Guidance

    cs.LG 2026-05 unverdicted novelty 4.0

    Establishes robustness of distribution support for guided diffusion processes under exact score access across DDIM, DDPM, and exponential integrator discretizations.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 4 Pith papers · 2 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Bai, L., Shao, S., Zhou, Z., Qi, Z., Xu, Z., Xiong, H., and Xie, Z. Zigzag diffusion sampling: Diffusion models can self-improve via self-reflection. In International Conference on Learning Representations, 2025a. Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2.5-VL technical report....

  2. [2]

    Videophy-2: A challenging action-centric physical commonsense evaluation in video generation

    URL https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/. Bansal, H., Peng, C., Bitton, Y., Goldenberg, R., Grover, A., and Chang, K.-W. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation. arXiv preprint arXiv:2503.06800, 2025. Bengio, Y., Yao, L., Alain, G., and Vincent, P. Ge...

  3. [3]

    Phygdpo: Physics-aware groupwise direct preference optimization for physically consistent text-to-video generation. arXiv preprint arXiv:2512.24551

    URL https://openai.com/research/video-generation-models-as-world-simulators. Cai, Y., Li, K., Jia, M., Wang, J., Sun, J., Liang, F., Chen, W., Juefei-Xu, F., Wang, C., Thabet, A., et al. Phygdpo: Physics-aware groupwise direct preference optimization for physically consistent text-to-video generation. arXiv preprint arXiv:2512.24551, 2025a. Cai, ...

  4. [4]

    phyworldbench

    Fan, W., Zheng, A. Y., Yeh, R. A., and Liu, Z. Cfg-zero*: Improved classifier-free guidance for flow matching models. arXiv preprint arXiv:2503.18886, 2025. Gillman, N., Herrmann, C., Freeman, M., Aggarwal, D., Luo, E., Sun, D., and Sun, C. Force prompting: Video generation models can learn and generalize physics...

  5. [5]

    Jang, S., Jo, J., Lee, K., and Hwang, S. J. Identity decoupling for multi-subject personalization of text-to-image models. Advances in Neural Information Processing Systems, 37:100895–100937, 2024. Jang, S., Ki, T., Jo, J., Yoon, J., Kim, S. Y., Lin, Z., and Hwang, S. J. Frame guidance: Training-free guidance for frame-level control in video diff...

  6. [6]

    Reasoning with Sampling: Your Base Model is Smarter Than You Think

    Kang, B., Yue, Y., Lu, R., Lin, Z., Zhao, Y., Wang, K., Huang, G., and Feng, J. How far is video generation from world model: A physical law perspective. In International Conference on Machine Learning, 2025. Karan, A. and Du, Y. Reasoning with sampling: Your base model is smarter than you think. arXiv preprint arXiv:2510.14901, 2025. Kong, W., Tia...

  7. [7]

    Vlipp: Towards physically plausible video generation with vision and language informed physical prior. arXiv:2503.23368, 2025a

    Yang, X., Li, B., Zhang, Y., Yin, Z., Bai, L., Ma, L., Wang, Z., Cai, J., Wong, T.-T., Lu, H., et al. Vlipp: Towards physically plausible video generation with vision and language informed physical prior. arXiv:2503.23368, 2025a. Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., Yin, D., Yuxuan.Zhang...

  8. [8]

    Unipc: A unified predictor-corrector framework for fast sampling of diffusion models. Advances in Neural Information Processing Systems, 36:49842–49869, 2023

    Zhao, W., Bai, L., Rao, Y., Zhou, J., and Lu, J. Unipc: A unified predictor-corrector framework for fast sampling of diffusion models. Advances in Neural Information Processing Systems, 36:49842–49869, 2023. Zhou, F., Huang, J., Li, J., Ramanan, D., and Shi, H. Pai-bench: A comprehensive benchmark for physical ai. arXiv preprint arXiv:2512.01989, 2025....

  9. [9]

    Contact Detection: Does the robot’s gripper/end-effector make actual physical contact with the object?

  10. [10]

    Grasp Validity: Is the grasp physically plausible? (No floating objects, no penetration artifacts)

  11. [11]

    Object Manipulation: After grasping, is the object properly held/moved by the robot?

  12. [12]

    A person throws a throwing axe at a large pumpkin

    Visual Artifacts: Are there any visual artifacts such as objects floating without contact, gripper passing through objects, or impossible physical interactions? Instructions for Scoring: - 1 (Fail): No grasp attempt, or severe artifacts (object floats, no contact, gripper passes through object) - 2 (Poor): Grasp attempted but clear physical violations (pa...

  13. [13]

    2. A chef tosses a pizza dough high into the air, catching it on their knuckles and spinning it to expand its size

    A bowling ball rolls down a polished lane and strikes a perfect strike, sending all ten pins flying in different trajectories. 2. A chef tosses a pizza dough high into the air, catching it on their knuckles and spinning it to expand its size. 3. A playful Golden Retriever catches a frisbee in mid-air, causing the dog to twist its body and land on its hind legs

  14. [14]

    A robot arm on an assembly line picks up a car door and precisely welds it onto a chassis, creating sparks upon contact

  15. [15]

    A gust of wind blows a stack of papers off an outdoor table, causing a person to scramble and catch them before they fly away

  16. [16]

    A sword fighter parries a heavy blow from an opponent’s axe, causing the axe to slide down the blade and spark against the crossguard

  17. [17]

    8. A pool player executes a jump shot; the cue ball hops over a blocking ball to sink the 8-ball in the corner pocket

    A child builds a tower of wooden blocks, then pulls a bottom block out, causing the structure to wobble and collapse chaotically. 8. A pool player executes a jump shot; the cue ball hops over a blocking ball to sink the 8-ball in the corner pocket

  18. [18]

    A sweeping broom pushes a pile of dust and small debris into a dustpan, with some dust particles escaping into the air

  19. [19]

    11. A basketball hits the rim, bounces straight up, hits the backboard, and finally falls through the net

    A drone flies into a hanging wind chime, tangling its propellers in the strings and causing the chimes to swing violently. 11. A basketball hits the rim, bounces straight up, hits the backboard, and finally falls through the net. 12. A wrecking ball smashes through a brick wall, sending debris and dust clouding into the interior of the building. 13. A person...

  20. [20]

    17. A cat paws at a dangling yarn ball, causing it to swing in a pendulum motion while the cat tries to grab it again

    A bartender shakes a cocktail mixer vigorously, with ice cubes audibly clinking and condensation forming on the metal exterior. 17. A cat paws at a dangling yarn ball, causing it to swing in a pendulum motion while the cat tries to grab it again

  21. [21]

    19. A person opens a shaken soda can, causing foam to spray out and coat their hand and the table

    A heavy book falls from a shelf onto a beanbag chair, causing the chair to depress deeply and then slowly regain some shape. 19. A person opens a shaken soda can, causing foam to spray out and coat their hand and the table. 20. A skateboarder grinds along a metal rail, sparks flying from the trucks before they land on the concrete. 21. A knife slices through...

  22. [22]

    24. A soccer goalkeeper punches a high ball, changing its trajectory from toward the net to over the crossbar

    A wrecking crew uses a grapple to pull down a rusted metal tower, which twists and buckles before hitting the ground. 24. A soccer goalkeeper punches a high ball, changing its trajectory from toward the net to over the crossbar

  23. [23]

    A magnet is brought close to a pile of iron filings, causing them to leap up and attach to the magnet in a spiky pattern

  24. [24]

    27. A person struggles to close an overfilled suitcase, sitting on it to compress the clothes inside before zipping it shut

    A domino chain reaction begins, with the dominoes splitting into two separate paths that eventually trigger a small flag to raise. 27. A person struggles to close an overfilled suitcase, sitting on it to compress the clothes inside before zipping it shut. 28. A hammer strikes a nail, driving it partially into the wood, but th...

  25. [25]

    30. A figure skater lifts their partner overhead, rotating while the partner holds a pose, their costumes flowing together

    A bird lands on a thin tree branch, causing the branch to bow significantly under the weight and bounce as the bird stabilizes. 30. A figure skater lifts their partner overhead, rotating while the partner holds a pose, their costumes flowing together. 31. A person uses a wrench to tighten a leaking pipe; as the nut turns, the water spray reduces to a drip. ...

  26. [26]

    A robotic vacuum bumps into a sleeping dog, causing the dog to lift its head and the vacuum to rotate and move away. 35. A majestic eagle swoops down to the water surface, snatching a fish with its talons and creating a splash pattern. 36. A person playing Jenga carefully pushes a block from the center, the tower swaying slightly but remaining upright. 37. A...

  27. [27]

    A Newton’s Cradle is set in motion; one ball hits the stack, and the ball on the opposite end swings out, demonstrating momentum transfer. 41. A breakdancer performs a headspin, transitioning smoothly into a freeze pose with legs crossed in the air. 42. A parkour athlete runs up a vertical wall, grabs the ledge, and muscles up to stand on the roof in one fl...

  28. [28]

    A figure skater executes a triple axel, taking off forward and rotating three and a half times before landing backward on one foot

  29. [29]

    A capoeira practitioner performs a ginga movement followed immediately by a low sweeping leg kick (meia lua de compasso)

  30. [30]

    A high jumper performs the Fosbury Flop, arching their back severely over the bar and kicking their legs up at the last second

  31. [31]

    A yoga instructor flows from a downward dog into a scorpion handstand, balancing on their forearms with legs arched over their head

  32. [32]

    A sprinter explodes out of the starting blocks, body at a 45-degree angle, transitioning into an upright running posture

  33. [33]

    A rock climber performs a dynamic "dyno" move, leaping from one hold to a distant hold, catching it with one hand and swinging. 52. A rhythmic gymnast throws a hoop high into the air, performs a cartwheel, and catches the hoop with her foot

  34. [34]

    A snowboarder rides up a halfpipe, performs a McTwist (inverted 540 degree spin), and lands cleanly on the transition. 54. A professional wrestler performs a suplex on a dummy, arching their back to throw the weight over their head

  35. [35]

    A salsa dancer spins their partner rapidly, then dips them low to the ground, pausing for a beat before pulling them back up. 56. A pole vaulter plants the pole, the pole bends dramatically, launching the athlete feet-first over the bar

  36. [36]

    A surfer performs a sharp cutback on a wave, twisting their torso and shifting weight to spray water off the tail of the board. 58. A contortionist slowly bends backward from a standing position until they grab their own ankles. 59. A hip-hop dancer performs "the worm," rippling their body along the floor from chest to feet. 60. A soccer player performs a bi...

  37. [37]

    shot," jerking backward violently and falling over a railing, flailing arms. 68. A tai chi master performs

    A traditional Indian dancer (Bharatanatyam) stomps rhythmically while performing complex mudras (hand gestures) and eye movements. 65. A cheerleader is thrown into the air, performs a twist, and is caught in a cradle position by her teammates. 66. A skateboarder performs a tre-flip (360 pop shove-it plus a kickflip) down a set of stairs. 67. A stunt performe...

  38. [38]

    A trapeze artist releases their bar, performs a triple somersault in mid-air, and is caught by the catcher on the opposing bar. 72. A person slips on a banana peel (cartoon style), feet flying up above their head before they land flat on their back

  39. [39]

    A cricket bowler runs up and delivers the ball with a straight-arm action, following through with their body momentum. 74. A baton twirler spins the baton around their body, under their legs, and over their neck without using their hands

  40. [40]

    A synchronized swimming team emerges from the water in a pyramid formation, holding the pose before sinking back down. 76. A BMX rider performs a backflip tailwhip over a dirt jump, kicking the bike frame around while upside down. 77. A slackliner walks across a loose line, arms flailing to maintain balance as the line shakes violently

  41. [41]

    An ice hockey goalie drops into a butterfly position to block a shot, then quickly scrambles back to a standing position. 79. A conductor leads an orchestra with vigorous arm movements, hair flying as they signal a crescendo

  42. [42]

    A gymnast on a pommel horse swings their legs in wide circles (flares), supporting their entire weight on alternating hands

  43. [43]

    A glass of red wine shatters on a marble floor, the liquid splashing outward in slow motion while shards glide across the surface

  44. [44]

    Thick, golden honey is poured from a jar onto a stack of pancakes, folding over itself and slowly dripping down the sides. 83. A silk scarf blows in a violent gale storm, rippling rapidly and snapping in the wind without tearing.

  45. [45]

    A water balloon hits a person’s face in slow motion, the rubber expanding around their features before bursting and spraying water

  46. [46]

    A large soap bubble floats through the air, wobbling and reflecting an iridescent rainbow before popping into tiny droplets. 86. A campfire crackles in the night, with sparks rising in a spiral pattern and smoke shifting direction with the breeze. 87. A car drives through thick fog, its headlights creating volumetric beams that illuminate the swirling mist ...

  47. [47]

    A block of dry ice is dropped into warm water, instantly generating a thick, heavy white fog that spills over the container’s edge. 89. A handful of glitter is thrown into the air, catching the light and twinkling as it drifts slowly to the ground. 90. A large wave crashes against a cliffside, the water atomizing into a fine mist and white foam running down...

  48. [48]

    A cannonball is fired into a sand dune, displacing a massive crater of sand that sprays outward and slides back into the hole. 92. A heavy velvet curtain is pulled back, bunching up in thick, heavy folds that sway heavily with the movement. 93. A distinct drop of ink falls into a glass of clear water, blooming into abstract, smoke-like tendrils as it diffuses

  49. [49]

    A pristine snowbank collapses, triggering a small avalanche where clumps of snow break apart into powder as they slide

  50. [50]

    A jellyfish swims in the deep ocean, its translucent bell pulsing rhythmically and its long tentacles trailing fluidly behind

  51. [51]

    A person with long hair stands in front of a high-powered fan, the hair whipping chaotically and obscuring their face

  52. [52]

    Molten lava flows slowly down a volcano, the surface cooling into black crust while red-hot magma breaks through the cracks. 98. A rubber ball bounces on a trampoline, depressing the surface deeply and launching higher with every bounce. 99. A stack of newspapers is left in the rain; the paper darkens, sags, and begins to disintegrate into pulp. 100. A torna...

  53. [53]

    A wet dog shakes itself dry in slow motion, the loose skin rippling and water droplets forming a halo around the animal

  54. [54]

    A porcelain vase is glued back together, but when filled with water, it slowly leaks from the cracks, forming beads on the surface

  55. [55]

    A huge flag waves in slow motion, showcasing the heavy fabric rolling and snapping, creating shadows within the folds

  56. [56]

    Oil and vinegar are shaken in a bottle, forming temporary emulsions of small bubbles that slowly separate back into layers. 110. A meteor enters the atmosphere, burning up with a fiery tail and shedding glowing debris before disintegrating. 111. A feather falls in a vacuum chamber (straight down) versus a feather falling in air (drifting side to side). ...

  57. [57]

    A mesmerizing ferrofluid spikes and dances in response to a moving magnetic field, the black liquid looking alien and sharp. 113. Raindrops hit a puddle, creating concentric ripples that interfere with one another in a complex geometric pattern. 114. A marshmallow is roasted over a fire, the outer skin bubbling, browning, and eventually catching a small blu...