Pith · machine review for the scientific record

arxiv: 2605.15185 · v1 · submitted 2026-05-14 · 💻 cs.CV · cs.AI

Recognition: 2 Lean theorem links

Quantitative Video World Model Evaluation for Geometric Consistency

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 03:06 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video generation · geometric consistency · world models · 3D reconstruction · evaluation metrics · perspective distortion · motion consistency · structural rigidity

The pith

PDI-Bench quantifies geometric coherence in generated videos by measuring projective residuals from 3D lifts of tracked points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PDI-Bench as a quantitative framework to audit whether generative video models produce consistent 3D structure and motion. It segments objects, tracks points across frames, lifts them to world-space coordinates via monocular reconstruction, and computes residuals that capture scale-depth alignment, motion consistency, and structural rigidity. These signals expose specific geometric failures across current models that perceptual quality metrics overlook. A sympathetic reader would care because the method supplies an objective diagnostic for treating video generators as physical world models rather than just image synthesizers. The accompanying PDI-Dataset stresses these constraints across varied scenarios to enable systematic comparison.

Core claim

Given a generated video clip, object-centric observations are obtained via segmentation and point tracking, then lifted to 3D world-space coordinates via monocular reconstruction; a set of projective-geometry residuals is computed to quantify three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. Across state-of-the-art generators this index reveals consistent geometry-specific failure modes invisible to common perceptual metrics and supplies a diagnostic signal for progress toward physically grounded video generation.

What carries the argument

The Perspective Distortion Index (PDI), which aggregates projective-geometry residuals computed on 3D world coordinates lifted from segmented and tracked points to measure scale-depth alignment, motion consistency, and structural rigidity.
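The abstract fixes the pipeline stages (segment, track, lift, score) but not the residual formulas themselves. The following numpy sketch illustrates what residuals of each flavor could look like; the array shapes, function names, and formulas are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def scale_depth_residual(heights_px: np.ndarray, depths: np.ndarray) -> float:
    """Under a pinhole camera, apparent size times depth stays constant for a
    rigid object; drift in this product signals scale-depth misalignment.
    heights_px: (T,) apparent object height per frame; depths: (T, N)."""
    size_depth = heights_px * depths.mean(axis=1)            # (T,)
    return float(np.std(size_depth) / (np.mean(size_depth) + 1e-8))

def motion_consistency_residual(points_3d: np.ndarray) -> float:
    """Mean second difference of the object centroid in world space, a crude
    proxy for erratic 3D motion. points_3d: (T, N, 3) lifted tracks."""
    centroid = points_3d.mean(axis=1)                        # (T, 3)
    accel = np.diff(centroid, n=2, axis=0)                   # (T-2, 3)
    return float(np.linalg.norm(accel, axis=1).mean())

def rigidity_residual(points_3d: np.ndarray) -> float:
    """Pairwise 3D distances on a rigid body should be time-invariant;
    report the mean coefficient of variation over point pairs."""
    _, n, _ = points_3d.shape
    i, j = np.triu_indices(n, k=1)
    d = np.linalg.norm(points_3d[:, i] - points_3d[:, j], axis=-1)  # (T, P)
    return float((d.std(axis=0) / (d.mean(axis=0) + 1e-8)).mean())
```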

If this is right

  • Video generators can be ranked and improved by targeting measurable failures in scale consistency, motion trajectories, and rigidity instead of relying solely on visual appeal.
  • Training loops gain an objective gradient signal for enforcing projective constraints that current perceptual losses do not provide (a differentiable sketch follows this list).
  • Evaluation of implicit world models shifts from subjective human ratings to repeatable 3D residual measurements across controlled datasets.
  • Models that reduce PDI scores on the benchmark are expected to produce outputs more suitable for downstream tasks requiring spatial reasoning.
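On the second bullet above: if a rigidity residual is computed with differentiable operations on the lifted points, it could in principle double as an auxiliary training penalty. A minimal PyTorch sketch of that idea, assuming a differentiable (T, N, 3) point tensor; this is not something the paper provides.

```python
import torch

def rigidity_loss(points_3d: torch.Tensor) -> torch.Tensor:
    """Differentiable rigidity penalty: coefficient-of-variation-style spread
    of pairwise 3D distances over time. points_3d: (T, N, 3) tensor that is
    differentiable w.r.t. whatever produced the 3D lift."""
    _, n, _ = points_3d.shape
    i, j = torch.triu_indices(n, n, offset=1)
    d = (points_3d[:, i] - points_3d[:, j]).norm(dim=-1)   # (T, P)
    return (d.var(dim=0) / (d.mean(dim=0) ** 2 + 1e-8)).mean()
```

Whether gradients can be propagated through a full segmentation-tracking-reconstruction stack in practice is a separate question the paper does not address.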

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If PDI scores improve over time while perceptual metrics plateau, the field may be making genuine progress on physical plausibility even when human raters notice little change.
  • PDI could be extended to multi-view or stereo video inputs to cross-validate the monocular reconstruction step and reduce its influence on the final score.
  • Combining PDI with existing 2D metrics might yield a composite benchmark that better predicts performance in robotics simulation or planning applications.

Load-bearing premise

Monocular 3D reconstruction from the generated video produces accurate enough world-space coordinates to reveal the generator's own geometric errors rather than injecting reconstruction artifacts.

What would settle it

Generate videos with deliberately perfect 3D geometry using known camera paths and rigid objects, run the full PDI pipeline including monocular lift, and verify whether the index scores remain near zero; persistently high scores on perfect inputs would falsify the claim that PDI isolates generator errors.
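A toy version of the arithmetic half of that test, with the monocular lift bypassed by feeding ground-truth 3D points directly into a rigidity-style residual (the real experiment would run the full pipeline, reconstruction included); all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 30, 12
body = rng.normal(size=(N, 3))                             # fixed rigid shape
traj = np.linspace([0.0, 0.0, 5.0], [2.0, 0.0, 9.0], T)   # constant-velocity path
points_3d = body[None, :, :] + traj[:, None, :]            # (T, N, 3), perfectly rigid

i, j = np.triu_indices(N, k=1)
d = np.linalg.norm(points_3d[:, i] - points_3d[:, j], axis=-1)  # pairwise distances per frame
residual = float((d.std(axis=0) / d.mean(axis=0)).mean())
print(residual)            # ~0 up to float error
assert residual < 1e-10    # perfect geometry should score near zero
```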

Original abstract

Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic for geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world models. Our code and dataset can be found at https://pdi-bench.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PDI-Bench, a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, it applies segmentation and point tracking (SAM 2, MegaSaM, CoTracker3), lifts observations to 3D world-space coordinates via monocular reconstruction, and computes projective-geometry residuals across three dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. It also releases PDI-Dataset covering diverse scenarios and reports that, across state-of-the-art video generators, PDI exposes geometry-specific failure modes not captured by common perceptual metrics, offering a diagnostic signal for physically grounded video generation.

Significance. If the residuals can be shown to be dominated by generator-induced geometric errors rather than upstream reconstruction artifacts, PDI-Bench would supply an objective, geometry-specific complement to existing perceptual and human-judgment metrics, directly supporting evaluation of video models as implicit world models.

major comments (2)
  1. [Abstract] The central claim that PDI residuals diagnose generator failures requires evidence that monocular lifting (MegaSaM) produces sufficiently accurate 3D coordinates on generated video. No quantitative validation is supplied (e.g., reconstruction error on synthetic ground-truth video, an ablation swapping the reconstructor, or correlation with known geometric perturbations), which leaves open the possibility that the residuals are confounded by reconstruction priors reacting to the inconsistent lighting, texture, or motion patterns typical of generated content.
  2. [Methods] The three residual definitions (scale-depth alignment, 3D motion consistency, 3D structural rigidity) are derived directly from projective geometry applied to tracked points. Without an explicit isolation experiment or ground-truth comparison, it remains unclear whether the reported failure modes originate in the generator or are artifacts of the monocular pipeline.
minor comments (1)
  1. [Abstract] The abstract states that code and dataset are available at https://pdi-bench.github.io/; the manuscript should include a brief reproducibility checklist (exact versions of SAM 2, MegaSaM, CoTracker3, and any post-processing steps) to allow independent verification of the residual computations.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need to validate the monocular reconstruction step and isolate generator effects in PDI-Bench. We address each major comment below and will incorporate additional experiments and clarifications in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central claim that PDI residuals diagnose generator failures requires evidence that monocular lifting (MegaSaM) produces sufficiently accurate 3D coordinates on generated video. No quantitative validation is supplied (e.g., reconstruction error on synthetic ground-truth video, an ablation swapping the reconstructor, or correlation with known geometric perturbations), which leaves open the possibility that the residuals are confounded by reconstruction priors reacting to the inconsistent lighting, texture, or motion patterns typical of generated content.

    Authors: We agree that explicit validation of the monocular lifting on generated content is essential to support the central claim. While the manuscript employs established state-of-the-art methods (MegaSaM for reconstruction alongside SAM 2 and CoTracker3), we acknowledge the absence of dedicated quantitative checks in the current version. In the revision we will add a new validation subsection that (i) measures reconstruction error on synthetic ground-truth videos with known 3D geometry, (ii) performs an ablation swapping the reconstructor, and (iii) correlates PDI residuals against controlled geometric perturbations injected into otherwise consistent clips (a toy sketch of this check follows these responses). These experiments will demonstrate that the reported residuals are dominated by generator-induced inconsistencies rather than upstream reconstruction artifacts. revision: yes

  2. Referee: [Methods] The three residual definitions (scale-depth alignment, 3D motion consistency, 3D structural rigidity) are derived directly from projective geometry applied to tracked points. Without an explicit isolation experiment or ground-truth comparison, it remains unclear whether the reported failure modes originate in the generator or are artifacts of the monocular pipeline.

    Authors: The three residual definitions follow directly from projective geometry and are therefore independent of any particular reconstruction implementation. Nevertheless, we recognize the value of explicit isolation. In the revised manuscript we will include ground-truth comparison experiments using rendered videos that provide perfect 3D structure and motion; PDI scores will be computed both on the original renders and on versions with controlled generator-like perturbations. We will also report results across multiple reconstructors and trackers to confirm that the observed failure modes persist and are attributable to the video generators rather than the analysis pipeline. revision: yes
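As a toy illustration of the perturbation check promised in response 1 (point iii), one can verify on synthetic rigid tracks that a rigidity-style residual responds monotonically to injected geometric error; every name and formula below is an assumption for illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 40, 10
base = rng.normal(size=(1, N, 3)) + np.zeros((T, 1, 3))   # perfectly rigid tracks (T, N, 3)
i, j = np.triu_indices(N, k=1)

def rigidity(pts: np.ndarray) -> float:
    d = np.linalg.norm(pts[:, i] - pts[:, j], axis=-1)
    return float((d.std(axis=0) / d.mean(axis=0)).mean())

for sigma in (0.0, 0.01, 0.05, 0.1):
    perturbed = base + rng.normal(scale=sigma, size=base.shape)
    print(f"sigma={sigma:.2f}  rigidity residual={rigidity(perturbed):.4f}")
# A monotone response in sigma is the desired behavior; a flat curve would
# mean the residual cannot separate injected geometric error from noise.
```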

Circularity Check

0 steps flagged

No significant circularity in PDI derivation

full rationale

The paper defines PDI-Bench by lifting tracked points from generated video via external monocular reconstruction (MegaSaM, SAM 2, CoTracker3), then computing direct projective-geometry residuals on scale-depth alignment, 3D motion consistency, and structural rigidity. These residuals follow from standard projective constraints applied to the lifted coordinates; no equations, parameters, or self-citations reduce the reported values to quantities fitted on the same evaluation videos. The central claim therefore rests on an independent geometric calculation rather than a tautological re-expression of its inputs. This is the expected non-circular outcome for a metric constructed from first-principles geometry.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; the framework assumes that off-the-shelf segmentation (SAM 2), tracking (CoTracker3), and monocular reconstruction (MegaSaM) produce 3D coordinates accurate enough to expose generator failures. No free parameters or invented entities are mentioned.

axioms (1)
  • domain assumption: Monocular depth and point-tracking tools yield 3D world coordinates accurate enough for measuring geometric inconsistency.
    Invoked when lifting 2D observations to 3D and computing projective residuals.

pith-pipeline@v0.9.0 · 5509 in / 1291 out tokens · 46999 ms · 2026-05-15T03:06:28.592571+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 10 internal anchors

  1. K. Allen, C. Doersch, G. Zhou, M. Suhail, D. Driess, I. Rocco, Y. Rubanova, T. Kipf, M. S. M. Sajjadi, K. Murphy, J. Carreira, and S. van Steenkiste. Direct motion models for assessing generated videos. https://arxiv.org/abs/2505.00209
  2. M. Asim, C. Wewer, T. Wimmer, B. Schiele, and J. E. Lenssen. MEt3R: Measuring multi-view consistency in generated images, 2026. https://arxiv.org/abs/2501.06336
  3. H. Bansal, Z. Lin, T. Xie, Z. Zong, M. Yarom, Y. Bitton, C. Jiang, Y. Sun, K.-W. Chang, and A. Grover. VideoPhy: Evaluating physical commonsense for video generation, 2024. https://arxiv.org/abs/2406.03520
  4. A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, V. Jampani, and R. Rombach. Stable Video Diffusion: Scaling latent video diffusion models to large datasets, 2023. https://arxiv.org/abs/2311.15127
  5. ByteDance. Doubao: A family of large language models. https://www.volcengine.com/product/doubao, 2026. Accessed 2026-05-06.
  6. ByteDance. Seedance 2.0 Fast: High-efficiency video generation foundation model. https://www.doubao.com/, 2026. Accessed 2026-04-19.
  7. W. Chow, J. Mao, B. Li, D. Seita, V. Guizilini, and Y. Wang. PhysBench: Benchmarking and enhancing vision-language models for physical world understanding, 2025. https://arxiv.org/abs/2501.16411
  8. H. Duan, H.-X. Yu, S. Chen, L. Fei-Fei, and J. Wu. WorldScore: A unified evaluation benchmark for world generation, 2025. https://arxiv.org/abs/2504.00983
  9. Google. Flow: Where the next wave of storytelling happens. https://labs.google/fx/tools/flow, 2026. Accessed 2026-03-04.
  10. J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models, 2020. https://arxiv.org/abs/2006.11239
  11. X. Huang, Z. Li, G. He, M. Zhou, and E. Shechtman. Self Forcing: Bridging the train-test gap in autoregressive video diffusion, 2025. https://arxiv.org/abs/2506.08009
  12. Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit, Y. Wang, X. Chen, L. Wang, D. Lin, Y. Qiao, and Z. Liu. VBench: Comprehensive benchmark suite for video generative models, 2023. https://arxiv.org/abs/2311.17982
  13. N. Karaev, I. Makarov, J. Wang, N. Neverova, A. Vedaldi, and C. Rupprecht. CoTracker3: Simpler and better point tracking by pseudo-labelling real videos, 2024. https://arxiv.org/abs/2410.11831
  14. W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, K. Wu, Q. Lin, J. Yuan, Y. Long, A. Wang, A. Wang, C. Li, D. Huang, F. Yang, H. Tan, H. Wang, J. Song, J. Bai, J. Wu, J. Xue, J. Wang, K. Wang, M. Liu, P. Li, S. Li, W. Wang, W. Yu, X. Deng, Y. Li, Y. Chen, Y. Cui, Y. Peng, Z. Yu, Z. He, Z. Xu, Z. Zhou, Z. Xu, Y. Tao... https://arxiv.org/abs/2412.03603
  15. D. Li, Y. Fang, Y. Chen, S. Yang, S. Cao, J. Wong, M. Luo, X. Wang, H. Yin, J. E. Gonzalez, I. Stoica, S. Han, and Y. Lu. WorldModelBench: Judging video generation models as world models, 2025. https://arxiv.org/abs/2502.20694
  16. Z. Li, R. Tucker, F. Cole, Q. Wang, L. Jin, V. Ye, A. Kanazawa, A. Holynski, and N. Snavely. MegaSaM: Accurate, fast, and robust structure and motion from casual dynamic videos, 2024. https://arxiv.org/abs/2412.04463
  17. Y. Liu, K. Zhang, Y. Li, Z. Yan, C. Gao, R. Chen, Z. Yuan, Y. Huang, H. Sun, J. Gao, L. He, and L. Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models. https://arxiv.org/abs/2402.17177
  18. F. Meng, J. Liao, X. Tan, W. Shao, Q. Lu, K. Zhang, Y. Cheng, D. Li, Y. Qiao, and P. Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation, 2024. https://arxiv.org/abs/2410.05363
  19. OpenAI. Sora: Creating video from text. https://openai.com/sora, 2025. Accessed 2026-03-20.
  20. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision, 2021. https://arxiv.org/abs/2103.00020
  21. N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer. SAM 2: Segment anything in images and videos, 2024. https://arxiv.org/abs/2408.00714
  22. T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs, 2016. https://arxiv.org/abs/1606.03498
  23. K. Sun, K. Huang, X. Liu, Y. Wu, Z. Xu, Z. Li, and X. Liu. T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation, 2025. https://arxiv.org/abs/2407.14505
  24. T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Towards accurate generative models of video: A new metric & challenges, 2019. https://arxiv.org/abs/1812.01717
  25. R. Upadhyay, H. Zhang, J. Solomon, A. Agrawal, P. Boreddy, S. S. Narayana, Y. Ba, A. Wong, C. M. de Melo, and A. Kadambi. WorldBench: Disambiguating physics for diagnostic evaluation of world models, 2026. https://arxiv.org/abs/2601.21282
  26. T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, J. Wang, J. Zhang, J. Zhou, J. Wang, J. Chen, K. Zhu, K. Zhao, K. Yan, L. Huang, M. Feng, N. Zhang, P. Li, P. Wu, R. Chu, R. Feng, S. Zhang, S. Sun, T. Fang, T. Wang, T. Gui, T. Weng, T. Shen, W. Lin, W. Wang, W. Wang, W. Zhou, W. Wang, W. Shen, W. Yu, X. Shi, X....
  27. Wan-Video. Wan2.2: Open and advanced large-scale video generative models. https://github.com/Wan-Video/Wan2.2, 2025. GitHub repository.
  28. B. Xiao, H. Wu, W. Xu, X. Dai, H. Hu, Y. Lu, M. Zeng, C. Liu, and L. Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks, 2023. https://arxiv.org/abs/2311.06242
  29. Zhipu AI. CogVideoX-3: Text-to-video diffusion models. https://chatglm.cn/video, 2026. Accessed 2026-04-18.

Appendix excerpts (A. Additional Experimental Details)

A.1 PDI-Dataset construction. The PDI-Dataset consists of 183 video sequences in total, partitioned into real-world and synthetic subsets. Real-world sequences: the real-world portion of PDI-Dataset contains 15 sh... All synthetic videos in the benchmark reflect the baseline commercial performance available to end-users at the time of evaluation; the Sora samples were generated using the $20 monthly consumer subscription rather than the enterprise API. The 28 text prompts... The Table 4 prompt listing spans the categories Dynamic Tracking, Biological Motion, Curved Motion, Partial Occlusion, and physics-perfect controls; a representative prompt: "A handheld following shot of a red vintage car driving away on a straight desert highway, harsh noon light and heat haze on the horizon, subtle shake and lateral drift." [Full prompt table omitted.]

Reconstruction-aware weighting. The final PDI score is synthesized as a weighted sum of three orthogonal physical residuals:

$$\mathrm{PDI} = w_1 \cdot \mathrm{RMSE}(\epsilon_{\text{scale}}) + w_2 \cdot \mathrm{RMSE}(\epsilon_{\text{traj}}) + w_3 \cdot \epsilon_{\text{rigidity}}, \qquad \sum_i w_i = 1. \quad (11)$$

3D pairwise rigidity (primary). World-space points $q^n_t$ are sampled from MegaSaM pointmaps at CoTracker locations. Anchor pairs are selected at $t = 0$ by triple filtering: (i) visibility filtering, (ii) depth-gradient reliability filtering, and (iii) pair scoring that favors both large 3D separation and interior-region reliability (distance to the mask boundary). ...

3D height stability (fallback when Strategy 1 is not entered). If 3D points are valid but Strategy 1 is unavailable at the dispatcher level, per-frame 3D object height is computed from the foreground $y$-span, $h^{3D}_t = P_{95}(y_t) - P_{5}(y_t)$, and the residual is the coefficient of variation:

$$\epsilon^{(2)}_{\text{rigid}} = \frac{\mathrm{std}\big(\{h^{3D}_t\}_{t=1}^{T}\big)}{\mathrm{mean}\big(\{h^{3D}_t\}_{t=1}^{T}\big) + \epsilon}.$$

2D pairwise consistency (degraded fallback). When 3D evidence is unavailable, 2D CoTracker pairwise distance ratios are used:

$$r^{2D}_{ij}(t) = \frac{d_{ij}(t)}{d_{ij}(0)}, \qquad \rho^{2D}_t = \frac{\mathrm{std}\big(\{r^{2D}_{ij}(t)\}\big)}{\mathrm{mean}\big(\{r^{2D}_{ij}(t)\}\big) + \epsilon}, \qquad \epsilon^{(3)}_{\text{rigid}} = \frac{1}{T}\sum_{t=1}^{T} \rho^{2D}_t.$$

The rigidity component used by PDI is $\epsilon_{\text{rigid}} = \epsilon^{(1)}_{\text{rigid}}$ if Strategy 1 is sel...
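To make the recovered scoring logic concrete, here is a minimal sketch of the Eq. (11) synthesis and the three-strategy rigidity dispatch; the dispatch order follows the recovered text, while the uniform default weights and array conventions are assumptions for illustration.

```python
import numpy as np

def pdi_score(eps_scale, eps_traj, eps_rigid, w=(1/3, 1/3, 1/3)):
    """Eq. (11): PDI = w1*RMSE(eps_scale) + w2*RMSE(eps_traj) + w3*eps_rigid,
    with sum(w) = 1. The uniform default weights here are an assumption."""
    rmse = lambda e: float(np.sqrt(np.mean(np.square(e))))
    return w[0] * rmse(eps_scale) + w[1] * rmse(eps_traj) + w[2] * float(eps_rigid)

def rigidity_with_fallbacks(pts3d=None, heights3d=None, ratios2d=None, eps=1e-8):
    """Dispatch order from the recovered text: (1) 3D pairwise rigidity,
    (2) 3D height stability, (3) 2D pairwise distance ratios."""
    if pts3d is not None:                      # strategy 1 (primary), (T, N, 3)
        _, n, _ = pts3d.shape
        i, j = np.triu_indices(n, k=1)
        d = np.linalg.norm(pts3d[:, i] - pts3d[:, j], axis=-1)
        return float((d.std(axis=0) / (d.mean(axis=0) + eps)).mean())
    if heights3d is not None:                  # strategy 2: h_t = P95(y_t) - P5(y_t)
        h = np.asarray(heights3d, dtype=float)
        return float(h.std() / (h.mean() + eps))
    r = np.asarray(ratios2d, dtype=float)      # strategy 3: (T, P) ratios r_ij(t)
    rho = r.std(axis=1) / (r.mean(axis=1) + eps)
    return float(rho.mean())
```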