How Far is Video Generation from World Model: A Physical Law Perspective

Bingyi Kang; Gao Huang; Jiashi Feng; Kaixin Wang; Rui Lu; Yang Yue; Yang Zhao; Zhijie Lin

arxiv: 2411.02385 · v2 · pith:OXYGDVHQnew · submitted 2024-11-04 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-20 11:08 UTC · model grok-4.3

keywords video generation · world models · physical laws · generalization · diffusion models · object collisions · simulation testbed

The pith

Video generation models fail to abstract physical laws, relying on case-based mimicry of training examples instead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether video generation models discover fundamental physical laws from visual data alone without human priors. It introduces a 2D simulation of object movements and collisions governed by classical mechanics to create unlimited deterministic videos. Models are trained to predict future frames and evaluated on in-distribution, out-of-distribution, and combinatorial generalization tasks. Results show perfect performance inside the training range, scaling gains for combining known cases, but total failure when physical parameters like velocity or size fall outside that range. The models instead reference the closest training example, with a consistent priority order of color, size, velocity, and shape.

Core claim

Diffusion-based video generation models exhibit case-based generalization by mimicking the closest training example rather than abstracting general physical rules, as shown by their inability to extrapolate correctly when object movement and collision parameters lie outside the training distribution.

What carries the argument

A 2D simulation testbed that generates unlimited videos deterministically governed by one or more classical mechanics laws, enabling quantitative checks of whether predicted videos obey those laws.
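To make the idea of a deterministic testbed concrete, here is a minimal sketch, not the authors' simulator: two balls moving on a line with a perfectly elastic collision. The `Ball` fields, the one-dimensional setup, the one-frame time step, and the function names are all assumptions made for this illustration; the paper's actual equations and rendering are not reproduced here.

```python
from dataclasses import dataclass, replace


@dataclass
class Ball:
    x: float  # centre position
    v: float  # velocity, in units per frame
    r: float  # radius
    m: float  # mass


def elastic_collision(m1, v1, m2, v2):
    """Post-collision velocities for a 1D perfectly elastic collision.

    Both momentum and kinetic energy are conserved.
    """
    total = m1 + m2
    u1 = ((m1 - m2) * v1 + 2 * m2 * v2) / total
    u2 = ((m2 - m1) * v2 + 2 * m1 * v1) / total
    return u1, u2


def simulate(a, b, n_frames):
    """Return a list of (x_a, x_b) positions, one pair per frame.

    There is no randomness, so identical inputs always give identical
    output. Contact is resolved at frame granularity (a simplification).
    """
    a, b = replace(a), replace(b)  # copy so the caller's objects are untouched
    frames = []
    for _ in range(n_frames):
        frames.append((a.x, b.x))
        a.x += a.v
        b.x += b.v
        touching = abs(b.x - a.x) <= a.r + b.r
        closing = (b.v - a.v) * (b.x - a.x) < 0
        if touching and closing:
            a.v, b.v = elastic_collision(a.m, a.v, b.m, b.v)
    return frames
```

Because every frame is a pure function of the initial state, any generated video can be scored against the trajectory this kind of simulator produces for the same starting conditions.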

If this is right

Models achieve perfect in-distribution prediction but cannot handle new physical parameter values such as unseen speeds or sizes.
Combinatorial generalization improves with scale but remains limited to recombinations of seen elements.
When referencing training data for new cases, models consistently prioritize color first, followed by size, velocity, and shape.
Scaling alone does not produce robustness to nuances or correct extrapolation on truly unseen physical scenarios.
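The difference between combinatorial and out-of-distribution evaluation can be sketched as a data split. In the hypothetical example below, each held-out combination is new, but every individual attribute value in it still appears somewhere in training. The attribute names (`colors`, `shapes`) and the function are assumptions for illustration, not the paper's actual split.

```python
from itertools import product


def combinatorial_split(colors, shapes, held_out):
    """Split all (color, shape) pairs into train and test sets.

    Test pairs are unseen as combinations, yet each of their individual
    values must be seen in training; otherwise the case would be
    out-of-distribution rather than combinatorial.
    """
    test = set(held_out)
    train = set(product(colors, shapes)) - test
    seen_colors = {c for c, _ in train}
    seen_shapes = {s for _, s in train}
    for c, s in test:
        if c not in seen_colors or s not in seen_shapes:
            raise ValueError(f"{(c, s)} contains a value never seen in training")
    return sorted(train), sorted(test)
```

An out-of-distribution test would instead use a value, such as a speed or size, that lies outside everything in the training set.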

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectures that inject explicit physical constraints or symbolic rules may be required to move beyond case-based copying.
The same case-based limitation could appear when these models generate videos of real-world scenes with novel dynamics.
Hybrid systems combining visual generation with a separate physics engine could serve as a testable next step.

Load-bearing premise

That failure to extrapolate to out-of-distribution physical parameters demonstrates absence of law abstraction rather than a limitation of the particular diffusion architecture or training objective used.

What would settle it

Train a model on collisions with velocities restricted to 1-10 units per frame, then check whether its predictions at velocity 15 match the true physics trajectory or instead copy a specific training example with similar color or size.
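The check described above can be sketched in a few lines. This is an illustrative sketch only: it assumes object centre positions have already been extracted from the generated frames, and the function names, the tolerance `tol`, and the three labels are hypothetical choices rather than the paper's evaluation protocol.

```python
def estimate_velocity(positions):
    """Mean per-frame displacement from a sequence of centre positions."""
    steps = [b - a for a, b in zip(positions, positions[1:])]
    return sum(steps) / len(steps)


def diagnose(pred_positions, true_velocity, train_velocities, tol=0.5):
    """Classify a predicted trajectory for an out-of-range velocity.

    Returns a (label, estimated_velocity) pair, where the label says whether
    the prediction tracks the true physics, snaps to a velocity seen during
    training, or does neither.
    """
    v_hat = estimate_velocity(pred_positions)
    nearest_train = min(train_velocities, key=lambda v: abs(v - v_hat))
    if abs(v_hat - true_velocity) <= tol:
        return "follows physics", v_hat
    if abs(v_hat - nearest_train) <= tol:
        return "copies training case", v_hat
    return "neither", v_hat


# Training velocities 1 to 10 units per frame; the test velocity is 15.
train_velocities = range(1, 11)
label, v_hat = diagnose([0, 10, 20, 30], true_velocity=15,
                        train_velocities=train_velocities)
```

In this usage example the predicted object moves 10 units per frame, so the label is "copies training case"; a model that had abstracted the law would produce positions that advance by 15.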

Original abstract

OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such laws purely from visual data without human priors can be questioned. A world model learning the true law should give predictions robust to nuances and correctly extrapolate on unseen scenarios. In this work, we evaluate across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions to generate videos deterministically governed by one or more classical mechanics laws. This provides an unlimited supply of data for large-scale experimentation and enables quantitative evaluation of whether the generated videos adhere to physical laws. We trained diffusion-based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case-based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color > size > velocity > shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws, despite its role in Sora's broader success. See our project page at https://phyworld.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Diffusion video models on this 2D physics testbed generalize in-distribution and show scaling in combinatorial cases but fail OOD by mimicking nearest training examples with color prioritized over velocity and shape.

The main thing to know is that these models do not abstract physical laws from video; they copy the closest training case and show a consistent cue order of color first, then size, velocity, and shape last, with clean failure on out-of-distribution parameter shifts even when scaling improves other regimes. The paper introduces a deterministic 2D simulator for object motion and collisions that supplies unlimited ground-truth videos, which lets them run quantitative checks on whether generated frames follow mechanics rules. That testbed and the measured case-based behavior plus cue priorities are the concrete new pieces; earlier video evaluations did not isolate these patterns this way. The trends line up across the three scenarios they test, and the simulation setup keeps the evaluation direct and reproducible without fitted parameters.

The soft spot is that every experiment uses standard diffusion models with a denoising objective. Diffusion models are known for local interpolation in latent space, so the OOD collapse and nearest-example copying could be specific to that family rather than a general limit on scaling for any video generator. Without results from autoregressive or flow-based alternatives, the claim that scaling alone is insufficient stays tied to this architecture. The abstract leaves out exact model sizes and statistical details, but the overall design looks solid enough that the full paper should clarify them.

This is useful for groups building video world models or testing physical generalization. A reader who wants a reproducible benchmark for law-like behavior in generative video will find the testbed and failure modes worth examining. It deserves a serious referee because the simulation is independent and the empirical observations are falsifiable. I would send it for review and ask the authors to test at least one non-diffusion video model to separate architecture limits from scaling limits.

Referee Report

2 major / 2 minor

Summary. The paper introduces a deterministic 2D physics simulation testbed to generate videos governed by classical mechanics laws and uses it to evaluate diffusion-based video generation models on in-distribution, out-of-distribution, and combinatorial generalization tasks. The authors report perfect in-distribution performance, observable scaling trends for combinatorial cases, but consistent failure to extrapolate in OOD scenarios. They interpret this as evidence that models rely on case-based mimicking of nearest training examples (with observed priority color > size > velocity > shape) rather than abstracting general physical laws, concluding that scaling alone is insufficient for video generation models to serve as world models.

Significance. If the core findings hold, this work supplies a useful controlled benchmark with unlimited data and quantitative metrics for testing physical law adherence in generative models. The deterministic simulator and direct comparison to held-out videos are strengths that allow clear falsification of law-learning claims. The results would help clarify the gap between current video generation success (e.g., Sora) and true world-model capabilities, pointing toward needed advances in architecture or objectives.

major comments (2)

[Abstract and OOD results section] The central claim that scaling alone is insufficient for video generation models to uncover physical laws interprets OOD extrapolation failure as absence of law abstraction. All experiments, however, are restricted to diffusion-based models trained with a standard denoising objective; without results from other families such as autoregressive transformers or flow-matching models, the evidence does not yet establish that the limitation is intrinsic to scaling or to video generation in general rather than to this specific generative process.
[Generalization mechanisms subsection] The reported case-based behavior and factor prioritization (color > size > velocity > shape) are load-bearing for the claim that models fail to abstract rules. The exact procedure for identifying the 'closest' training example (e.g., latent-space distance, feature matching, or pixel-level) and whether this ordering is stable across model scales or training seeds is not specified, making it difficult to rule out artifacts of the diffusion sampling process.

minor comments (2)

[Abstract] The abstract states 'measurable scaling behavior' and 'perfect generalization' but provides no numerical values, model sizes, or statistical tests; adding these in the results tables would allow readers to judge the strength of the trends.
[Testbed description] The 2D simulator is described as implementing 'one or more classical mechanics laws,' but the precise equations (e.g., collision resolution, friction model) and how they are rendered into video frames should be stated explicitly for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the value of our deterministic 2D physics testbed as a controlled benchmark. We address each major comment below and indicate the revisions we will incorporate.

Referee: [Abstract and OOD results section] The central claim that scaling alone is insufficient for video generation models to uncover physical laws interprets OOD extrapolation failure as absence of law abstraction. All experiments, however, are restricted to diffusion-based models trained with a standard denoising objective; without results from other families such as autoregressive transformers or flow-matching models, the evidence does not yet establish that the limitation is intrinsic to scaling or to video generation in general rather than to this specific generative process.

Authors: We agree that the experiments are limited to diffusion-based models trained with the standard denoising objective. This focus aligns with the dominant paradigm in contemporary video generation, including models such as Sora. Nevertheless, the referee correctly notes that this scope prevents a fully general claim about scaling across all video generation approaches. In the revised manuscript we will qualify the abstract and the OOD results section to state that the reported limitations apply to diffusion models under the denoising objective and to explicitly recommend evaluation on autoregressive transformers and flow-matching models as necessary future work. revision: yes

Referee: [Generalization mechanisms subsection] The reported case-based behavior and factor prioritization (color > size > velocity > shape) are load-bearing for the claim that models fail to abstract rules. The exact procedure for identifying the 'closest' training example (e.g., latent-space distance, feature matching, or pixel-level) and whether this ordering is stable across model scales or training seeds is not specified, making it difficult to rule out artifacts of the diffusion sampling process.

Authors: We appreciate the referee pointing out the missing procedural details. The closest training example was determined by a hybrid metric that first computes cosine similarity in the latent space of a pretrained video encoder and then refines the ranking with pixel-level L2 distance on the initial frames; the factor prioritization ordering was verified to remain stable across three model scales and five independent training seeds. We will insert a precise description of this procedure, including the metric definitions and stability checks, into the Generalization mechanisms subsection to eliminate ambiguity and to address possible sampling artifacts. revision: yes
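The two-stage procedure described in this response can be sketched as follows. This is an illustrative sketch only: the dictionary keys `latent` and `frame`, the `top_k` shortlist size, and the function names are hypothetical, and the embeddings are assumed to come from some pretrained video encoder that is not specified here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def l2(p, q):
    """Euclidean distance between two flattened frames of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def closest_training_example(query, train, top_k=3):
    """Find the nearest training example with a two-stage ranking.

    Stage 1 shortlists the top_k examples by cosine similarity of their
    latent embeddings. Stage 2 picks, within that shortlist, the example
    whose initial frame has the smallest pixel-level L2 distance.
    """
    shortlist = sorted(
        train,
        key=lambda ex: cosine(query["latent"], ex["latent"]),
        reverse=True,
    )[:top_k]
    return min(shortlist, key=lambda ex: l2(query["frame"], ex["frame"]))
```

Note that the result depends on `top_k`: a small shortlist lets the latent similarity dominate, while a large one lets the pixel distance decide, which is one reason the referee's request for exact metric definitions matters.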

Circularity Check

0 steps flagged

Empirical evaluation chain is self-contained with independent simulator and direct comparisons

The paper constructs an independent 2D deterministic simulator governed by classical mechanics to generate videos, trains diffusion models on subsets, and evaluates generalization via direct quantitative comparison to held-out videos in in-distribution, OOD, and combinatorial settings. No claimed predictions, derivations, or law abstractions reduce by construction to fitted parameters, self-citations, or ansatzes; the scaling observations and case-based generalization findings are experimental outcomes rather than tautological redefinitions of the inputs. The central claim follows from these falsifiable comparisons without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the 2D simulator provides a faithful proxy for the difficulty of learning physical laws from video and that out-of-distribution failure indicates lack of rule abstraction.

axioms (1)

domain assumption: The 2D simulation generates videos that are deterministically governed by classical mechanics without hidden variables or rendering artifacts.
Invoked when treating simulator output as ground-truth physical behavior.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel · echoes

the models fail to abstract general physical rules and instead exhibit 'case-based' generalization behavior, i.e., mimicking the closest training example

Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · echoes

scaling alone is insufficient for video generation models to uncover fundamental physical laws

Foundation.DiscretenessForcing continuous_no_isolated_zero_defect · echoes

when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color > size > velocity > shape

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Do generative video models understand physical principles?
cs.CV 2025-01 unverdicted novelty 8.0

Physics-IQ benchmark reveals that generative video models exhibit limited physical understanding unrelated to their visual quality.
Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls
cs.CV 2026-05 unverdicted novelty 7.0

Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency sup...
TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
cs.CV 2026-05 unverdicted novelty 7.0

TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
cs.CV 2026-05 unverdicted novelty 7.0

NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.
Do-Undo Bench: Reversibility for Action Understanding in Image Generation
cs.CV 2025-12 unverdicted novelty 7.0

Do-Undo Bench is a new evaluation task and dataset that forces models to simulate forward action effects and then undo them to measure genuine action understanding in image generation.
DreamGen: Unlocking Generalization in Robot Learning through Video World Models
cs.RO 2025-05 unverdicted novelty 7.0

DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperatio...
NEWTON: Agentic Planning for Physically Grounded Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.
Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos
cs.CV 2026-05 unverdicted novelty 6.0

MIGA introduces two-stage alignment to close train-inference gaps and dual consistency enhancement via self-reflection and long-range guidance to achieve SOTA temporal consistency in infinite-frame video generation on...
Video Models Can Reason with Verifiable Rewards
cs.CV 2026-05 unverdicted novelty 6.0

VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Ma...
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
cs.CV 2026-05 unverdicted novelty 6.0

The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.
ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models
cs.CV 2026-05 unverdicted novelty 6.0

ACWM-Phys benchmark shows action-conditioned world models generalize on simple geometric interactions but drop sharply on deformable contacts, high-dimensional control, and complex articulated motion, indicating relia...
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
cs.CV 2026-05 unverdicted novelty 6.0

NOVA represents scene states as INR weights for analytical rendering without decoders and achieves structural disentanglement of content and dynamics in video world models.
Learning to Theorize the World from Observation
cs.LG 2026-05 unverdicted novelty 6.0

NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.
TRAP: Tail-aware Ranking Attack for World-Model Planning
cs.LG 2026-05 unverdicted novelty 6.0

TRAP is a tail-aware ranking attack that plants a backdoor in world models so that a trigger causes the model to reorder a few critical imagined trajectories and redirect planning while preserving normal behavior on c...
Alice v1: Distillation-Enhanced Video Generation Surpassing Closed-Source Models
cs.GR 2026-04 unverdicted novelty 6.0

Alice v1 is an open video model that surpasses its teacher and closed-source systems like Veo3 and Sora2 in quality while running 7x faster through specialized distillation.
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
cs.CV 2026-04 unverdicted novelty 6.0

SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...
ProPhy: Progressive Physical Alignment for Dynamic World Simulation
cs.CV 2025-12 unverdicted novelty 6.0

ProPhy adds explicit physics-aware conditioning via semantic and refinement experts plus VLM knowledge transfer to produce more physically coherent dynamic videos than prior methods.
Video models are zero-shot learners and reasoners
cs.LG 2025-09 unverdicted novelty 6.0

Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
ReSim: Reliable World Simulation for Autonomous Driving
cs.CV 2025-06 unverdicted novelty 6.0

ReSim is a controllable video world model trained on heterogeneous real and simulated driving data that achieves higher fidelity and controllability for both expert and non-expert actions, plus a Video2Reward module f...
MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
cs.CV 2025-11 unverdicted novelty 5.0

MASS adds spatiotemporal motion signals and 3D grounding to VLMs and releases MASS-Bench, yielding physics-reasoning performance within 2% of Gemini-2.5-Flash after reinforcement fine-tuning.
Cloning Deterministic Worlds: The Critical Role of Latent Geometry in Long-Horizon World Models
cs.LG 2025-10 unverdicted novelty 5.0

GRWM uses temporal contrastive learning to geometrically regularize latent spaces in world models for high-fidelity cloning of deterministic 3D worlds.
Cosmos World Foundation Model Platform for Physical AI
cs.CV 2025-01 unverdicted novelty 3.0

The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 21 Pith papers · 8 internal anchors

[1]

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

15 Black, K., Nakamoto, M., Atreya, P., Walke, H., Finn, C., Kumar, A., and Levine, S. Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639, 2023. 9 Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y ., English, Z., V oleti, V ., Letts, A., et al. Stable video...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Language Models are Few-Shot Learners

1 Bouman, K. L., Xiao, B., Battaglia, P., and Freeman, W. T. Estimating the material properties of fabric from video. In Proceedings of the IEEE international conference on computer vision, pp. 1984–1991, 2013. 15 Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y ., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., and Rame...

work page internal anchor Pith review Pith/arXiv arXiv 1984
[3]

and Zisserman, A

15 Carreira, J. and Zisserman, A. Quo vadis, action recogni- tion? a new model and the kinetics dataset. In proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308, 2017. 6 de Silva, B. M., Higdon, D. M., Brunton, S. L., and Kutz, J. N. Discovery of physics from data: Universal laws and discrepancies. Frontiers in ar...

work page 2017
[4]

15 Diederik, P. K. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015. 16 Du, Y . and Kaelbling, L. Compositional generative model- ing: A single model is not all you need. arXiv preprint arXiv:2402.01103, 2024. 2 Gadre, S. Y ., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., Marten, R., Wortsman, M., Ghosh, D., Zhang, J., Or...

work page arXiv 2015
[5]

org/CorpusID:258352812

URL https://api.semanticscholar. org/CorpusID:258352812. 16 Gao, S., Yang, J., Chen, L., Chitta, K., Qiu, Y ., Geiger, A., Zhang, J., and Li, H. Vista: A generalizable driving world model with high fidelity and versatile controllabil- ity. arXiv preprint arXiv:2405.17398, 2024. 9 Girdhar, R., Gustafson, L., Adcock, A., and van der Maaten, L. Forward predi...

work page arXiv 2024
[6]

Imagen Video: High Definition Video Generation with Diffusion Models

9 Ho, J., Jain, A., and Abbeel, P. Denoising diffusion proba- bilistic models. Advances in neural information process- ing systems, 33:6840–6851, 2020. 15 Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. Imagen video: High definition video generation with diffusion models. arXiv ...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[7]

GAIA-1: A Generative World Model for Autonomous Driving

17 Hu, A., Russell, L., Yeo, H., Murez, Z., Fedoseev, G., Kendall, A., Shotton, J., and Corrado, G. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023. 1, 9 Hu, Y ., Tang, X., Yang, H., and Zhang, M. Case-based or rule-based: How do transformers do the math? ICML,

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Vbench: Compre- hensive benchmark suite for video generative models

2, 7, 22 Huang, Z., He, Y ., Yu, J., Zhang, F., Si, C., Jiang, Y ., Zhang, Y ., Wu, T., Jin, Q., Chanpaisit, N., et al. Vbench: Compre- hensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21807–21818, 2024. 9 Isola, P., Zhu, J.-Y ., Zhou, T., and Efros, A. A. Image-...

work page 2024
[9]

org/CorpusID:6200260

URL https://api.semanticscholar. org/CorpusID:6200260. 16 Jia, F., Mao, W., Liu, Y ., Zhao, Y ., Wen, Y ., Zhang, C., Zhang, X., and Wang, T. Adriver-i: A general world model for autonomous driving. arXiv preprint arXiv:2311.13549, 2023. 9 Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amod...

work page arXiv 2023
[10]

9 Kingma, D. P. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. 9 Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Hornung, R., Adam, H., Akbari, H., Alon, Y ., Birodkar, V ., et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023. 9 11 How Far is Video Generation from Wo...

work page internal anchor Pith review Pith/arXiv arXiv 2013
[11]

Progressive Distillation for Fast Sampling of Diffusion Models

9, 16 Riveland, R. and Pouget, A. Natural language instructions induce compositional generalization in networks of neu- rons. Nature Neuroscience, 27(5):988–999, 2024. 3, 19 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF con- ference on computer ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

, month = jun, year =

URL https://api.semanticscholar. org/CorpusID:258762479. 16 Wang, X., Zhang, X., Zhu, Y ., Guo, Y ., Yuan, X., Xiang, L., Wang, Z., Ding, G., Brady, D., Dai, Q., and Fang, L. Panda: A gigapixel-level human-centric video dataset. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3265–3275, 2020. doi: 10.1109/CVPR42600.2020....

work page doi:10.1109/cvpr42600.2020.00333 2020
[13]

Z., Ge, Y ., Wang, X., Lei, S

15 Wu, J. Z., Ge, Y ., Wang, X., Lei, S. W., Gu, Y ., Shi, Y ., Hsu, W., Shan, Y ., Qie, X., and Shou, M. Z. Tune-a- video: One-shot tuning of image diffusion models for text-to-video generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7623– 7633, 2023. 9 Xu, K., Zhang, M., Li, J., Du, S. S., Kawarabayashi, K.-i., an...

work page arXiv 2023
[14]

15 Xue, T., Chen, B., Wu, J., Wei, D., and Freeman, W. T. Video enhancement with task-oriented flow. International Journal of Computer Vision , pp. 1–20,

work page
[15]

Learning Interactive Real-World Simulators

URL https://api.semanticscholar. org/CorpusID:40412298. 16 Yang, J., Gao, S., Qiu, Y ., Chen, L., Li, T., Dai, B., Chitta, K., Wu, P., Zeng, J., Luo, P., et al. Generalized predictive model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14662–14672, 2024. 1 Yang, M., Du, Y ., Ghasemipour, ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

6 Zhang, S., Wang, J., Zhang, Y ., Zhao, K., Yuan, H., Qin, Z., Wang, X., Zhao, D., and Zhou, J. I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145, 2023a. 9 Zhang, Y ., Wei, Y ., Jiang, D., Zhang, X., Zuo, W., and Tian, Q. Controlvideo: Training-free controllable text-to-video generation. arXiv...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Can a video generation model learn physical laws?

and DataComp (Gadre et al., 2023) with high aesthetics and clarity to form a high-quality subset. As for the video dataset, we collect a high-quality subset from Vimeo-90K (Xue et al., 2017), Panda-70M (Wang et al., 2020) and HDVG (Wang et al., 2023). In the training process, we train the entire structure for 1M steps and only the random resized crop and ...

work page 2023
[18]

This has been a topic of debate in the video generation community

Addressing the Debate on Learning Physical Laws in Video Generation Models: Through systematic experiments, we provide a clear answer to whether physical laws can be learned by scaling video generation models. This has been a topic of debate in the video generation community. For instance, OpenAI’s Sora Technical Report (Brooks et al., 2024) suggests that...

work page 2024
[19]

Scaling Guidance for Combinatorial Generalization:We demonstrate the importance of combinatorial generalization and identify scaling laws for improving generalization in video generation models. Our findings highlight that increasing the diversity of combinations in the training data is more effective for achieving realistic physics than merely scaling th...

[20]

teleportation

Revealing the Generalization Mechanism and Understanding the Boundaries of Video Generation Models: We uncover how video generation models generalize, primarily relying on referencing similar training examples rather than learning underlying universal principles. This provides a deeper understanding of their limitations and biases in representing physical...

