arxiv: 2308.08089 · v1 · pith:UDEIT65Rnew · submitted 2023-08-16 · 💻 cs.CV

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

Pith reviewed 2026-05-20 12:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords video generation · diffusion models · trajectory control · fine-grained control · open-domain generation · multimodal conditioning · motion guidance

The pith

DragNUWA achieves fine-grained control in open-domain video generation by integrating text, image, and trajectory information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DragNUWA, a diffusion-based model for video generation that accepts text, image, and trajectory inputs together. Text supplies semantic guidance, the image fixes spatial layout, and the trajectory dictates motion paths. Earlier methods handled only one of these signals or worked only on simple datasets like Human3.6M, which restricted their use on real scenes and complex motions. DragNUWA adds a Trajectory Sampler to accept arbitrary curves on any image, Multiscale Fusion to blend control at different resolutions, and an Adaptive Training procedure to keep generated frames consistent with the inputs. A sympathetic reader would see this as a step toward video creation tools that let users direct content, placement, and movement with combined instructions.

Core claim

DragNUWA is an open-domain diffusion-based video generation model that simultaneously introduces text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. It resolves limited open-domain trajectory control by proposing a Trajectory Sampler to enable arbitrary trajectories, Multiscale Fusion to control trajectories at different granularities, and an Adaptive Training strategy to generate consistent videos that follow the trajectories.

What carries the argument

Trajectory Sampler for open-domain arbitrary trajectories, Multiscale Fusion for varying control granularities, and Adaptive Training for motion consistency, all combined with text and image conditioning in a diffusion video model.

If this is right

  • Videos can be created that follow user-specified arbitrary curved trajectories overlaid on any input image.
  • Semantic content from text, spatial details from the image, and temporal motion from the trajectory are controlled at the same time.
  • The model handles open-domain scenes rather than being limited to narrow datasets like Human3.6M.
  • Generated sequences maintain consistency with the trajectory across frames.
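
To make the idea of a user-drawn trajectory concrete, the sketch below turns a hand-drawn polyline into one position per video frame, plus the frame-to-frame displacements a motion-conditioned model might consume. This is an illustrative assumption only: the summary above does not say how DragNUWA encodes trajectories, and `resample_path` and `frame_displacements` are hypothetical helper names, not part of the paper or any released code.

```python
import math


def resample_path(points, num_frames):
    """Resample a polyline to num_frames points equally spaced by arc length.

    Hypothetical helper: assumes the drawn path should be traversed at
    constant speed over the clip, which the paper summary does not state.
    """
    if num_frames < 2 or len(points) < 2:
        raise ValueError("need at least 2 points and 2 frames")
    # Cumulative arc length at each input vertex.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    out = []
    seg = 0
    for k in range(num_frames):
        target = total * k / (num_frames - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(points) - 2 and cum[seg + 1] < target:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        t = 0.0 if seg_len == 0 else (target - cum[seg]) / seg_len
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


def frame_displacements(path):
    """Per-frame (dx, dy) offsets between consecutive path positions."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(path, path[1:])]


# An L-shaped stroke resampled to 5 frames.
stroke = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0)]
path = resample_path(stroke, 5)
steps = frame_displacements(path)
```

Constant-speed resampling is only one possible choice; a real interface could equally preserve the timing of the user's stroke.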

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multi-signal conditioning pattern could be tested on related tasks such as image animation or 3D scene generation.
  • Interactive interfaces might let users sketch paths directly on an image to define desired motion.
  • The method could reduce reliance on single-modality training data when scaling controllable generation systems.

Load-bearing premise

The Trajectory Sampler, Multiscale Fusion, and Adaptive Training can reliably produce consistent videos that follow arbitrary complex curved trajectories on open-domain images without motion artifacts or semantic drift.

What would settle it

Generate videos from complex curved trajectories drawn on diverse real-world open-domain images and measure whether the motion paths are followed accurately while content and semantics remain stable and free of visible artifacts.
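
One way such a test could be scored is sketched below: compare the requested trajectory with the positions actually observed in the generated video, frame by frame. The sketch assumes the observed positions come from some external point tracker or optical-flow estimate, which is not specified here; `mean_trajectory_error` is a hypothetical name, and the metric is not taken from the paper.

```python
import math


def mean_trajectory_error(target, tracked):
    """Mean Euclidean distance (in pixels) between target and tracked points.

    target, tracked: equal-length lists of (x, y), one entry per frame.
    Assumption: `tracked` is produced by a separate tracking step.
    """
    if not target or len(target) != len(tracked):
        raise ValueError("trajectories must be non-empty and equal length")
    total = sum(
        math.hypot(tx - px, ty - py)
        for (tx, ty), (px, py) in zip(target, tracked)
    )
    return total / len(target)


# Requested path vs. positions measured in a generated clip (made-up numbers).
requested = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
measured = [(0.0, 0.0), (1.0, 3.0), (2.0, 4.0)]
error = mean_trajectory_error(requested, measured)
```

A low error on its own would not settle the question, since content stability and artifacts would still need a separate check.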

Original abstract

Controllable video generation has gained significant attention in recent years. However, two main limitations persist: Firstly, most existing works focus on either text, image, or trajectory-based control, leading to an inability to achieve fine-grained control in videos. Secondly, trajectory control research is still in its early stages, with most experiments being conducted on simple datasets like Human3.6M. This constraint limits the models' capability to process open-domain images and effectively handle complex curved trajectories. In this paper, we propose DragNUWA, an open-domain diffusion-based video generation model. To tackle the issue of insufficient control granularity in existing works, we simultaneously introduce text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, we propose trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to control trajectories in different granularities, and an Adaptive Training (AT) strategy to generate consistent videos following trajectories. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation. The homepage link is https://www.microsoft.com/en-us/research/project/dragnuwa/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DragNUWA, an open-domain diffusion-based video generation model that integrates text, image, and trajectory inputs to enable fine-grained control over video content from semantic, spatial, and temporal perspectives. It proposes a Trajectory Sampler (TS) to support arbitrary trajectories on open-domain images, Multiscale Fusion (MF) to handle trajectories at varying granularities, and an Adaptive Training (AT) strategy to produce consistent videos that follow the specified trajectories. The central claim is that this combination yields superior performance in fine-grained controllable video generation compared to prior single-modality or limited-domain approaches, with experiments asserted to validate the effectiveness.

Significance. If the results hold, the work would advance multi-modal controllable video generation by addressing the limitations of single-modality control and restriction to simple datasets such as Human3.6M. Enabling open-domain handling of complex curved trajectories via the proposed TS, MF, and AT components could support more precise applications in animation and content creation. The structured decomposition of trajectory modeling provides a clear technical contribution to the diffusion conditioning literature.

major comments (2)
  1. [Experiments] Experiments section: the abstract asserts experimental validation and superior performance, yet no quantitative metrics (e.g., FID, FVD, or user-study scores), dataset details, or ablation results on TS/MF/AT are provided in the summary; without these the central claim of effectiveness rests on an unverified assertion and requires explicit tables comparing against baselines on open-domain data.
  2. [Method] Method section, description of Adaptive Training (AT): the strategy is presented as ensuring consistency and avoiding motion artifacts or semantic drift for arbitrary curved trajectories, but the concrete loss formulation, sampling schedule, or conditioning weight schedule is not specified; this leaves the load-bearing claim that AT reliably produces artifact-free output ungrounded in the provided equations or pseudocode.
minor comments (2)
  1. The homepage link is given but the manuscript does not include a direct pointer to the released code or model weights, which would aid reproducibility.
  2. [Method] Notation for the three conditioning modalities (text, image, trajectory) should be introduced with explicit symbols in the method overview to improve clarity when describing the fusion step.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. We address each major comment point by point below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the abstract asserts experimental validation and superior performance, yet no quantitative metrics (e.g., FID, FVD, or user-study scores), dataset details, or ablation results on TS/MF/AT are provided in the summary; without these the central claim of effectiveness rests on an unverified assertion and requires explicit tables comparing against baselines on open-domain data.

    Authors: We appreciate the referee highlighting the need for clearer quantitative support. The full manuscript contains experimental results including user studies demonstrating superior performance; however, to directly address this concern, we will add explicit tables in the revised version reporting quantitative metrics (such as FID and FVD where relevant), dataset details, and ablation studies isolating the contributions of TS, MF, and AT, with direct comparisons to baselines on open-domain data. revision: yes

  2. Referee: [Method] Method section, description of Adaptive Training (AT): the strategy is presented as ensuring consistency and avoiding motion artifacts or semantic drift for arbitrary curved trajectories, but the concrete loss formulation, sampling schedule, or conditioning weight schedule is not specified; this leaves the load-bearing claim that AT reliably produces artifact-free output ungrounded in the provided equations or pseudocode.

    Authors: We agree that the Adaptive Training description would benefit from greater specificity. In the revised manuscript, we will include the concrete loss formulation for AT, along with the sampling schedule and conditioning weight schedule. These additions will better substantiate the claims regarding consistency and the avoidance of motion artifacts or semantic drift for arbitrary trajectories. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes DragNUWA as a diffusion-based architecture that adds text, image, and trajectory conditioning, along with three explicitly defined new components (Trajectory Sampler for open-domain paths, Multiscale Fusion for granularity, and Adaptive Training for consistency). These elements are introduced to address stated limitations in prior work and are described directly in the method without any equations or claims that reduce the performance gains to a fitted parameter, self-definition, or self-citation chain. The derivation remains self-contained as a coherent extension of standard diffusion conditioning, with validation left to experiments rather than internal reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The proposal rests on standard diffusion-model conditioning assumptions plus three newly introduced modules whose effectiveness is asserted rather than derived from first principles.

free parameters (1)
  • diffusion conditioning weights for text/image/trajectory
    Standard learned or hand-tuned scalars that balance the three control signals during sampling.
axioms (1)
  • domain assumption · Diffusion models can be jointly conditioned on semantic, spatial, and temporal signals without destructive interference.
    Invoked when the paper states that simultaneous introduction of the three inputs yields fine-grained control.
invented entities (3)
  • Trajectory Sampler (TS) · no independent evidence
    purpose: Enable open-domain control of arbitrary trajectories
    New module introduced to sample points along user-drawn paths on complex scenes.
  • Multiscale Fusion (MF) · no independent evidence
    purpose: Control trajectories at different granularities
    New fusion mechanism to combine trajectory information across scales.
  • Adaptive Training (AT) strategy · no independent evidence
    purpose: Generate consistent videos following trajectories
    New training procedure claimed to enforce trajectory adherence.
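
The free parameter listed in the ledger above, per-signal conditioning weights, can be illustrated with a guidance-style combination of noise predictions. The sketch assumes a compositional form of classifier-free guidance, in which each condition's prediction is pushed away from the unconditional one by its own weight. Nothing in the source confirms that DragNUWA samples this way; `combine_guidance` and the example weights are hypothetical.

```python
def combine_guidance(eps_uncond, eps_cond, weights):
    """Return eps_uncond + sum_c w_c * (eps_c - eps_uncond), elementwise.

    eps_uncond: list of floats (unconditional noise prediction, flattened).
    eps_cond:   dict mapping a condition name to its noise prediction.
    weights:    dict mapping the same names to scalar guidance weights.
    Assumed form only; not taken from the paper.
    """
    out = list(eps_uncond)
    for name, eps in eps_cond.items():
        if len(eps) != len(eps_uncond):
            raise ValueError(f"shape mismatch for condition {name!r}")
        w = weights[name]
        for i, (e, u) in enumerate(zip(eps, eps_uncond)):
            out[i] += w * (e - u)
    return out


# Toy two-element predictions with made-up weights.
eps = combine_guidance(
    eps_uncond=[0.0, 1.0],
    eps_cond={"text": [1.0, 1.0], "trajectory": [0.0, 3.0]},
    weights={"text": 2.0, "trajectory": 0.5},
)
```

Under this assumed form, raising one weight strengthens that signal's pull on the sample, which is where interference between the three signals would show up if the ledger's axiom fails.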

pith-pipeline@v0.9.0 · 5782 in / 1277 out tokens · 38395 ms · 2026-05-20T12:57:38.194907+00:00 · methodology


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

    cs.CV 2026-05 unverdicted novelty 7.0

    Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency sup...

  2. Functionalization via Structure Completion and Motion Rectification

    cs.CV 2026-05 unverdicted novelty 7.0

    Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture wi...

  3. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 7.0

    R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.

  4. Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency

    cs.CV 2026-05 unverdicted novelty 7.0

    Eulerian adjacent-frame motion fields with bidirectional cycle consistency checks enable faster parallel training and fewer artifacts in diffusion model image animation compared to initial-frame Lagrangian guidance.

  5. Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency

    cs.CV 2026-05 unverdicted novelty 7.0

    Eulerian adjacent-frame motion guidance plus bidirectional geometric consistency improves training speed, temporal coherence, and artifact reduction in diffusion-based image animation.

  6. Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

    cs.CV 2026-04 unverdicted novelty 7.0

    Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.

  7. MoRight: Motion Control Done Right

    cs.CV 2026-04 unverdicted novelty 7.0

    MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...

  8. ASTRA: Let Arbitrary Subjects Transform in Video Editing

    cs.CV 2025-10 unverdicted novelty 7.0

    ASTRA is a plug-and-play training-free method for precise multi-subject video editing that uses prompt-guided multimodal alignment and prior-based mask retargeting to avoid attention dilution and boundary issues.

  9. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    cs.CV 2023-07 unverdicted novelty 7.0

    A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.

  10. ReactiveGWM: Steering NPC in Reactive Game World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.

  11. Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency

    cs.CV 2026-05 unverdicted novelty 6.0

    Eulerian adjacent-frame motion guidance plus bidirectional geometric consistency yields faster training and more coherent diffusion-based image animation than first-frame reference methods.

  12. Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing

    cs.RO 2026-05 unverdicted novelty 6.0

    A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from s...

  13. PhyCo: Learning Controllable Physical Priors for Generative Motion

    cs.CV 2026-04 unverdicted novelty 6.0

    PhyCo adds continuous physical control to video diffusion models via physics-supervised fine-tuning on a large simulation dataset and VLM-guided rewards, yielding measurable gains in physical realism on the Physics-IQ...

  14. DailyArt: Discovering Articulation from Single Static Images via Latent Dynamics

    cs.CV 2026-04 unverdicted novelty 6.0

    DailyArt recovers full joint parameters of articulated objects from a single static image by synthesizing an opened state and comparing discrepancies, supporting downstream part-level novel state synthesis.

  15. HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis

    cs.CV 2026-03 unverdicted novelty 6.0

    HVG-3D uses a 3D-aware diffusion architecture with ControlNet to synthesize high-fidelity hand-object interaction videos from 3D control signals, achieving state-of-the-art spatial fidelity and temporal coherence on t...

  16. Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models

    cs.CV 2025-11 unverdicted novelty 6.0

    A feed-forward video latent transformer that predicts time-varying 3D Gaussian primitives from one image to produce controllable 4D scenes with appearance, geometry, and motion.

  17. ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

    cs.CV 2024-09 unverdicted novelty 6.0

    ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.

  18. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

  19. VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    cs.CV 2023-10 unverdicted novelty 6.0

    Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.

  20. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 5.0

    R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.

  21. Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

    cs.CV 2026-04 unverdicted novelty 5.0

    Tora3 uses shared object trajectories as kinematic priors to jointly guide visual motion and acoustic events in audio-video generation, improving realism and synchronization.

  22. I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

    cs.CV 2023-11 unverdicted novelty 5.0

    I2VGen-XL applies cascaded diffusion models with a base stage for semantic preservation via hierarchical encoders and a refinement stage for detail and resolution, trained on 35 million text-video and 6 billion text-i...

  23. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.


    2019 , journal =

    Fusion of Detected Objects in Text for Visual Question Answering , author =. 2019 , journal =. 1908.05054 , archiveprefix =

  57. [59]

    Applications of Generative Adversarial Networks (Gans):

    Alqahtani, Hamed and. Applications of Generative Adversarial Networks (Gans):. 2021 , journal =

  58. [60]

    Bottom-up and Top-down Attention for Image Captioning and Visual Question Answering , booktitle =

    Anderson, Peter and He, Xiaodong and Buehler, Chris and Teney, Damien and Johnson, Mark and Gould, Stephen and Zhang, Lei , year =. Bottom-up and Top-down Attention for Image Captioning and Visual Question Answering , booktitle =

  59. [61]

    Anderson, Peter and Fernando, Basura and Johnson, Mark and Gould, Stephen , year =. Spice:. European

  60. [62]

    Learning to Compose Neural Networks for Question Answering , booktitle =

    Andreas, Jacob and Rohrbach, Marcus and Darrell, Trevor and Klein, Dan , year =. Learning to Compose Neural Networks for Question Answering , booktitle =

  61. [63]

    Neural Module Networks , booktitle =

    Andreas, Jacob and Rohrbach, Marcus and Darrell, Trevor and Klein, Dan , year =. Neural Module Networks , booktitle =

  62. [64]

    Relationships from

    Andrews, Martin and AI, Red Dragon and Witteveen, Sam , keywords =. Relationships from

  63. [65]

    and Parikh, Devi , year =

    Antol, Stanislaw and Agrawal, Aishwarya and Lu, Jiasen and Mitchell, Margaret and Batra, Dhruv and Lawrence Zitnick, C. and Parikh, Devi , year =. Vqa:

  64. [66]

    Ardino, Pierfrancesco and De Nadai, Marco and Lepri, Bruno and Ricci, Elisa and Lathuili. Click. Proceedings of the. 2021 , pages =

  65. [67]

    Arnab, Anurag and Dehghani, Mostafa and Heigold, Georg and Sun, Chen and Lu. Vivit:. Proceedings of the. 2021 , pages =

  66. [68]

    Variational Transformer Networks for Layout Generation , booktitle =

    Arroyo, Diego Martin and Postels, Janis and Tombari, Federico , year =. Variational Transformer Networks for Layout Generation , booktitle =

  67. [69]

    Avrahami, Omri and Lischinski, Dani and Fried, Ohad , year =. Blended. arXiv preprint arXiv:2111.14818 , eprint =

  68. [70]

    Avrahami, Omri and Fried, Ohad and Lischinski, Dani , year =. Blended. doi:10.48550/arXiv.2206.02779 , urldate =. arxiv , file =:2206.02779 , primaryclass =

  69. [71]

    Stochastic Variational Video Prediction

    Stochastic Variational Video Prediction , author =. 2017 , journal =. 1710.11252 , archiveprefix =

  70. [72]

    and Levine, Sergey , year =

    Babaeizadeh, Mohammad and Finn, Chelsea and Erhan, Dumitru and Campbell, Roy H. and Levine, Sergey , year =. Stochastic

  71. [73]

    Neural Machine Translation by Jointly Learning to Align and Translate

    Neural Machine Translation by Jointly Learning to Align and Translate , author =. 2014 , journal =. arxiv , file =:1409.0473 , urldate =

  72. [74]

    doi:10.48550/arXiv.2206.14797 , urldate =

    Bahmani, Sherwin and Park, Jeong Joon and Paschalidou, Despoina and Tang, Hao and Wetzstein, Gordon and Guibas, Leonidas and Van Gool, Luc and Timofte, Radu , year =. doi:10.48550/arXiv.2206.14797 , urldate =. arxiv , keywords =:2206.14797 , primaryclass =

  73. [75]

    航空计算技术 , volume =

    白, 林亭 and 文, 鹏程 and 李, 亚晖 , year =. 航空计算技术 , volume =

  74. [76]

    Frozen in Time:

    Bain, Max and Nagrani, Arsha and Varol, G. Frozen in Time:. Proceedings of the. 2021 , pages =

  75. [77]

    Conditional

    Balaji, Yogesh and Min, Martin Renqiang and Bai, Bing and Chellappa, Rama and Graf, Hans Peter , year =. Conditional

  76. [78]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Balaji, Yogesh and Nah, Seungjun and Huang, Xun and Vahdat, Arash and Song, Jiaming and Kreis, Karsten and Aittala, Miika and Aila, Timo and Laine, Samuli and Catanzaro, Bryan , year =. arXiv preprint arXiv:2211.01324 , eprint =

  77. [79]

    and Kazemi, Hamid and Huang, Furong and Goldblum, Micah and Geiping, Jonas and Goldstein, Tom , year =

    Bansal, Arpit and Borgnia, Eitan and Chu, Hong-Min and Li, Jie S. and Kazemi, Hamid and Huang, Furong and Goldblum, Micah and Geiping, Jonas and Goldstein, Tom , year =. Cold Diffusion:. arXiv preprint arXiv:2208.09392 , eprint =

  78. [80]

    Analytic-

    Bao, Fan and Li, Chongxuan and Zhu, Jun and Zhang, Bo , year =. Analytic-. International

  79. [81]

    Bao, Fan and Nie, Shen and Xue, Kaiwen and Li, Chongxuan and Pu, Shi and Wang, Yaole and Yue, Gang and Cao, Yue and Su, Hang and Zhu, Jun , year =. One. arXiv preprint arXiv:2303.06555 , eprint =

  80. [82]

    Bao, Fan and Nie, Shen and Xue, Kaiwen and Li, Chongxuan and Pu, Shi and Wang, Yaole and Yue, Gang and Cao, Yue and Su, Hang and Zhu, Jun , year =. One. doi:10.48550/arXiv.2303.06555 , urldate =. arxiv , file =:2303.06555 , primaryclass =
