DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

Chenfei Wu; Gong Ming; Houqiang Li; Jian Liang; Jie Shi; Nan Duan; Shengming Yin

arxiv: 2308.08089 · v1 · pith:UDEIT65Rnew · submitted 2023-08-16 · 💻 cs.CV

Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, Nan Duan

Pith reviewed 2026-05-20 12:57 UTC · model grok-4.3

classification 💻 cs.CV

keywords video generation · diffusion models · trajectory control · fine-grained control · open-domain generation · multimodal conditioning · motion guidance

The pith

DragNUWA achieves fine-grained control in open-domain video generation by integrating text, image, and trajectory information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DragNUWA, a diffusion-based model for video generation that accepts text, image, and trajectory inputs together. Text supplies semantic guidance, the image fixes spatial layout, and the trajectory dictates motion paths. Earlier methods handled only one of these signals or worked only on simple datasets like Human3.6M, which restricted their use on real scenes and complex motions. DragNUWA adds a Trajectory Sampler to accept arbitrary curves on any image, Multiscale Fusion to blend control at different resolutions, and an Adaptive Training procedure to keep generated frames consistent with the inputs. A sympathetic reader would see this as a step toward video creation tools that let users direct content, placement, and movement with combined instructions.

Core claim

DragNUWA is an open-domain diffusion-based video generation model that simultaneously introduces text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. It resolves limited open-domain trajectory control by proposing a Trajectory Sampler to enable arbitrary trajectories, Multiscale Fusion to control trajectories at different granularities, and an Adaptive Training strategy to generate consistent videos that follow the trajectories.

What carries the argument

Trajectory Sampler for open-domain arbitrary trajectories, Multiscale Fusion for varying control granularities, and Adaptive Training for motion consistency, all combined with text and image conditioning in a diffusion video model.

If this is right

Videos can be created that follow user-specified arbitrary curved trajectories overlaid on any input image.
Semantic content from text, spatial details from the image, and temporal motion from the trajectory are controlled at the same time.
The model handles open-domain scenes rather than being limited to narrow datasets like Human3.6M.
Generated sequences maintain consistency with the trajectory across frames.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same multi-signal conditioning pattern could be tested on related tasks such as image animation or 3D scene generation.
Interactive interfaces might let users sketch paths directly on an image to define desired motion.
The method could reduce reliance on single-modality training data when scaling controllable generation systems.

Load-bearing premise

The Trajectory Sampler, Multiscale Fusion, and Adaptive Training can reliably produce consistent videos that follow arbitrary complex curved trajectories on open-domain images without motion artifacts or semantic drift.

What would settle it

Generate videos from complex curved trajectories drawn on diverse real-world open-domain images and measure whether the motion paths are followed accurately while content and semantics remain stable and free of visible artifacts.
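One way to make this test concrete is a per-frame distance between the drawn path and the path the content actually follows in the generated video. The sketch below assumes the followed path has already been recovered by some point tracker, which this page does not specify; `mean_trajectory_error` and the sample coordinates are hypothetical and are not taken from the paper.

```python
import math


def mean_trajectory_error(target, observed):
    """Mean Euclidean distance (in pixels) between paired per-frame points.

    target:   list of (x, y) points the user asked the content to follow.
    observed: list of (x, y) points recovered from the generated video
              (assumed to come from an external tracker, one point per frame).
    """
    if not target or len(target) != len(observed):
        raise ValueError("trajectories must be non-empty and equally long")
    distances = [math.dist(p, q) for p, q in zip(target, observed)]
    return sum(distances) / len(distances)


# Hypothetical 4-frame example: the tracked path drifts on frames 2 and 4.
target = [(0.0, 0.0), (10.0, 0.0), (20.0, 5.0), (30.0, 15.0)]
observed = [(0.0, 0.0), (10.0, 3.0), (20.0, 5.0), (33.0, 19.0)]
error = mean_trajectory_error(target, observed)  # per-frame errors: 0, 3, 0, 5
```

A score like this only covers path adherence; stability of content and semantics would still need a separate measure or a user study.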

Original abstract

Controllable video generation has gained significant attention in recent years. However, two main limitations persist: Firstly, most existing works focus on either text, image, or trajectory-based control, leading to an inability to achieve fine-grained control in videos. Secondly, trajectory control research is still in its early stages, with most experiments being conducted on simple datasets like Human3.6M. This constraint limits the models' capability to process open-domain images and effectively handle complex curved trajectories. In this paper, we propose DragNUWA, an open-domain diffusion-based video generation model. To tackle the issue of insufficient control granularity in existing works, we simultaneously introduce text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, we propose trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to control trajectories in different granularities, and an Adaptive Training (AT) strategy to generate consistent videos following trajectories. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation. The homepage link is https://www.microsoft.com/en-us/research/project/dragnuwa/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DragNUWA adds practical text-image-trajectory control to diffusion video models via three targeted modules, but the superiority claim needs concrete numbers to land.

The one thing to know about this paper is that it combines text, image, and trajectory conditioning in a single diffusion model for video generation, using three new components to handle open-domain and complex paths. What is new is the specific trio of Trajectory Sampler, Multiscale Fusion, and Adaptive Training. These target the problems of limited granularity and simple datasets mentioned in the abstract. The approach extends standard diffusion conditioning in a logical way, and the stress-test note confirms there is no internal contradiction in how the pieces fit together.

The paper does well at framing the limitations of existing work and proposing matching solutions. It is honest about the early stage of trajectory control research and focuses on practical open-domain use.

Soft spots are mainly around verification. The abstract claims superior performance from experiments, yet the provided summary lacks specific metrics or ablation results. This leaves the central claim resting on assertion rather than on reported numbers. If the full paper includes solid quantitative evidence and handles potential motion artifacts, that would strengthen it considerably. The assumption that the modules reliably avoid semantic drift on complex curves is plausible but needs checking against results.

This work is for researchers in video synthesis who want finer control knobs. A reader looking for ideas on multi-signal conditioning would find value in the method section. It deserves a serious referee because the idea addresses a clear gap and the architecture is coherent. I recommend sending it to peer review with requests for more detailed evaluations.

Referee Report

2 major / 2 minor

Summary. The paper introduces DragNUWA, an open-domain diffusion-based video generation model that integrates text, image, and trajectory inputs to enable fine-grained control over video content from semantic, spatial, and temporal perspectives. It proposes a Trajectory Sampler (TS) to support arbitrary trajectories on open-domain images, Multiscale Fusion (MF) to handle trajectories at varying granularities, and an Adaptive Training (AT) strategy to produce consistent videos that follow the specified trajectories. The central claim is that this combination yields superior performance in fine-grained controllable video generation compared to prior single-modality or limited-domain approaches, with experiments asserted to validate the effectiveness.

Significance. If the results hold, the work would advance multi-modal controllable video generation by addressing the limitations of single-modality control and restriction to simple datasets such as Human3.6M. Enabling open-domain handling of complex curved trajectories via the proposed TS, MF, and AT components could support more precise applications in animation and content creation. The structured decomposition of trajectory modeling provides a clear technical contribution to the diffusion conditioning literature.

major comments (2)

[Experiments] Experiments section: the abstract asserts experimental validation and superior performance, yet no quantitative metrics (e.g., FID, FVD, or user-study scores), dataset details, or ablation results on TS/MF/AT are provided in the summary; without these the central claim of effectiveness rests on an unverified assertion and requires explicit tables comparing against baselines on open-domain data.
[Method] Method section, description of Adaptive Training (AT): the strategy is presented as ensuring consistency and avoiding motion artifacts or semantic drift for arbitrary curved trajectories, but the concrete loss formulation, sampling schedule, or conditioning weight schedule is not specified; this leaves the load-bearing claim that AT reliably produces artifact-free output ungrounded in the provided equations or pseudocode.

minor comments (2)

The homepage link is given but the manuscript does not include a direct pointer to the released code or model weights, which would aid reproducibility.
[Method] Notation for the three conditioning modalities (text, image, trajectory) should be introduced with explicit symbols in the method overview to improve clarity when describing the fusion step.
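The notation request in the last minor comment could be met along the following lines. The symbols are hypothetical, and the objective shown is the generic noise-prediction loss used for conditional latent diffusion models; this page does not confirm that DragNUWA trains with exactly this form, and the Adaptive Training schedule is not represented here.

```latex
% Hypothetical notation, not taken from the paper:
%   c_{\text{txt}}  : text condition (semantic)
%   c_{\text{img}}  : image condition (spatial)
%   c_{\text{traj}} : trajectory condition (temporal)
%   z_t             : noised video latent at diffusion step t
\mathcal{L}
  = \mathbb{E}_{z_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}
    \left[
      \left\|
        \epsilon - \epsilon_\theta\!\left(z_t,\ t,\ c_{\text{txt}},\ c_{\text{img}},\ c_{\text{traj}}\right)
      \right\|_2^2
    \right]
```

With symbols of this kind fixed in the method overview, the fusion step can refer to each modality unambiguously.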

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. We address each major comment point by point below and outline the revisions we will make to strengthen the manuscript.

Referee: [Experiments] Experiments section: the abstract asserts experimental validation and superior performance, yet no quantitative metrics (e.g., FID, FVD, or user-study scores), dataset details, or ablation results on TS/MF/AT are provided in the summary; without these the central claim of effectiveness rests on an unverified assertion and requires explicit tables comparing against baselines on open-domain data.

Authors: We appreciate the referee highlighting the need for clearer quantitative support. The full manuscript contains experimental results including user studies demonstrating superior performance; however, to directly address this concern, we will add explicit tables in the revised version reporting quantitative metrics (such as FID and FVD where relevant), dataset details, and ablation studies isolating the contributions of TS, MF, and AT, with direct comparisons to baselines on open-domain data. revision: yes
Referee: [Method] Method section, description of Adaptive Training (AT): the strategy is presented as ensuring consistency and avoiding motion artifacts or semantic drift for arbitrary curved trajectories, but the concrete loss formulation, sampling schedule, or conditioning weight schedule is not specified; this leaves the load-bearing claim that AT reliably produces artifact-free output ungrounded in the provided equations or pseudocode.

Authors: We agree that the Adaptive Training description would benefit from greater specificity. In the revised manuscript, we will include the concrete loss formulation for AT, along with the sampling schedule and conditioning weight schedule. These additions will better substantiate the claims regarding consistency and the avoidance of motion artifacts or semantic drift for arbitrary trajectories. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes DragNUWA as a diffusion-based architecture that adds text, image, and trajectory conditioning, along with three explicitly defined new components (Trajectory Sampler for open-domain paths, Multiscale Fusion for granularity, and Adaptive Training for consistency). These elements are introduced to address stated limitations in prior work and are described directly in the method without any equations or claims that reduce the performance gains to a fitted parameter, self-definition, or self-citation chain. The derivation remains self-contained as a coherent extension of standard diffusion conditioning, with validation left to experiments rather than internal reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The proposal rests on standard diffusion-model conditioning assumptions plus three newly introduced modules whose effectiveness is asserted rather than derived from first principles.

free parameters (1)

diffusion conditioning weights for text/image/trajectory
Standard learned or hand-tuned scalars that balance the three control signals during sampling.
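To illustrate what such scalars might do at sampling time, here is a minimal sketch of a multi-condition guidance rule in the style of classifier-free guidance. The rule itself, the function name `combine_guidance`, and all numbers are assumptions for illustration; this page does not state which combination rule DragNUWA uses.

```python
def combine_guidance(eps_uncond, eps_conds, weights):
    """Blend noise predictions: eps_uncond + sum_k w_k * (eps_k - eps_uncond).

    eps_uncond: noise prediction with every condition dropped (flat list).
    eps_conds:  one noise prediction per condition, same length as eps_uncond.
    weights:    one scalar per condition; larger means stronger adherence.
    """
    if len(eps_conds) != len(weights):
        raise ValueError("need exactly one weight per condition")
    combined = list(eps_uncond)
    for eps_k, w_k in zip(eps_conds, weights):
        for i, (cond_value, uncond_value) in enumerate(zip(eps_k, eps_uncond)):
            combined[i] += w_k * (cond_value - uncond_value)
    return combined


# Hypothetical two-element predictions for text, image, and trajectory.
eps_uncond = [0.0, 1.0]
eps_text = [1.0, 1.0]
eps_image = [0.0, 2.0]
eps_traj = [2.0, 0.0]
combined = combine_guidance(
    eps_uncond,
    [eps_text, eps_image, eps_traj],
    [1.0, 0.5, 2.0],  # trajectory weighted most heavily in this toy example
)
```

Setting every weight to zero returns the unconditional prediction, which is one quick sanity check on any rule of this shape.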

axioms (1)

domain assumption · Diffusion models can be jointly conditioned on semantic, spatial, and temporal signals without destructive interference.
Invoked when the paper states that simultaneous introduction of the three inputs yields fine-grained control.

invented entities (3)

Trajectory Sampler (TS) · no independent evidence
purpose: Enable open-domain control of arbitrary trajectories
New module introduced to sample points along user-drawn paths on complex scenes.
Multiscale Fusion (MF) · no independent evidence
purpose: Control trajectories at different granularities
New fusion mechanism to combine trajectory information across scales.
Adaptive Training (AT) strategy · no independent evidence
purpose: Generate consistent videos following trajectories
New training procedure claimed to enforce trajectory adherence.
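The first two entries can be made more tangible with a small sketch: sampling evenly spaced points along a user-drawn polyline, then viewing those points on progressively coarser grids. This is only an illustration of the two ideas as worded above. `resample_path` and `to_grid_cells` are hypothetical helpers, and the paper's actual sampler and learned fusion mechanism are not described on this page.

```python
import math


def resample_path(points, num_samples):
    """Return num_samples points spaced evenly by arc length along a polyline."""
    if len(points) < 2 or num_samples < 2:
        raise ValueError("need at least two points and two samples")
    seg_lengths = [math.dist(a, b) for a, b in zip(points, points[1:])]
    total = sum(seg_lengths)
    if total == 0:
        raise ValueError("path has zero length")
    samples = []
    seg = 0        # index of the segment currently being walked
    walked = 0.0   # arc length covered by the segments before `seg`
    for k in range(num_samples):
        target = total * k / (num_samples - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(seg_lengths) - 1 and walked + seg_lengths[seg] < target:
            walked += seg_lengths[seg]
            seg += 1
        length = seg_lengths[seg]
        t = (target - walked) / length if length > 0 else 0.0
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        samples.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return samples


def to_grid_cells(samples, cell_sizes):
    """Map each sample to (col, row) cell indices for every grid cell size."""
    return {
        size: [(int(x // size), int(y // size)) for x, y in samples]
        for size in cell_sizes
    }


# Hypothetical L-shaped stroke: 4 px right, then 4 px down, sampled for 5 frames.
stroke = [(0, 0), (4, 0), (4, 4)]
samples = resample_path(stroke, 5)
cells = to_grid_cells(samples, [1, 2, 4])
```

At the coarsest grid several consecutive samples fall into the same cell, which shows why a single scale may not capture both the overall direction and the fine detail of a path.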

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls
cs.CV 2026-05 unverdicted novelty 7.0

Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency sup...
Functionalization via Structure Completion and Motion Rectification
cs.CV 2026-05 unverdicted novelty 7.0

Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture wi...
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
cs.CV 2026-05 unverdicted novelty 7.0

R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency
cs.CV 2026-05 unverdicted novelty 7.0

Eulerian adjacent-frame motion fields with bidirectional cycle consistency checks enable faster parallel training and fewer artifacts in diffusion model image animation compared to initial-frame Lagrangian guidance.
Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency
cs.CV 2026-05 unverdicted novelty 7.0

Eulerian adjacent-frame motion guidance plus bidirectional geometric consistency improves training speed, temporal coherence, and artifact reduction in diffusion-based image animation.
Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting
cs.CV 2026-04 unverdicted novelty 7.0

Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.
MoRight: Motion Control Done Right
cs.CV 2026-04 unverdicted novelty 7.0

MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
ASTRA: Let Arbitrary Subjects Transform in Video Editing
cs.CV 2025-10 unverdicted novelty 7.0

ASTRA is a plug-and-play training-free method for precise multi-subject video editing that uses prompt-guided multimodal alignment and prior-based mask retargeting to avoid attention dilution and boundary issues.
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
cs.CV 2023-07 unverdicted novelty 7.0

A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.
ReactiveGWM: Steering NPC in Reactive Game World Models
cs.CV 2026-05 unverdicted novelty 6.0

ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.
Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency
cs.CV 2026-05 unverdicted novelty 6.0

Eulerian adjacent-frame motion guidance plus bidirectional geometric consistency yields faster training and more coherent diffusion-based image animation than first-frame reference methods.
Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing
cs.RO 2026-05 unverdicted novelty 6.0

A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from s...
PhyCo: Learning Controllable Physical Priors for Generative Motion
cs.CV 2026-04 unverdicted novelty 6.0

PhyCo adds continuous physical control to video diffusion models via physics-supervised fine-tuning on a large simulation dataset and VLM-guided rewards, yielding measurable gains in physical realism on the Physics-IQ...
DailyArt: Discovering Articulation from Single Static Images via Latent Dynamics
cs.CV 2026-04 unverdicted novelty 6.0

DailyArt recovers full joint parameters of articulated objects from a single static image by synthesizing an opened state and comparing discrepancies, supporting downstream part-level novel state synthesis.
HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
cs.CV 2026-03 unverdicted novelty 6.0

HVG-3D uses a 3D-aware diffusion architecture with ControlNet to synthesize high-fidelity hand-object interaction videos from 3D control signals, achieving state-of-the-art spatial fidelity and temporal coherence on t...
Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models
cs.CV 2025-11 unverdicted novelty 6.0

A feed-forward video latent transformer that predicts time-varying 3D Gaussian primitives from one image to produce controllable 4D scenes with appearance, geometry, and motion.
ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
cs.CV 2024-09 unverdicted novelty 6.0

ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
cs.CV 2024-04 unverdicted novelty 6.0

CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
cs.CV 2023-10 unverdicted novelty 6.0

Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
cs.CV 2026-05 unverdicted novelty 5.0

R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.
Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence
cs.CV 2026-04 unverdicted novelty 5.0

Tora3 uses shared object trajectories as kinematic priors to jointly guide visual motion and acoustic events in audio-video generation, improving realism and synchronization.
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
cs.CV 2023-11 unverdicted novelty 5.0

I2VGen-XL applies cascaded diffusion models with a base stage for semantic preservation via hierarchical encoders and a refinement stage for detail and resolution, trained on 35 million text-video and 6 billion text-i...
Evolution of Video Generative Foundations
cs.CV 2026-04 unverdicted novelty 2.0

This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

297 extracted references · 297 canonical work pages · cited by 20 Pith papers · 44 internal anchors

[1]

Click To Move : Controlling Video Generation With Sparse Motion

Pierfrancesco Ardino, Marco De Nadai, Bruno Lepri, Elisa Ricci, and St \'e phane Lathuili \`e re. Click To Move : Controlling Video Generation With Sparse Motion . In Proceedings of the IEEE / CVF International Conference on Computer Vision , pp.\ 14749--14758, 2021

work page 2021
[2]

Frozen in time: A joint video and image encoder for end-to-end retrieval

Max Bain, Arsha Nagrani, G \"u l Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE / CVF International Conference on Computer Vision , pp.\ 1728--1738, 2021

work page 2021
[3]

Ipoke: Poking a still image for controlled stochastic video synthesis

Andreas Blattmann, Timo Milbich, Michael Dorkenwald, and Bj \"o rn Ommer. Ipoke: Poking a still image for controlled stochastic video synthesis. In Proceedings of the IEEE / CVF International Conference on Computer Vision , pp.\ 14707--14717, 2021 a

work page 2021
[4]

Understanding Object Dynamics for Interactive Image-to-Video Synthesis

Andreas Blattmann, Timo Milbich, Michael Dorkenwald, and Bjorn Ommer. Understanding Object Dynamics for Interactive Image-to-Video Synthesis . In Proceedings of the IEEE / CVF Conference on Computer Vision and Pattern Recognition , pp.\ 5171--5181, 2021 b

work page 2021
[5]

Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A. Efros. Everybody Dance Now . In Proceedings of the IEEE / CVF International Conference on Computer Vision , pp.\ 5933--5942, 2019

work page 2019
[7]

Recurrent Environment Simulators

Silvia Chiappa, S \'e bastien Racaniere, Daan Wierstra, and Shakir Mohamed. Recurrent Environment Simulators . In International Conference on Learning Representations , November 2016

work page 2016
[9]

Controllable Video Generation With Sparse Trajectories

Zekun Hao, Xun Huang, and Serge Belongie. Controllable Video Generation With Sparse Trajectories . In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp.\ 7854--7863, 2018

work page 2018
[10]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, and David J. Fleet. Imagen video: High video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo : Large-scale Pretraining for Text-to-Video Generation via Transformers . arXiv preprint arXiv:2205.15868, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[12]

Make It Move : Controllable Image-to-Video Generation With Text Descriptions

Yaosi Hu, Chong Luo, and Zhenzhong Chen. Make It Move : Controllable Image-to-Video Generation With Text Descriptions . In Proceedings of the IEEE / CVF Conference on Computer Vision and Pattern Recognition , pp.\ 18219--18228, 2022

work page 2022
[13]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[14]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2 : Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models . arXiv preprint arXiv:2301.12597, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Xiaodan Liang, Lisa Lee, Wei Dai, and Eric P. Xing. Dual Motion GAN for Future-Flow Embedded Video Prediction . In Proceedings of the IEEE International Conference on Computer Vision , pp.\ 1744--1752, 2017

work page 2017
[16]

Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning

William Lotter, Gabriel Kreiman, and David Cox. Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning . In International Conference on Learning Representations , November 2016

work page 2016
[17]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, and Jack Clark. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning , pp.\ 8748--8763. PMLR , 2021

work page 2021
[18]

High- Resolution Image Synthesis With Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj \"o rn Ommer. High- Resolution Image Synthesis With Latent Diffusion Models . In Proceedings of the IEEE / CVF Conference on Computer Vision and Pattern Recognition , pp.\ 10684--10695, 2022

work page 2022
[19]

Make- A-Video : Text-to-Video Generation without Text-Video Data , September 2022

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make- A-Video : Text-to-Video Generation without Text-Video Data , September 2022

work page 2022
[20]

Unsupervised Learning of Video Representations using LSTMs

Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised Learning of Video Representations using LSTMs . In Proceedings of the 32nd International Conference on Machine Learning , pp.\ 843--852. PMLR , June 2015

work page 2015
[21]

Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions

Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions . In ICLR , September 2022

work page 2022
[22]

The Pose Knows : Video Forecasting by Generating Pose Futures

Jacob Walker, Kenneth Marino, Abhinav Gupta, and Martial Hebert. The Pose Knows : Video Forecasting by Generating Pose Futures . In Proceedings of the IEEE International Conference on Computer Vision , pp.\ 3332--3341, 2017

work page 2017
[23]

Deep High-Resolution Representation Learning for Visual Recognition

Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep High-Resolution Representation Learning for Visual Recognition . IEEE Transactions on Pattern Analysis and Machine Intelligence, 43 0 (10): 0 3349--3364, October 2021. ISSN 1939-3539. doi:10.1109/TPAMI...

work page doi:10.1109/tpami.2020.2983686 2021
[24]

Few-shot video-to-video synthesis

Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Jan Kautz, and Bryan Catanzaro. Few-shot video-to-video synthesis. In Proceedings of the 33rd International Conference on Neural Information Processing Systems , pp.\ 5013--5024, Red Hook, NY, USA , December 2019. Curran Associates Inc

work page 2019
[25]

VideoComposer : Compositional Video Synthesis with Motion Controllability , June 2023

Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. VideoComposer : Compositional Video Synthesis with Motion Controllability , June 2023

work page 2023
[26]

Hierarchical Long-term Video Prediction without Supervision

Nevan Wichers, Ruben Villegas, Dumitru Erhan, and Honglak Lee. Hierarchical Long-term Video Prediction without Supervision . In Proceedings of the 35th International Conference on Machine Learning , pp.\ 6038--6046. PMLR , July 2018

work page 2018
[27]

GODIVA : Generating Open-DomaIn Videos from nAtural Descriptions

Chenfei Wu, Lun Huang, Qianxi Zhang, Binyang Li, Lei Ji, Fan Yang, Guillermo Sapiro, and Nan Duan. GODIVA : Generating Open-DomaIn Videos from nAtural Descriptions . arXiv:2104.14806 [cs], April 2021

work page arXiv 2021
[28]

N " UWA : Visual Synthesis Pre-training for Neural visUal World creAtion

Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. N " UWA : Visual Synthesis Pre-training for Neural visUal World creAtion . In Proceedings of the European Conference on Computer Vision ( ECCV ) , 2022

work page 2022
[29]

Future Video Synthesis With Object Motion Prediction

Yue Wu, Rongrong Gao, Jaesik Park, and Qifeng Chen. Future Video Synthesis With Object Motion Prediction . In Proceedings of the IEEE / CVF Conference on Computer Vision and Pattern Recognition , pp.\ 5539--5548, 2020

work page 2020
[30]

Unifying flow, stereo and depth estimation

Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Fisher Yu, Dacheng Tao, and Andreas Geiger. Unifying flow, stereo and depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

work page 2023
[31]

NUWA-XL : Diffusion over Diffusion for eXtremely Long Video Generation

Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, and Fan Yang. NUWA-XL : Diffusion over Diffusion for eXtremely Long Video Generation . arXiv preprint arXiv:2303.12346, 2023

work page arXiv 2023
[32]

DTVNet : Dynamic Time-Lapse Video Generation via Single Still Image

Jiangning Zhang, Chao Xu, Liang Liu, Mengmeng Wang, Xia Wu, Yong Liu, and Yunliang Jiang. DTVNet : Dynamic Time-Lapse Video Generation via Single Still Image . In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (eds.), Computer Vision ECCV 2020 , Lecture Notes in Computer Science , pp.\ 300--315, Cham , 2020. Springer International Publi...

work page doi:10.1007/978-3-030-58558-7_18 2020
