DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
Pith reviewed 2026-05-20 12:57 UTC · model grok-4.3
The pith
DragNUWA achieves fine-grained control in open-domain video generation by integrating text, image, and trajectory information.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DragNUWA is an open-domain diffusion-based video generation model that simultaneously introduces text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. It resolves limited open-domain trajectory control by proposing a Trajectory Sampler to enable arbitrary trajectories, Multiscale Fusion to control trajectories at different granularities, and an Adaptive Training strategy to generate consistent videos that follow the trajectories.
What carries the argument
Trajectory Sampler for open-domain arbitrary trajectories, Multiscale Fusion for varying control granularities, and Adaptive Training for motion consistency, all combined with text and image conditioning in a diffusion video model.
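To make "varying control granularities" concrete, the toy sketch below rasterizes a sparse trajectory into a displacement map and average-pools it into a pyramid of coarser maps. It is an illustration under assumptions, not the paper's Multiscale Fusion: the function names (`rasterize`, `downsample`, `pyramid`), the single map that collapses all time steps, and the 2x2 average pooling are all hypothetical choices made here.

```python
from typing import List, Tuple

Vec = Tuple[float, float]
FlowMap = List[List[Vec]]  # indexed as [row][col] -> (dx, dy)


def rasterize(points: List[Tuple[int, int]], height: int, width: int) -> FlowMap:
    """Store each step's displacement at the pixel where the step starts."""
    flow: FlowMap = [[(0.0, 0.0)] * width for _ in range(height)]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if 0 <= y0 < height and 0 <= x0 < width:
            flow[y0][x0] = (float(x1 - x0), float(y1 - y0))
    return flow


def downsample(flow: FlowMap) -> FlowMap:
    """Halve the resolution by averaging each 2x2 block (assumed pooling)."""
    h, w = len(flow) // 2, len(flow[0]) // 2
    out: FlowMap = []
    for r in range(h):
        row = []
        for c in range(w):
            block = [flow[2 * r + i][2 * c + j] for i in (0, 1) for j in (0, 1)]
            row.append((sum(v[0] for v in block) / 4.0,
                        sum(v[1] for v in block) / 4.0))
        out.append(row)
    return out


def pyramid(flow: FlowMap, levels: int) -> List[FlowMap]:
    """Return the full-resolution map followed by levels - 1 coarser maps."""
    maps = [flow]
    for _ in range(levels - 1):
        maps.append(downsample(maps[-1]))
    return maps


# Hypothetical usage: a short L-shaped path on a 4x4 grid.
maps = pyramid(rasterize([(0, 0), (2, 0), (2, 2)], 4, 4), levels=3)
```

Each coarser level spreads the same motion hint over a larger image region, which is one plausible reading of controlling a trajectory at several granularities.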
If this is right
- Videos can be created that follow user-specified arbitrary curved trajectories overlaid on any input image.
- Semantic content from text, spatial details from the image, and temporal motion from the trajectory are controlled at the same time.
- The model handles open-domain scenes rather than being limited to narrow datasets like Human3.6M.
- Generated sequences maintain consistency with the trajectory across frames.
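A user-drawn curve has to be turned into one target position per frame before it can act as a motion signal. The sketch below shows one simple way to do that, by resampling a polyline at equal arc-length steps. The function name `resample_path` and the constant-speed assumption are hypothetical; the source does not say how DragNUWA converts drawn paths into per-frame trajectories.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def resample_path(path: List[Point], num_frames: int) -> List[Point]:
    """Return num_frames points spaced evenly by arc length along a polyline."""
    if len(path) < 2 or num_frames < 2:
        raise ValueError("need at least two path points and two frames")
    seg = [math.dist(a, b) for a, b in zip(path, path[1:])]
    total = sum(seg)
    if total == 0:
        return [path[0]] * num_frames  # degenerate path: no motion
    out: List[Point] = []
    i, walked = 0, 0.0  # current segment index, length of all earlier segments
    for k in range(num_frames):
        target = total * k / (num_frames - 1)
        # Advance to the segment that contains the target arc length.
        while i < len(seg) - 1 and walked + seg[i] < target:
            walked += seg[i]
            i += 1
        t = (target - walked) / seg[i] if seg[i] > 0 else 0.0
        t = min(max(t, 0.0), 1.0)
        (x0, y0), (x1, y1) = path[i], path[i + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


# Hypothetical usage: an L-shaped stroke spread over five frames.
frames = resample_path([(0, 0), (4, 0), (4, 4)], num_frames=5)
```

Equal arc-length spacing implies constant speed along the path; a real interface might instead keep the timing of the user's stroke.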
Where Pith is reading between the lines
- The same multi-signal conditioning pattern could be tested on related tasks such as image animation or 3D scene generation.
- Interactive interfaces might let users sketch paths directly on an image to define desired motion.
- The method could reduce reliance on single-modality training data when scaling controllable generation systems.
Load-bearing premise
The Trajectory Sampler, Multiscale Fusion, and Adaptive Training can reliably produce consistent videos that follow arbitrary complex curved trajectories on open-domain images without motion artifacts or semantic drift.
What would settle it
Generate videos from complex curved trajectories drawn on diverse real-world open-domain images and measure whether the motion paths are followed accurately while content and semantics remain stable and free of visible artifacts.
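One way to make "followed accurately" measurable is to compare the drawn trajectory with the positions of the same point tracked in the generated frames. The sketch below computes the mean and maximum per-frame distance. It assumes the tracked positions come from some external point tracker that is not specified here, and `trajectory_error` is a hypothetical helper rather than a metric reported in the paper.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def trajectory_error(target: List[Point], tracked: List[Point]) -> Tuple[float, float]:
    """Return (mean, max) Euclidean distance in pixels between per-frame positions."""
    if not target or len(target) != len(tracked):
        raise ValueError("trajectories must be non-empty and the same length")
    dists = [math.dist(p, q) for p, q in zip(target, tracked)]
    return sum(dists) / len(dists), max(dists)


# Hypothetical usage: the tracked point drifts upward over three frames.
mean_err, max_err = trajectory_error(
    [(0, 0), (1, 0), (2, 0)],
    [(0, 0), (1, 3), (2, 4)],
)
```

A low mean with a high maximum would point to a brief departure from the path, while a steadily growing per-frame distance would suggest drift. This measures motion adherence only; content and semantic stability need separate checks.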
Original abstract
Controllable video generation has gained significant attention in recent years. However, two main limitations persist: Firstly, most existing works focus on either text, image, or trajectory-based control, leading to an inability to achieve fine-grained control in videos. Secondly, trajectory control research is still in its early stages, with most experiments being conducted on simple datasets like Human3.6M. This constraint limits the models' capability to process open-domain images and effectively handle complex curved trajectories. In this paper, we propose DragNUWA, an open-domain diffusion-based video generation model. To tackle the issue of insufficient control granularity in existing works, we simultaneously introduce text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, we propose trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to control trajectories in different granularities, and an Adaptive Training (AT) strategy to generate consistent videos following trajectories. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation. The homepage link is https://www.microsoft.com/en-us/research/project/dragnuwa/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DragNUWA, an open-domain diffusion-based video generation model that integrates text, image, and trajectory inputs to enable fine-grained control over video content from semantic, spatial, and temporal perspectives. It proposes a Trajectory Sampler (TS) to support arbitrary trajectories on open-domain images, Multiscale Fusion (MF) to handle trajectories at varying granularities, and an Adaptive Training (AT) strategy to produce consistent videos that follow the specified trajectories. The central claim is that this combination yields superior performance in fine-grained controllable video generation compared to prior single-modality or limited-domain approaches, with experiments asserted to validate the effectiveness.
Significance. If the results hold, the work would advance multi-modal controllable video generation by addressing the limitations of single-modality control and restriction to simple datasets such as Human3.6M. Enabling open-domain handling of complex curved trajectories via the proposed TS, MF, and AT components could support more precise applications in animation and content creation. The structured decomposition of trajectory modeling provides a clear technical contribution to the diffusion conditioning literature.
major comments (2)
- [Experiments] Experiments section: the abstract asserts experimental validation and superior performance, yet no quantitative metrics (e.g., FID, FVD, or user-study scores), dataset details, or ablation results on TS/MF/AT are provided in the summary; without these the central claim of effectiveness rests on an unverified assertion and requires explicit tables comparing against baselines on open-domain data.
- [Method] Method section, description of Adaptive Training (AT): the strategy is presented as ensuring consistency and avoiding motion artifacts or semantic drift for arbitrary curved trajectories, but the concrete loss formulation, sampling schedule, or conditioning weight schedule is not specified; this leaves the load-bearing claim that AT reliably produces artifact-free output ungrounded in the provided equations or pseudocode.
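For readers unfamiliar with the metrics named in the first major comment, the standard definition of the Fréchet distance underlying FID is given below. The symbols are the usual ones and do not come from the manuscript: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of feature embeddings for real and generated samples.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

FVD applies the same distance to features from a video network instead of an image network, so it is sensitive to temporal as well as per-frame quality. Neither metric measures trajectory adherence directly.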
minor comments (2)
- The homepage link is given but the manuscript does not include a direct pointer to the released code or model weights, which would aid reproducibility.
- [Method] Notation for the three conditioning modalities (text, image, trajectory) should be introduced with explicit symbols in the method overview to improve clarity when describing the fusion step.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments. We address each major comment point by point below and outline the revisions we will make to strengthen the manuscript.
-
Referee: [Experiments] Experiments section: the abstract asserts experimental validation and superior performance, yet no quantitative metrics (e.g., FID, FVD, or user-study scores), dataset details, or ablation results on TS/MF/AT are provided in the summary; without these the central claim of effectiveness rests on an unverified assertion and requires explicit tables comparing against baselines on open-domain data.
Authors: We appreciate the referee highlighting the need for clearer quantitative support. The full manuscript contains experimental results including user studies demonstrating superior performance; however, to directly address this concern, we will add explicit tables in the revised version reporting quantitative metrics (such as FID and FVD where relevant), dataset details, and ablation studies isolating the contributions of TS, MF, and AT, with direct comparisons to baselines on open-domain data. revision: yes
-
Referee: [Method] Method section, description of Adaptive Training (AT): the strategy is presented as ensuring consistency and avoiding motion artifacts or semantic drift for arbitrary curved trajectories, but the concrete loss formulation, sampling schedule, or conditioning weight schedule is not specified; this leaves the load-bearing claim that AT reliably produces artifact-free output ungrounded in the provided equations or pseudocode.
Authors: We agree that the Adaptive Training description would benefit from greater specificity. In the revised manuscript, we will include the concrete loss formulation for AT, along with the sampling schedule and conditioning weight schedule. These additions will better substantiate the claims regarding consistency and the avoidance of motion artifacts or semantic drift for arbitrary trajectories. revision: yes
Circularity Check
No significant circularity detected
The paper proposes DragNUWA as a diffusion-based architecture that adds text, image, and trajectory conditioning, along with three explicitly defined new components (Trajectory Sampler for open-domain paths, Multiscale Fusion for granularity, and Adaptive Training for consistency). These elements are introduced to address stated limitations in prior work and are described directly in the method without any equations or claims that reduce the performance gains to a fitted parameter, self-definition, or self-citation chain. The derivation remains self-contained as a coherent extension of standard diffusion conditioning, with validation left to experiments rather than internal reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- diffusion conditioning weights for text/image/trajectory
axioms (1)
- domain assumption: Diffusion models can be jointly conditioned on semantic, spatial, and temporal signals without destructive interference.
invented entities (3)
-
Trajectory Sampler (TS)
no independent evidence
-
Multiscale Fusion (MF)
no independent evidence
-
Adaptive Training (AT) strategy
no independent evidence
Forward citations
Cited by 23 Pith papers
-
Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls
Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency sup...
-
Functionalization via Structure Completion and Motion Rectification
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture wi...
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
-
Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency
Eulerian adjacent-frame motion fields with bidirectional cycle consistency checks enable faster parallel training and fewer artifacts in diffusion model image animation compared to initial-frame Lagrangian guidance.
-
Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency
Eulerian adjacent-frame motion guidance plus bidirectional geometric consistency improves training speed, temporal coherence, and artifact reduction in diffusion-based image animation.
-
Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting
Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.
-
MoRight: Motion Control Done Right
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
-
ASTRA: Let Arbitrary Subjects Transform in Video Editing
ASTRA is a plug-and-play training-free method for precise multi-subject video editing that uses prompt-guided multimodal alignment and prior-based mask retargeting to avoid attention dilution and boundary issues.
-
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.
-
ReactiveGWM: Steering NPC in Reactive Game World Models
ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.
-
Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency
Eulerian adjacent-frame motion guidance plus bidirectional geometric consistency yields faster training and more coherent diffusion-based image animation than first-frame reference methods.
-
Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing
A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from s...
-
PhyCo: Learning Controllable Physical Priors for Generative Motion
PhyCo adds continuous physical control to video diffusion models via physics-supervised fine-tuning on a large simulation dataset and VLM-guided rewards, yielding measurable gains in physical realism on the Physics-IQ...
-
DailyArt: Discovering Articulation from Single Static Images via Latent Dynamics
DailyArt recovers full joint parameters of articulated objects from a single static image by synthesizing an opened state and comparing discrepancies, supporting downstream part-level novel state synthesis.
-
HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
HVG-3D uses a 3D-aware diffusion architecture with ControlNet to synthesize high-fidelity hand-object interaction videos from 3D control signals, achieving state-of-the-art spatial fidelity and temporal coherence on t...
-
Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models
A feed-forward video latent transformer that predicts time-varying 3D Gaussian primitives from one image to produce controllable 4D scenes with appearance, geometry, and motion.
-
ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.
-
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
-
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.
-
Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence
Tora3 uses shared object trajectories as kinematic priors to jointly guide visual motion and acoustic events in audio-video generation, improving realism and synchronization.
-
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
I2VGen-XL applies cascaded diffusion models with a base stage for semantic preservation via hierarchical encoders and a refinement stage for detail and resolution, trained on 35 million text-video and 6 billion text-i...
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.