arxiv: 2605.17543 · v2 · pith:M2KAJKZXnew · submitted 2026-05-17 · 💻 cs.CV · cs.GR

HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos

Pith reviewed 2026-05-20 13:28 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords video outpainting · coarse-to-fine · high-resolution video · long video sequences · global coarse guidance · frame swapping · spatio-temporal consistency · spatial extrapolation

The pith

Separating global structure modeling from fine-grained synthesis enables stable coherent outpainting for large spatial expansions in long video sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework for extending video content beyond original frame boundaries while preserving detail and consistency over long durations. It builds a low-resolution Global Coarse Guidance first, using a swapping process between distant global keyframes and nearby local frames to capture both overall structure and motion patterns. This guidance then directs a second stage that synthesizes high-resolution details. The separation prevents the drift and artifacts that usually appear when trying to handle wide spatial growth and extended sequences at once. Readers would care if this approach makes reliable adaptation of videos to new aspect ratios or longer formats practical.

Core claim

By constructing Global Coarse Guidance as a low-resolution representation through a novel global-local frame swapping mechanism that couples sparse global keyframes with local temporal windows, the method encodes both long-term structural consistency and short-term temporal dynamics in a unified way. This representation then guides the high-resolution outpainting stage to produce spatially detailed and temporally consistent content, achieving stable generation for large spatial expansion and long video sequences.

What carries the argument

Global Coarse Guidance (GCG), a low-resolution video representation constructed via global-local frame swapping that couples sparse global keyframes with local temporal windows to encode structure and motion for guiding later synthesis.
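
As a rough illustration of how sparse global keyframes and a local temporal window might be indexed and coupled, the sketch below picks evenly spaced keyframes, builds a window around one of them, and swaps the entries that both sets share. The function names, the even spacing rule, the window radius, and the swap-on-shared-index rule are all assumptions made for illustration; this page does not specify the authors' actual procedure.

```python
def global_keyframes(num_frames: int, num_keyframes: int) -> list[int]:
    """Evenly spaced frame indices spanning the whole video (assumed rule)."""
    if num_keyframes < 2 or num_frames < num_keyframes:
        raise ValueError("need at least 2 keyframes and enough frames")
    step = (num_frames - 1) / (num_keyframes - 1)
    return [round(i * step) for i in range(num_keyframes)]


def local_window(center: int, radius: int, num_frames: int) -> list[int]:
    """Contiguous frame indices around `center`, clipped to the video."""
    lo = max(0, center - radius)
    hi = min(num_frames - 1, center + radius)
    return list(range(lo, hi + 1))


def swap_shared(global_latents: dict, local_latents: dict) -> None:
    """Exchange, in place, the entries at frame indices present in both sets.

    Hypothetical stand-in for the information exchange during sampling:
    each branch receives the other branch's current estimate of a shared frame.
    """
    for idx in global_latents.keys() & local_latents.keys():
        global_latents[idx], local_latents[idx] = (
            local_latents[idx],
            global_latents[idx],
        )
```

For a 97-frame video with 5 keyframes, `global_keyframes(97, 5)` gives indices 0, 24, 48, 72, 96, and a radius-2 window centered at frame 48 shares exactly that one frame with the keyframe set, so a single swap couples the two branches at that point.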

If this is right

  • The two-stage separation supports coherent results even when spatial expansion is large and sequences are long.
  • GCG provides a unified low-resolution encoding that maintains both distant structural consistency and nearby motion continuity.
  • High-resolution synthesis guided by GCG avoids the global inconsistency problems seen in direct high-resolution approaches.
  • The framework outperforms prior methods on challenging cases that combine wide extrapolation with extended video lengths.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same coarse guidance idea could be tested on video tasks such as future frame prediction where long-range consistency is also needed.
  • Removing the swapping mechanism and measuring the rise in artifacts would quantify how much of the performance depends on that specific construction step.
  • Applying the global-local exchange idea to image outpainting might help with large single-frame extrapolations that lack temporal cues.

Load-bearing premise

The global-local frame swapping mechanism in building Global Coarse Guidance encodes long-term structural consistency and short-term temporal dynamics without introducing artifacts that propagate into the high-resolution stage.

What would settle it

Outpainting the same long sequences with large spatial expansion both with and without the frame swapping step in GCG construction, then checking whether the no-swapping version produces visibly more temporal drift or structural inconsistencies, would test whether that mechanism is necessary for the claimed stability.
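
One crude way to compare the with-swapping and no-swapping runs is sketched below. It assumes each frame has already been cropped to an outpainted region that should stay static (for example, background) and flattened to a list of pixel intensities. Both measures are simple proxies chosen for illustration, not metrics taken from the paper.

```python
def mean_abs_diff(a: list[float], b: list[float]) -> float:
    """Mean absolute per-pixel difference between two flattened frames."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def drift_from_first(frames: list[list[float]]) -> list[float]:
    """Distance of every frame from the first one; a rising curve suggests drift."""
    ref = frames[0]
    return [mean_abs_diff(ref, f) for f in frames]


def flicker(frames: list[list[float]]) -> float:
    """Average change between consecutive frames (short-term instability)."""
    diffs = [mean_abs_diff(p, c) for p, c in zip(frames, frames[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0
```

If the swapping step matters as claimed, the no-swapping run would be expected to show a steeper `drift_from_first` curve on the same sequence, while `flicker` separates slow drift from frame-to-frame jitter.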

Figures

Figures reproduced from arXiv: 2605.17543 by Geonung Kim, Hyun-Seung Lee, Janghyeok Han, Jeongeun Park, Kyuha Choi, Sunghyun Cho, Youngseok Han.

Figure 1: HL-OutPaint handles long-range outpainting (top) and high-resolution outpainting (middle), and outperforms existing state-of-the-art methods, including M3DDM [Fan et al. 2023], MOTIA [Wang et al. 2024], Infinite-Canvas [Chen et al. 2025], and VACE [Jiang et al. 2025] (bottom). The yellow dashed boxes indicate the original regions before outpainting. The input videos are from the DAVIS dataset [Pont-Tuset e…

Figure 2: Overall framework of proposed HL-OutPaint. (a) HL-OutPaint consists of two stages: Global Coarse Guidance Construction and GCG-Guided Video …

Figure 3: Qualitative comparison on the DAVIS [Pont-Tuset et al …

Figure 4: Qualitative comparison on the Long-Video dataset with outpainting expansion of …

Figure 5: Qualitative comparison on the Short-Form dataset with an outpainting expansion of …

Figure 7: Outpainting results using GCG compressed along different spatial …

Figure 6: Sparse keyframes and the local temporal window centered at the …

Figure 8: Qualitative comparison between (a) bicubic upsampling of the temporally completed low-resolution video …

Figure 9: Failure case under extreme spatial expansion (512×512 → 5760×5760). The input is heavily downsampled (e.g., to 768×768) during GCG construction, causing significant loss of high-frequency details. While the original regions are restored during refinement due to strong conditioning, the outpainted regions fail to recover fine details, resulting in blurry structures. The input videos are from the DAVIS data…
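
The arithmetic behind the Figure 9 failure case can be worked through with a small helper. This sketch assumes that the 768×768 figure refers to the full outpainted canvas at coarse resolution (the caption is ambiguous on this point); the function name and dictionary keys are hypothetical.

```python
def coarse_footprint(src: int, target: int, coarse: int) -> dict[str, float]:
    """Scale factors for a square src x src input outpainted to target x target,
    when the whole target canvas is represented at coarse x coarse (assumed)."""
    downscale = target / coarse
    return {
        "expansion_per_side": target / src,
        "downscale_factor": downscale,
        # Side length the original input occupies inside the coarse canvas.
        "original_side_in_coarse_px": src / downscale,
        # Share of the final canvas area that comes from real input pixels.
        "original_area_fraction": (src / target) ** 2,
    }


if __name__ == "__main__":
    for key, value in coarse_footprint(512, 5760, 768).items():
        print(f"{key}: {value:.4f}")
```

Under this reading, the expansion is 11.25× per side, the coarse stage works at a 7.5× reduction, and the 512-pixel input shrinks to roughly 68 pixels per side while covering under 1% of the final canvas area, which is consistent with the caption's point about lost high-frequency detail.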
Original abstract

Video outpainting generates plausible visual content beyond the original spatial extent of a video, playing a key role in adapting videos to diverse display formats. To support such use cases, it must enable large spatial extrapolation over long sequences. However, most existing methods address only one of these challenges or lack explicit mechanisms for ensuring global spatio-temporal consistency, leading to notable limitations. In this paper, we propose HL-OutPaint, a high-resolution video outpainting framework for long sequences. Our approach follows a coarse-to-fine strategy with a two-stage pipeline. We first construct Global Coarse Guidance (GCG), a low-resolution representation that captures global structure and dominant motion across the video. Unlike naive downsampling, GCG is built via a novel global-local frame swapping mechanism that couples sparse global keyframes with local temporal windows and exchanges information during sampling. This enables GCG to encode both long-term structural consistency and short-term temporal dynamics in a unified representation. Guided by this representation, HL-OutPaint then performs high-resolution outpainting to generate spatially detailed and temporally consistent content. By separating global structure modeling from fine-grained synthesis, our framework achieves stable, coherent generation for large spatial expansion and long video sequences. Extensive experiments show that HL-OutPaint outperforms existing methods in challenging scenarios involving wide spatial extrapolation and long video sequences.
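
To make the task setup concrete, the sketch below places one frame on a larger canvas and builds the mask that separates known pixels from the region an outpainting model would have to fill. Centered placement, the fill value, and the nested-list frame representation are assumptions for illustration only; the abstract does not say how HL-OutPaint positions the input.

```python
def place_on_canvas(frame, out_h: int, out_w: int, fill=0):
    """Center `frame` (a list of pixel rows) on an out_h x out_w canvas.

    Returns (canvas, mask), where mask is 1 at known pixels and 0 where
    content has to be generated.
    """
    in_h, in_w = len(frame), len(frame[0])
    if out_h < in_h or out_w < in_w:
        raise ValueError("canvas must be at least as large as the frame")
    top = (out_h - in_h) // 2
    left = (out_w - in_w) // 2
    canvas = [[fill] * out_w for _ in range(out_h)]
    mask = [[0] * out_w for _ in range(out_h)]
    for y in range(in_h):
        for x in range(in_w):
            canvas[top + y][left + x] = frame[y][x]
            mask[top + y][left + x] = 1
    return canvas, mask
```

Applying the same placement to every frame of a video yields the conditioning input for outpainting; the fraction of zeros in the mask grows quickly with the expansion ratio, which is why large expansions leave so little known content to anchor the result.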

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes HL-OutPaint, a coarse-to-fine two-stage framework for high-resolution video outpainting on long sequences. It first builds Global Coarse Guidance (GCG) via a novel global-local frame swapping mechanism that couples sparse global keyframes with local temporal windows during sampling, then uses this low-resolution representation to guide spatially detailed and temporally coherent high-resolution synthesis. The central claim is that separating global structure modeling from fine-grained synthesis enables stable outpainting for large spatial expansions and extended video lengths, outperforming prior methods.

Significance. If the GCG construction and separation premise hold under empirical scrutiny, the work could meaningfully advance video outpainting by addressing the combined challenges of wide spatial extrapolation and long-range temporal consistency that most existing approaches handle only partially. The coarse-to-fine strategy is a clear conceptual strength; credit is given for the explicit mechanism to encode both long-term structure and short-term dynamics in a unified low-res representation.

major comments (2)
  1. [Abstract and §3 (GCG construction)] The claim that the global-local frame swapping successfully encodes long-term structural consistency and short-term temporal dynamics without residual inconsistencies that propagate to the high-resolution stage is load-bearing for the separation premise. The manuscript should provide a concrete analysis (e.g., via temporal alignment metrics or drift measurements) showing that information exchange during sampling is symmetric and constrained enough to prevent uncorrectable artifacts, as any drift would directly undermine the fine stage's ability to resolve it.
  2. [§4 (Experiments)] The abstract asserts outperformance in wide spatial extrapolation and long sequences, yet the provided description contains no quantitative results, ablation studies on the frame-swapping component, or error analysis. Tables reporting metrics across varying expansion ratios and sequence lengths are needed to substantiate that the GCG stage does not introduce uncorrectable inconsistencies.
minor comments (1)
  1. A diagram or pseudocode for the global-local frame swapping process would improve clarity of the information exchange during sampling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation of the GCG mechanism and the supporting experiments.

Point-by-point responses
  1. Referee: [Abstract and §3 (GCG construction)] The claim that the global-local frame swapping successfully encodes long-term structural consistency and short-term temporal dynamics without residual inconsistencies that propagate to the high-resolution stage is load-bearing for the separation premise. The manuscript should provide a concrete analysis (e.g., via temporal alignment metrics or drift measurements) showing that information exchange during sampling is symmetric and constrained enough to prevent uncorrectable artifacts, as any drift would directly undermine the fine stage's ability to resolve it.

    Authors: We agree that explicit quantitative validation of the symmetry and bounded drift in the global-local frame swapping is important to support the separation premise. The mechanism alternates information exchange between sparse global keyframes and local temporal windows in a balanced, iterative manner during sampling, which is intended to keep inconsistencies minimal and correctable. In the revised manuscript we will add a dedicated analysis subsection in §3 that reports temporal alignment metrics (optical-flow consistency and keyframe drift) and drift measurements over extended sequences, confirming that residual inconsistencies remain within bounds that the high-resolution stage can resolve. revision: yes

  2. Referee: [§4 (Experiments)] The abstract asserts outperformance in wide spatial extrapolation and long sequences, yet the provided description contains no quantitative results, ablation studies on the frame-swapping component, or error analysis. Tables reporting metrics across varying expansion ratios and sequence lengths are needed to substantiate that the GCG stage does not introduce uncorrectable inconsistencies.

    Authors: We acknowledge that the experimental section must more clearly demonstrate the benefits of the GCG stage across the claimed regimes. The current manuscript contains quantitative comparisons, but we will expand §4 with additional tables that report PSNR, SSIM, and temporal consistency scores for multiple spatial expansion ratios (2×, 4×, 8×) and sequence lengths (up to several hundred frames). We will also include targeted ablations isolating the frame-swapping component together with error analysis showing how GCG-guided synthesis reduces drift relative to baselines. revision: yes
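
Of the metrics named in the rebuttal, PSNR is the simplest to state precisely. The sketch below computes it for two flattened frames using the standard definition, 10·log10(MAX² / MSE). The function name and the 8-bit default peak value are assumptions, and nothing here reflects how the authors actually run their evaluation.

```python
import math


def psnr(reference: list[float], test: list[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened frames."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)
```

For a table of the kind the referee asks for, this would be averaged over frames and reported separately for each expansion ratio and sequence length, ideally restricted to regions where ground truth exists.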

Circularity Check

0 steps flagged

No circularity; proposed two-stage mechanism is self-contained architectural design

full rationale

The paper describes a coarse-to-fine pipeline that first builds Global Coarse Guidance via a novel global-local frame swapping mechanism to capture long-term structure and short-term dynamics, then uses it to guide high-resolution outpainting. No equations, fitted parameters, or self-citations appear in the provided text that would reduce any claimed result to its own inputs by construction. The separation of global modeling from fine synthesis is presented as an explicit design choice rather than a derived equivalence or renamed empirical pattern, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the untested assumption that the frame-swapping procedure produces a guidance signal that is both globally consistent and locally dynamic; no explicit free parameters, axioms, or invented entities are named in the abstract.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 10 internal anchors

  1. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. arXiv preprint arXiv:2108.01073.
  2. LAION-AI. 2023.
  3. Learning transferable visual models from natural language supervision. International Conference on Machine Learning, 2021.
  4. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv preprint arXiv:2203.03605.
  5. Ca2-VDM: Efficient autoregressive video diffusion model with causal generation and cache sharing. arXiv preprint arXiv:2411.16375.
  6. From slow bidirectional to fast autoregressive video diffusion models. Proceedings of the Computer Vision and Pattern Recognition Conference.
  7. Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion. arXiv preprint arXiv:2506.08009.
  8. Generative pre-trained autoregressive diffusion transformer. arXiv preprint arXiv:2505.07344.
  9. Progressive autoregressive video diffusion models. Proceedings of the Computer Vision and Pattern Recognition Conference.
  10. NUWA-Infinity: Autoregressive over autoregressive generation for infinite visual synthesis. Advances in Neural Information Processing Systems.
  11. Video pixel networks. International Conference on Machine Learning, 2017.
  12. VideoGPT: Video Generation using VQ-VAE and Transformers. arXiv preprint arXiv:2104.10157.
  13. Phenaki: Variable Length Video Generation From Open Domain Textual Description. arXiv preprint arXiv:2210.02399.
  14. Ultrawide foveated video extrapolation. IEEE Journal of Selected Topics in Signal Processing, 2010.
  15. Multiscale ultrawide foveated video extrapolation. 2011 IEEE International Conference on Computational Photography (ICCP), 2011.
  16. Exploiting diffusion prior for real-world image super-resolution. International Journal of Computer Vision, 2024.
  17. Conditional image generation with PixelCNN decoders. Advances in Neural Information Processing Systems.
  18. Complete and temporally consistent video outpainting. Dehan, Loïc; Van Ranst, Wiebe; Vandewalle, Patrick; Goedemé, Toon.
  19. Hierarchical Masked 3D Diffusion Model for Video Outpainting. Proceedings of the 31st ACM International Conference on Multimedia.
  20. Wang, Fu-Yun; Wu, Xiaoshi; Huang, Zhaoyang; Shi, Xiaoyu; Shen, Dazhong; Song, Guanglu; Liu, Yu; Li, Hongsheng. 2024. doi:10.1007/978-3-031-72784-9_9.
  21. Infinite-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/aaai.v39i2.32213.
  22. VACE: All-in-One Video Creation and Editing. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  23. Unboxed: Geometrically and Temporally Consistent Video Outpainting. Yu, Zhongrui; Megaro-Boldini, Martina; Sumner, Robert W.; Djelouah, Abdelaziz.
  24. Progressive Artwork Outpainting via Latent Diffusion Models. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
  25. OutDreamer: Video Outpainting with a Diffusion Transformer. 2025.
  26. Unified Long Video Inpainting and Outpainting via Overlapping High-Order Co-Denoising. Lyu, Shuangquan; Mao, Steven; Ma, Yue.
  27. Denoising Diffusion Probabilistic Models. 2020.
  28. Denoising Diffusion Implicit Models. 2022.
  29. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. arXiv preprint arXiv:2209.03003.
  30. Improved Denoising Diffusion Probabilistic Models. Proceedings of the 38th International Conference on Machine Learning, 2021.
  31. Score-Based Generative Modeling through Stochastic Differential Equations. 2021.
  32. Scalable Diffusion Models with Transformers. 2023.
  33. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2021.
  34. InOut: Diverse Image Outpainting via GAN Inversion. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  35. A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions. Yu, Hang; Li, Ruilin; Xie, Shaorong; Qiu, Jiayan. 2024. doi:10.1109/CVPR52733.2024.00750.
  36. Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach. The Twelfth International Conference on Learning Representations.
  37. The 2017 DAVIS Challenge on Video Object Segmentation. Pont-Tuset et al. CoRR, 2017. arXiv:1704.00675.
  38. Xu, Ning; Yang, Linjie; Fan, Yuchen; Yang, Jianchao; Yue, Dingcheng; Liang, Yuchen; Price, Brian; Cohen, Scott; Huang, Thomas. 2018. doi:10.1007/978-3-030-01228-1_36.
  39. Pexels. 2026.
  40. Towards Accurate Generative Models of Video: A New Metric & Challenges. ArXiv.
  41. Image quality assessment: from error visibility to structural similarity. Zhou Wang; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P.
  42. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. CVPR.
  43. Huang, Ziqi; He, Yinan; Yu, Jiashuo; Zhang, Fan; Si, Chenyang; Jiang, Yuming; Zhang, Yuanhan; Wu, Tianxing; Jin, Qingyang; Chanpaisit, Nattapol; Wang, Yaohui; Chen, Xinyuan; Wang, Limin; Lin, Dahua; Qiao, Yu; Liu, Ziwei.
  44. Edward J Hu; Yelong Shen; Phillip Wallis; Zeyuan Allen-Zhu; Yuanzhi Li; Shean Wang; Lu Wang; Weizhu Chen. Lo. 2022.
  45. Video extrapolation using neighboring frames. Sangwoo Lee; Jungjin Lee; Bumki Kim; Kyehyun Kim; Junyong Noh. ACM Transactions on Graphics, 2019. doi:10.1145/3196492.
  46. Bhadoriya, Shailendra; Aggarwal, Nainish; Jain, Udit; Jaiswal, Hrithik. International Journal of Computer Trends and Technology.
  47. MaskGIT: Masked Generative Image Transformer. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  48. Lin, Han; Pagnucco, Maurice; Song, Yang. 2021. doi:10.1109/CVPRW53098.2021.00090.
  49. Hierarchical Text-Conditional Image Generation with CLIP Latents. 2022.
  50. Palette: Image-to-Image Diffusion Models. ACM SIGGRAPH 2022 Conference Proceedings.
  51. Wan: Open and Advanced Large-Scale Video Generative Models. arXiv preprint arXiv:2503.20314.
  52. Geonung Kim; Janghyeok Han; Sunghyun Cho. SIGGRAPH Asia 2025 Conference Papers (SA Conference Papers '25). doi:10.1145/3757377.3763871.
  53. Han, Janghyeok; Sim, Gyujin; Kim, Geonung; Lee, Hyun-Seung; Choi, Kyuha; Han, Youngseok; Cho, Sunghyun. 2025. doi:10.1145/3721238.3730719.
  54. Ryu, Nuri; Won, Jiyun; Son, Jooeun; Gong, Minsu; Lee, Joo-Haeng; Cho, Sunghyun. 2025. doi:10.1145/3721238.3730701.
  55. MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation. arXiv preprint arXiv:2302.08113.
  56. SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions. Thirty-seventh Conference on Neural Information Processing Systems.
  57. DemoFusion: Democratising High-Resolution Image Generation With No $$$. CVPR.
  58. Zhou, Shangchen; Yang, Peiqing; Wang, Jianyi; Luo, Yihang; Loy, Chen Change.
  59. Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution. European Conference on Computer Vision (ECCV).
  60. Nguyen, Trong-Tung; Nguyen, Quang; Nguyen, Khoi; Tran, Anh; Pham, Cuong. Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025.
  61. VideoDirector: Precise video editing via text-to-video models. Proceedings of the Computer Vision and Pattern Recognition Conference.
  62. StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text. arXiv preprint arXiv:2403.14773.
  63. FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors. 2025.
  64. Li, Quanhao; Xing, Zhen; Wang, Rui; Zhang, Hui; Dai, Qi; Wu, Zuxuan. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
  65. Li, Na; Li, Zihao; Tang, Zuoli; Yu, Yuqing; Zou, Lixin; Li, Chenliang. 2025. doi:10.1145/3746027.3755278.
  66. VIP: Versatile Image Outpainting Empowered by Multimodal Large Language Model. 2024.
  67. Goodfellow, Ian J.; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua. Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, 2014.
  68. Chen, Mark; Radford, Alec; Child, Rewon; Wu, Jeff; Jun, Heewoo; Luan, David; Sutskever, Ilya. Proceedings of the 37th International Conference on Machine Learning, 2020.
  69. Bertalmio, Marcelo; Sapiro, Guillermo; Caselles, Vincent; Ballester, Coloma. 2000. doi:10.1145/344779.344972.
  70. Pathak, Deepak; Krähenbühl, Philipp; Donahue, Jeff; Darrell, Trevor; Efros, Alexei.
  71. OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation. arXiv preprint arXiv:2407.02371.
  72. Nah, Seungjun; Baik, Sungyong; Hong, Seokil; Moon, Gyeongsik; Son, Sanghyun; Timofte, Radu; Lee, Kyoung Mu. CVPR Workshops.
  73. UltraGen: High-Resolution Video Generation with Hierarchical Attention. 2025.
  74. SkyReels-V2: Infinite-length Film Generative Model. 2025.
  75. Generative Pre-trained Autoregressive Diffusion Transformer. 2025.
  76. Guilin Liu; Fitsum A. Reda; Kevin J. Shih; Ting-Chun Wang; Andrew Tao; Bryan Catanzaro. The European Conference on Computer Vision (ECCV).
  77. Generative Image Inpainting with Contextual Attention. arXiv preprint arXiv:1801.07892.
  78. Free-Form Image Inpainting with Gated Convolution. 2019.
  79. RePaint: Inpainting using Denoising Diffusion Probabilistic Models. Lugmayr, Andreas; Danelljan, Martin; Romero, Andres; Yu, Fisher; Timofte, Radu; Van Gool, Luc.
  80. High-Resolution Image Synthesis with Latent Diffusion Models. 2022.

Showing first 80 references.