PhotoQuilt: Training-Free Arbitrary-Resolution Photomosaics via Bootstrapped Tiled Denoising

Andrew Fleet; Babak Taati; Javad Rajabi; Koorosh Roohi

arxiv: 2606.30968 · v1 · pith:UDPWWAUGnew · submitted 2026-06-29 · 💻 cs.CV

Pith reviewed 2026-07-01 01:23 UTC · model grok-4.3

classification 💻 cs.CV

keywords photomosaics · diffusion models · tiled denoising · training-free · arbitrary resolution · latent space · image generation · global structure

The pith

PhotoQuilt generates arbitrary-resolution photomosaics without training by bootstrapping a low-resolution layout into tiled denoising.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Photomosaics need both local tiles that stand alone as convincing images and a global layout that binds them into one scene. Diffusion models lose the global structure when they tile at high resolution, and they become too smooth or too expensive when they generate the whole canvas at once. The method first creates a low-resolution global composition to fix the overall arrangement, then upscales that composition in latent space and re-injects noise before denoising proceeds independently inside fixed tiles. Each tile can therefore develop its own detail while the shared structure keeps the mosaic coherent. Because the tiles are handled separately, the process scales to large canvases without quadratic attention cost and without extra training.

Core claim

PhotoQuilt resolves this with a bootstrapped tiled denoising procedure. We first produce a global composition at low resolution to fix the layout, then upscale it in latent space and re-inject noise to restore generative capacity. Denoising proceeds within fixed tiles, so each forms its own image while the shared global structure holds them in one layout. Because tile generation is handled separately, PhotoQuilt scales to large canvases without quadratic attention cost.

What carries the argument

Bootstrapped tiled denoising procedure that creates a low-resolution global layout, upscales it in latent space, re-injects noise, and then performs independent per-tile denoising.
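
The four stages named above can be sketched as a toy in plain Python. This is a minimal illustration under loud assumptions, not the authors' implementation: a 2D grid of floats stands in for the latent, nearest-neighbour repetition stands in for latent upscaling, a linear blend with Gaussian noise stands in for noise re-injection, and a tile-local averaging filter stands in for the diffusion denoiser. Every function name and parameter here is hypothetical.

```python
import random


def upscale_nearest(grid, factor):
    """Nearest-neighbour upscaling of a 2D grid (stand-in for latent upscaling)."""
    return [[v for v in row for _ in range(factor)]
            for row in grid for _ in range(factor)]


def reinject_noise(grid, strength, rng):
    """Blend every value with fresh Gaussian noise: (1 - s) * x + s * eps (assumed form)."""
    return [[(1.0 - strength) * v + strength * rng.gauss(0.0, 1.0) for v in row]
            for row in grid]


def denoise_tile(tile, steps):
    """Toy denoiser: repeated neighbour averaging that never reads outside the tile."""
    h, w = len(tile), len(tile[0])
    for _ in range(steps):
        nxt = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                neigh = [tile[y][x]]
                if y > 0:
                    neigh.append(tile[y - 1][x])
                if y < h - 1:
                    neigh.append(tile[y + 1][x])
                if x > 0:
                    neigh.append(tile[y][x - 1])
                if x < w - 1:
                    neigh.append(tile[y][x + 1])
                nxt[y][x] = sum(neigh) / len(neigh)
        tile = nxt
    return tile


def tiled_denoise(canvas, tile_size, steps):
    """Denoise fixed, non-overlapping tiles independently and paste them back."""
    out = [row[:] for row in canvas]
    for ty in range(0, len(canvas), tile_size):
        for tx in range(0, len(canvas[0]), tile_size):
            tile = [row[tx:tx + tile_size] for row in canvas[ty:ty + tile_size]]
            for dy, row in enumerate(denoise_tile(tile, steps)):
                out[ty + dy][tx:tx + len(row)] = row
    return out


def bootstrap_mosaic(layout, factor, tile_size, strength, steps, seed=0):
    """Low-res layout -> upscale -> re-inject noise -> independent per-tile denoising."""
    rng = random.Random(seed)
    noisy = reinject_noise(upscale_nearest(layout, factor), strength, rng)
    return tiled_denoise(noisy, tile_size, steps)


# Usage: a 2x2 layout becomes an 8x8 canvas made of four 4x4 tiles.
layout = [[0.0, 1.0], [1.0, 0.0]]
mosaic = bootstrap_mosaic(layout, factor=4, tile_size=4, strength=0.6, steps=3)
```

The strength of 0.6 echoes the "s = 0.6" in the Figure 5 caption, but reading s as the re-injection strength is itself an assumption. The one property the toy does reproduce is that changing the input of one tile cannot change the output of another, because `denoise_tile` only ever sees its own tile.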

If this is right

The method produces photomosaics at any resolution while preserving both global structure and local realism.
No training or fine-tuning steps are required beyond a standard diffusion model.
Generation cost grows linearly with the number of tiles rather than quadratically with the number of canvas tokens.
The same low-resolution layout can be reused to generate multiple high-resolution variants of the same mosaic.
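
The cost claim in this list can be made concrete with a little arithmetic. The sketch below assumes that self-attention cost is proportional to the number of token pairs, that the canvas divides evenly into square tiles, and that each tile attends only within itself. The token counts are illustrative and are not taken from the paper; the function names are hypothetical.

```python
def attention_pairs_full(canvas_h, canvas_w):
    """Token pairs when one attention pass covers the whole canvas."""
    tokens = canvas_h * canvas_w
    return tokens * tokens


def attention_pairs_tiled(canvas_h, canvas_w, tile_side):
    """Token pairs when each square tile attends only within itself."""
    if canvas_h % tile_side or canvas_w % tile_side:
        raise ValueError("canvas must divide evenly into tiles")
    num_tiles = (canvas_h // tile_side) * (canvas_w // tile_side)
    tokens_per_tile = tile_side * tile_side
    return num_tiles * tokens_per_tile * tokens_per_tile


# Illustrative sizes in tokens, not pixels.
full = attention_pairs_full(64, 64)          # 16_777_216
tiled = attention_pairs_tiled(64, 64, 16)    # 1_048_576
ratio = full // tiled                        # 16, which equals the number of tiles
```

Under these assumptions the saving factor equals the tile count. Doubling the canvas side multiplies the full-canvas pair count by 16 but the tiled count only by 4, in step with the fourfold growth in tiles.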

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bootstrapping pattern could be tested on other diffusion tasks that require both coarse structure and fine local variation, such as large scene or texture synthesis.
If the noise re-injection step proves robust, it may reduce the need for specialized high-resolution training runs in other structured image generation settings.
The separation of global layout from tile denoising opens a route to parallel or distributed generation of very large canvases.
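
The parallel-generation idea in the last item can be sketched as follows. This is an illustration under assumptions, not something the paper describes: `process_tile` is a hypothetical placeholder for per-tile denoising, and threads are used only to keep the example self-contained. A real deployment would more plausibly dispatch tiles to separate processes or GPUs.

```python
from concurrent.futures import ThreadPoolExecutor


def split_tiles(canvas, tile_size):
    """Return (row_offset, col_offset, tile) for fixed, non-overlapping tiles."""
    tiles = []
    for ty in range(0, len(canvas), tile_size):
        for tx in range(0, len(canvas[0]), tile_size):
            tile = [row[tx:tx + tile_size] for row in canvas[ty:ty + tile_size]]
            tiles.append((ty, tx, tile))
    return tiles


def process_tile(tile):
    """Placeholder for per-tile denoising: pull each value halfway to the tile mean."""
    flat = [v for row in tile for v in row]
    mean = sum(flat) / len(flat)
    return [[0.5 * (v + mean) for v in row] for row in tile]


def run_tiles(canvas, tile_size, max_workers=None):
    """Process every tile, sequentially or in a thread pool, and reassemble the canvas."""
    jobs = split_tiles(canvas, tile_size)
    tiles = [tile for _, _, tile in jobs]
    if max_workers is None:
        results = [process_tile(t) for t in tiles]
    else:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(process_tile, tiles))  # map preserves input order
    out = [[0.0] * len(canvas[0]) for _ in canvas]
    for (ty, tx, _), tile in zip(jobs, results):
        for dy, row in enumerate(tile):
            out[ty + dy][tx:tx + len(row)] = row
    return out


canvas = [[float(y * 8 + x) for x in range(8)] for y in range(8)]
sequential = run_tiles(canvas, tile_size=4)
parallel = run_tiles(canvas, tile_size=4, max_workers=4)
```

Because no tile reads another tile's data, the sequential and parallel runs produce the same canvas. That independence is what would make distribution straightforward.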

Load-bearing premise

That upscaling the low-resolution global composition in latent space followed by noise re-injection will allow independent per-tile denoising to preserve the global layout without any additional training or fine-tuning steps.

What would settle it

Run the low-resolution global stage, upscale and re-inject noise, then complete per-tile denoising and check whether the final high-resolution tiles still match the original global layout within a small tolerance.
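
A minimal sketch of such a check, assuming the comparison is a mean absolute error between the low-resolution layout and a box-downsampled copy of the final canvas. The source does not define the tolerance or the metric, so both are placeholders here, and a real test would more likely use a structural or perceptual measure on images. The function names are hypothetical.

```python
def box_downsample(grid, factor):
    """Average each factor x factor block of a 2D grid into one value."""
    h, w = len(grid) // factor, len(grid[0]) // factor
    return [[sum(grid[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / (factor * factor)
             for x in range(w)]
            for y in range(h)]


def layout_error(layout, mosaic, factor):
    """Mean absolute difference between the layout and the downsampled mosaic."""
    small = box_downsample(mosaic, factor)
    diffs = [abs(a - b)
             for row_a, row_b in zip(layout, small)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def layout_preserved(layout, mosaic, factor, tolerance):
    """True when the downsampled mosaic stays within tolerance of the layout."""
    return layout_error(layout, mosaic, factor) <= tolerance


layout = [[0.0, 1.0], [1.0, 0.0]]
faithful = [[0.0, 0.0, 1.0, 1.0],
            [0.0, 0.0, 1.0, 1.0],
            [1.0, 1.0, 0.0, 0.0],
            [1.0, 1.0, 0.0, 0.0]]
inverted = [[1.0 - v for v in row] for row in faithful]

ok = layout_preserved(layout, faithful, factor=2, tolerance=0.05)
drifted = layout_preserved(layout, inverted, factor=2, tolerance=0.05)
```

Here `ok` is true and `drifted` is false. Running the same comparison across several seeds and re-injection strengths would show how quickly the layout degrades as more noise is added back.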

Figures

Figures reproduced from arXiv: 2606.30968 by Andrew Fleet, Babak Taati, Javad Rajabi, Koorosh Roohi.

**Figure 1.** High-resolution photomosaics generated across different target images and tile settings. Best viewed zoomed in.

**Figure 2.** The PhotoQuilt method pipeline. Pixel space represen…

**Figure 3.** Photomosaic (12k x 6k) generated from a base image (4k…

**Figure 4.** Qualitative comparison of photomosaic generation. Compared to baselines, PhotoQuilt better preserves the global target structure

**Figure 5.** The default configuration (s = 0.6, 768 × 768) provides the optimal balance between global layout fidelity and local tile realism compared to the ablated variants

Original abstract

Photomosaics are large images whose local regions are seen as independent tiles while their overall arrangement forms a coherent scene. Generating them at high resolution, with every tile convincing in its own right, is computationally expensive, since the canvas must hold many detailed tiles at once. We present PhotoQuilt, a training-free framework that generates photomosaics at arbitrary resolution. Diffusion models struggle to satisfy both scales at once, as direct high-resolution generation is costly and tends toward one smooth image rather than a mosaic, while patch-based tiling keeps local detail but loses global structure. PhotoQuilt resolves this with a bootstrapped tiled denoising procedure. We first produce a global composition at low resolution to fix the layout, then upscale it in latent space and re-inject noise to restore generative capacity. Denoising proceeds within fixed tiles, so each forms its own image while the shared global structure holds them in one layout. Because tile generation is handled separately, PhotoQuilt scales to large canvases without quadratic attention cost. Experiments show that PhotoQuilt outperforms current baselines on both global structure and local realism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PhotoQuilt's low-res global layout plus latent upscale and noise re-injection is a clean engineering split, but the independent tile denoising step looks likely to loosen the global constraints once noise is added back in.

The core move here is to generate a low-resolution global composition first, upscale that latent, re-inject noise to keep generative capacity, then run denoising separately inside fixed non-overlapping tiles. That separation lets the method scale without quadratic attention cost and avoids the usual diffusion tendency to smooth everything into one image.

What stands out as new is the explicit bootstrapped workflow that treats the upscaled latent as a fixed prior while allowing each tile to form its own coherent image. The paper positions this as training-free and resolution-agnostic, which is a practical distinction from methods that either train on high-res data or rely on overlapping patches with extra consistency losses.

The approach does handle the scale conflict in a straightforward way. Low-res layout fixing followed by per-tile generation is a reasonable division of labor, and the claim that it produces both better global structure and local realism than baselines is at least plausible on the surface.

The soft spot is exactly the one the stress-test flags. After noise is re-injected, nothing in the described procedure (no overlap, no cross-tile attention, no iterative consistency step) forces the independent denoising trajectories to stay anchored to the upscaled prior. If the latent signal is only a weak guide, stochastic variation inside tiles can shift seams, object placement, or overall layout while still looking locally realistic. The abstract asserts outperformance on both axes but supplies no metrics, baselines, or ablation details, so it is impossible to tell whether the experiments actually close this gap.

This is the kind of paper that belongs in a computer vision reading group focused on practical diffusion scaling tricks. Readers working on tiled or high-resolution synthesis would get value from the workflow even if the global-preservation claim needs more evidence. It deserves a serious referee to check whether the experiments demonstrate that the upscaled latent actually dominates after noise re-injection.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes PhotoQuilt, a training-free framework for arbitrary-resolution photomosaics with diffusion models. It first generates a low-resolution global composition to fix layout, upsamples the latent, re-injects noise to restore capacity, then runs independent denoising inside fixed non-overlapping tiles so each tile forms its own image while the shared global structure is claimed to hold the layout. The method is positioned as avoiding quadratic attention costs of full-canvas generation and as outperforming baselines on global structure and local realism.

Significance. If the upscaled latent successfully constrains independent tile trajectories after noise re-injection, the approach would provide a practical route to scalable, training-free generation of large structured images. The training-free and arbitrary-resolution aspects are genuine strengths that address real computational bottlenecks in diffusion models for mosaic-style outputs.

major comments (2)

[Bootstrapped tiled denoising procedure] The bootstrapped tiled denoising procedure (described in the abstract) states that after latent upscaling and noise re-injection, 'denoising proceeds within fixed tiles' with no overlap, cross-tile attention, or consistency loss mentioned. This leaves no explicit mechanism by which the global latent can dominate per-tile stochastic trajectories, which is load-bearing for the central claim that global structure survives.
[Experiments] The abstract asserts that 'Experiments show that PhotoQuilt outperforms current baselines on both global structure and local realism' yet supplies no metrics, baselines, datasets, ablation results, or quantitative tables. Without these, the empirical support for the two main performance claims cannot be assessed.

minor comments (1)

Notation for the latent upscaling step and the precise noise re-injection schedule would benefit from an equation or pseudocode block for reproducibility.
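
For illustration only, here are two standard forms such a noise re-injection equation could take. Neither is confirmed by the abstract. The symbols are assumptions: z_up is the upscaled latent, ε is fresh Gaussian noise, s in [0, 1] is a hypothetical re-injection strength, and ᾱ_t is the cumulative noise-schedule product of a DDPM-style model.

```latex
% Flow-matching / rectified-flow style interpolation (assumed, not taken from the paper)
z_{s} = (1 - s)\, z_{\mathrm{up}} + s\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

% DDPM-style forward noising to timestep t (assumed, not taken from the paper)
z_{t} = \sqrt{\bar{\alpha}_{t}}\, z_{\mathrm{up}} + \sqrt{1 - \bar{\alpha}_{t}}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
```

In either form, a larger s or a later t leaves less of z_up in the starting point of each tile. That is the quantity the first major comment asks the authors to pin down.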

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and indicate planned revisions to strengthen the manuscript.

Referee: [Bootstrapped tiled denoising procedure] The bootstrapped tiled denoising procedure (described in the abstract) states that after latent upscaling and noise re-injection, 'denoising proceeds within fixed tiles' with no overlap, cross-tile attention, or consistency loss mentioned. This leaves no explicit mechanism by which the global latent can dominate per-tile stochastic trajectories, which is load-bearing for the central claim that global structure survives.

Authors: The mechanism is the shared upscaled latent: the low-resolution global composition is first generated to fix layout, then upscaled in latent space to provide a common structured initialization for every tile. Noise is re-injected uniformly to restore generative capacity while preserving the encoded global arrangement in the starting latent of each independent tile. Although denoising runs separately with no cross-tile operations, all trajectories begin from the same layout-conditioned latent, which biases outputs toward the fixed global structure. We will revise the method section to make this initialization process explicit, including a diagram of the latent flow from global upscaling to per-tile starting points.
revision: yes

Referee: [Experiments] The abstract asserts that 'Experiments show that PhotoQuilt outperforms current baselines on both global structure and local realism' yet supplies no metrics, baselines, datasets, ablation results, or quantitative tables. Without these, the empirical support for the two main performance claims cannot be assessed.

Authors: The current manuscript focuses on qualitative visual comparisons in the experiments. We agree that quantitative support is needed to substantiate the claims and will expand the section with specific baselines (e.g., naive high-resolution diffusion and overlapping patch tiling), datasets, metrics for global structure (layout consistency via keypoint matching or segmentation alignment) and local realism (perceptual metrics such as LPIPS or user studies), plus ablations on noise re-injection strength and tile size.
revision: yes

Circularity Check

0 steps flagged

No circularity: procedural method with independent empirical validation

The paper presents a training-free algorithmic procedure (low-res global composition, latent upscaling, noise re-injection, then independent tiled denoising) whose correctness is asserted via experiments on global structure and local realism rather than any derivation chain. No equations, fitted parameters, self-citations, or uniqueness theorems appear in the abstract or described method; the central claim does not reduce to a redefinition or statistical forcing of its inputs. The approach is therefore self-contained and externally falsifiable on standard image metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

axioms (1)

domain assumption: Diffusion models can produce coherent local images when denoising proceeds from partially noised latents that carry global structure.
Implicit foundation for the tiled denoising step described in the abstract.

[1] [1]

MultiDiffusion: Fusing diffusion paths for controlled image generation

Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. MultiDiffusion: Fusing diffusion paths for controlled image generation. InProceedings of the 40th International Confer- ence on Machine Learning (ICML), pages 1737–1752, 2023. 3, 4

2023

[2] [2]

A survey of digital mosaic techniques

Sebastiano Battiato, Gianpiero Di Blasi, Giovanni Maria Farinella, and Giovanni Gallo. A survey of digital mosaic techniques. InEurographics Italian Chapter Conference,

[3] [3]

FLUX: Text-to-image generation model

Black Forest Labs. FLUX: Text-to-image generation model. https://github.com/black-forest-labs/flux, 2024. Released August 2024. 3, 5, 12

2024

[4] [4]

FLUX.1 tools: Redux, fill, depth, canny

Black Forest Labs. FLUX.1 tools: Redux, fill, depth, canny. https://bfl.ai/flux-1-tools/, 2024. Released November 2024. 3, 12

2024

[5] [5]

FLUX.2: Frontier visual intelligence

Black Forest Labs. FLUX.2: Frontier visual intelligence. https://github.com/black-forest-labs/flux2, 2025. Released November 2025. 3, 5, 12

2025

[6] [6]

FLUX.1 kontext: Flow matching for in-context image generation and editing in latent space

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. FLUX.1 kontext: Flow matching for in-context i...

[7] [7]

Instructpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023. 2

2023

[8] [8]

PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In International Conference on Learning Representations (ICLR), 2024. 3

2024

[9] [9]

Generative photomosaic with structure-aligned and personalized diffusion

Jaeyoung Chung, Hyunjin Son, and Kyoung Mu Lee. Generative photomosaic with structure-aligned and personalized diffusion, 2026. 2, 5, 12

2026

[10] [10]

Evolution of animated photomosaics

Vic Ciesielski, Marsha Berry, Karen Trist, and Daryl D’Souza. Evolution of animated photomosaics. In Workshops on Applications of Evolutionary Computation, pages 498–507. Springer, 2007. 2

2007

[11] [11]

Diffusion-based image mosaics

Lars Doyle and David Mould. Diffusion-based image mosaics. In Proceedings of Graphics Interface, New York, NY, USA, 2026. Association for Computing Machinery. 2, 3

2026

[12] [12]

DemoFusion: Democratising high-resolution image generation with no $$$

Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, and Zhanyu Ma. DemoFusion: Democratising high-resolution image generation with no $$$. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6159–6168, 2024. 3, 4

2024

[13] [13]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the 41st International Conference on Machine Learning...

2024

[14] [14]

Image mosaics

Adam Finkelstein and Marisa Range. Image mosaics. In Electronic Publishing, Artistic Imaging, and Digital Typography (RIDT), pages 11–22. Springer, 1998. 2, 5, 12

1998

[15] [15]

Visual anagrams: Generating multi-view optical illusions with diffusion models

Daniel Geng, Inbum Park, and Andrew Owens. Visual anagrams: Generating multi-view optical illusions with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2

2024

[16] [16]

Factorized diffusion: Perceptual illusions by noise decomposition

Daniel Geng, Inbum Park, and Andrew Owens. Factorized diffusion: Perceptual illusions by noise decomposition. In European Conference on Computer Vision (ECCV), 2024. 2

2024

[17] [17]

Lens: Rethinking training efficiency for foundational text-to-image models

Baining Guo, Chong Luo, Dong Chen, Dongdong Chen, Fangyun Wei, Ji Li, Jianmin Bao, Jiawei Zhang, Jinjing Zhao, Lei Shi, Qinhong Yang, Sirui Zhang, Xiuyu Wu, Xuelu Feng, Yan Lu, Yanchen Dong, Yang Yue, Yitong Wang, Yunuo Chen, Zhiyang Liang, and Ziyu Wan. Lens: Rethinking training efficiency for foundational text-to-image models, 2026. 3

2026

[18] [18]

Composing photomosaic images using clustering based evolutionary programming

Yaodong He, Jianfeng Zhou, and Shiu Yin Yuen. Composing photomosaic images using clustering based evolutionary programming. Multimedia Tools and Applications, 78(18):25919–25936, 2019. 2

2019

[19] [19]

ScaleCrafter: Tuning-free higher-resolution visual generation with diffusion models

Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. ScaleCrafter: Tuning-free higher-resolution visual generation with diffusion models. In International Conference on Learning Representations (ICLR),

[20] [20]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. 3

2020

[21] [21]

FouriScale: A frequency perspective on training-free high-resolution image synthesis

Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. FouriScale: A frequency perspective on training-free high-resolution image synthesis. In European Conference on Computer Vision (ECCV), 2024. 3

2024

[22] [22]

Arbitrary style transfer in real-time with adaptive instance normalization

Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1501–1510, 2017. 5, 12

2017

[23] [23]

Ideogram 4.0: An open-weight text-to-image foundation model

Ideogram. Ideogram 4.0: An open-weight text-to-image foundation model. https://huggingface.co/ideogram-ai, 2026. Open-weight release, June 2026. 3

2026

[24] [24]

DiffuseHigh: Training-free progressive high-resolution image synthesis through structure guidance

Younghyun Kim, Geunmin Hwang, Junyu Zhang, and Eunbyung Park. DiffuseHigh: Training-free progressive high-resolution image synthesis through structure guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4338–4346, 2025. 3

2025

[25] [25]

StreamDiffusion: A pipeline-level solution for real-time interactive generation

Akio Kodaira, Chenfeng Xu, Toshiki Hazama, Takanori Yoshimoto, Kohei Ohno, Shogo Mitsuhori, Soichi Sugano, Hanying Cho, Zhijian Liu, and Kurt Keutzer. StreamDiffusion: A pipeline-level solution for real-time interactive generation, 2023. 5, 12

2023

[26] [26]

ScaleDiff: Higher-resolution image synthesis via efficient and model-agnostic diffusion

Sungho Koh, SeungJu Cha, Hyunwoo Oh, Kwanyoung Lee, and Dong-Jin Kim. ScaleDiff: Higher-resolution image synthesis via efficient and model-agnostic diffusion. In Advances in Neural Information Processing Systems (NeurIPS),

[27] [27]

Diffusion-based image-to-image translation by noise correction via prompt interpolation

Junsung Lee, Minsoo Kang, and Bohyung Han. Diffusion-based image-to-image translation by noise correction via prompt interpolation. In European Conference on Computer Vision, pages 289–304. Springer, 2024. 5

2024

[28] [28]

BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning (ICML), pages 12888–12900, 2022. 5

2022

[29] [29]

AccDiffusion: An accurate method for higher-resolution image generation

Zhihang Lin, Mingbao Lin, Meng Zhao, and Rongrong Ji. AccDiffusion: An accurate method for higher-resolution image generation. In European Conference on Computer Vision (ECCV), 2024. 3, 4

2024

[30] [30]

Flow matching for generative modeling

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations (ICLR), 2023. 3

2023

[31] [31]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations (ICLR), 2023. 3

2023

[32] [32]

SDEdit: Guided image synthesis and editing with stochastic differential equations

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations (ICLR), 2022. 3, 4

2022

[33] [33]

T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models, 2023. 3, 5

2023

[34] [34]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 3

2023

[35] [35]

Sdxl: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In International Conference on Learning Representations, pages 1862–1874, 2024. 2, 3

2024

[36] [36]

FreeScale: Unleashing the resolution of diffusion models via tuning-free scale fusion

Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, and Ziwei Liu. FreeScale: Unleashing the resolution of diffusion models via tuning-free scale fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 3

2025

[37] [37]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763, 2021. 5

2021

[38] [38]

SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

Javad Rajabi, Kimia Shaban, Koorosh Roohi, David B Lindell, and Babak Taati. Sega: Spectral-energy guided attention for resolution extrapolation in diffusion transformers. arXiv preprint arXiv:2605.22668, 2026. 3

2026

[39] [39]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022. 3, 5, 12

2022

[40] [40]

Color alignment in diffusion

Ka Chun Shum, Binh-Son Hua, Duc Thanh Nguyen, and Sai-Kit Yeung. Color alignment in diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28446–28455, 2025. 3

2025

[41] [41]

Photomosaics

Robert Silvers and Michael Hawley. Photomosaics. Henry Holt and Co., 1997. 2, 5, 12

1997

[42] [42]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020. 3

2020

[43] [43]

Is one GPU enough? pushing image generation at higher-resolutions with foundation models

Athanasios Tragakis, Marco Aversa, Chaitanya Kaul, Roderick Murray-Smith, and Daniele Faccio. Is one GPU enough? pushing image generation at higher-resolutions with foundation models. In Advances in Neural Information Processing Systems (NeurIPS), 2024. 3

2024

[44] [44]

Exploring CLIP for assessing the look and feel of images

Jianyi Wang, Kelvin C.K. Chan, and Chen Change Loy. Exploring CLIP for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2555–2563, 2023. 5

2023

[45] [45]

Image quality assessment: from error visibility to structural similarity

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004. 5

2004

[46] [46]

Qwen-image technical report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report, 2025. 3

2025

[47] [47]

Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. In International Conference on Learning Representations (ICLR), 2024. 5

2024

[48] [48]

SANA: Efficient high-resolution image synthesis with linear diffusion transformers

Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. SANA: Efficient high-resolution image synthesis with linear diffusion transformers. In International Conference on Learning Representations (ICLR),

[49] [49]

ImageReward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinghao Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In Advances in Neural Information Processing Systems (NeurIPS), 2023. 5

2023

[50] [50]

IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models, 2023. 3

2023
