pith. sign in

arxiv: 2606.31018 · v1 · pith:CZQ6FVEBnew · submitted 2026-06-30 · 💻 cs.CV

WarpI2I: Image Warping for Image-to-Image Translation

Pith reviewed 2026-07-01 06:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords image-to-image translationlatent diffusion modelssaliency-guided warpingstructural preservationimage relightingdriving scene translationsynthetic data generationmodel-agnostic method
0
0 comments X

The pith

Saliency-guided warping before encoding lets diffusion models keep fine structures in I2I tasks without raising latent resolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a warp-unwarp method that uses a saliency map to distort the input image so that salient areas take up more of the pixel grid before it enters the encoder of a latent diffusion model. After the model performs the image-to-image translation on this warped version, an inverse warp restores the original geometry. This lets the compact latent space capture finer details in important regions. The approach requires no changes to the underlying model and adds very little computation. It is demonstrated on relighting tasks and scene translation, and the authors also describe a pipeline for creating synthetic paired training data using outpainting.

Core claim

By guiding a forward warp with a saliency map to concentrate spatial resolution on salient image regions before encoding, followed by processing in the original diffusion model and an inverse warp, the framework achieves better preservation of structural details in image-to-image translation tasks while keeping the latent resolution unchanged.

What carries the argument

The saliency-guided warp-unwarp framework that reallocates spatial representation toward salient regions before encoding into the latent space.

If this is right

  • Improved structural preservation, lighting faithfulness, and overall image quality on human relighting, driving scene relighting, and translation tasks.
  • Natural extension to video by applying the framework frame-by-frame while maintaining good temporal stability.
  • Applicability to any existing diffusion model without architectural modifications.
  • Negligible computational overhead during inference.
  • Availability of a simple outpainting-based method to generate high-quality paired data for relighting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar warping could help other tasks where latent compression loses critical local details, such as medical image analysis or satellite imagery.
  • Performance might improve further if the saliency detector is trained jointly with the translation model rather than used off-the-shelf.
  • The method's low overhead suggests it could be combined with other efficiency techniques like model distillation.
  • Testing on non-relighting I2I tasks like style transfer could reveal broader utility.

Load-bearing premise

The saliency map must accurately point to the image regions whose structural details are most important to keep, and the warping operations must not introduce distortions that the diffusion model cannot undo during translation.

What would settle it

Running the framework on a test set with known ground-truth structures and finding that the output images show equal or worse structural similarity scores than the unwarped baseline, or that unwarping produces visible artifacts not present in the baseline.

Figures

Figures reproduced from arXiv: 2606.31018 by Anurag Ghosh, Gaurav Parmar, Shen Zheng, Srinivasa Narasimhan.

Figure 1
Figure 1. Figure 1: Our warping significantly improves fine details of face regions, com [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our warp-unwarp approach for latent diffusion-based [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Our pipeline for generating synthetic paired training images. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Synthetic images generated from our pipeline [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Warping followed by unwarping introduces minimal information loss. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative Comparison on Human Relighting. [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparison on driving scene relighting. [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Warping preserves fine details better than no-warp img2img [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Our method vs. ChatGPT img2img for human relighting (moon [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Examples of ChatGPT-generated captions following our prompt [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Additional examples of ChatGPT-generated captions following our [PITH_FULL_IMAGE:figures/full_fig_p025_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Bad generated pairs filtered by ChatGPT. [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Additional qualitative comparison on human relighting (noon sun [PITH_FULL_IMAGE:figures/full_fig_p030_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Additional qualitative comparison on human relighting (moonlight) [PITH_FULL_IMAGE:figures/full_fig_p031_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Additional qualitative comparison on human relighting (foggy) [PITH_FULL_IMAGE:figures/full_fig_p032_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Additional qualitative comparison on human relighting (golden sun [PITH_FULL_IMAGE:figures/full_fig_p033_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Additional qualitative comparison on driving scene relighting. [PITH_FULL_IMAGE:figures/full_fig_p034_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Qualitative background improvement in weather and time-of-day [PITH_FULL_IMAGE:figures/full_fig_p036_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Additional qualitative background improvement in weather and [PITH_FULL_IMAGE:figures/full_fig_p037_19.png] view at source ↗
read the original abstract

Image-to-image (I2I) translation has achieved strong results in tasks like human relighting and driving scene translation using latent diffusion models (LDMs). However, compact LDMs often struggle to preserve fine-grained structures because the encoder compresses high-resolution inputs into a spatially downsampled latent space. To address this issue, we propose a simple saliency-guided warp-unwarp framework that reallocates spatial representation toward salient regions before encoding, enabling better preservation of structural details without increasing latent resolution. The warped image is processed by the original diffusion model and then mapped back via an inverse warp. In addition, we propose a simple and efficient outpainting-based synthetic data generation pipeline to produce high-quality paired data for image relighting. Our method is model-agnostic, requires no architectural modification, and introduces negligible computational overhead. Experiments on human relighting, driving scene relighting, and translation demonstrate improved structural preservation, lighting faithfulness, and image quality, with our framework extending naturally to video via frame-by-frame application with good temporal stability. Project Webpage: https://shenzheng2000.github.io/WarpI2I.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes WarpI2I, a saliency-guided warp-unwarp preprocessing framework for latent diffusion model-based image-to-image translation. The method reallocates spatial capacity toward salient regions in the input image before encoding to improve structural detail preservation (e.g., in human and driving-scene relighting) without enlarging the latent resolution; the diffusion model operates on the warped image and the output is inverse-warped. It additionally introduces an outpainting-based synthetic data pipeline for relighting pairs and reports extensions to video with temporal stability. The approach is presented as model-agnostic with negligible overhead.

Significance. If the central assumptions hold, the framework offers a lightweight, architecture-preserving way to boost structural fidelity and lighting faithfulness in existing LDM I2I pipelines. Strengths include its model-agnostic nature, the synthetic data pipeline, and the video extension; these could have practical impact in detail-sensitive translation tasks where increasing latent resolution is undesirable.

major comments (1)
  1. [Abstract] Abstract (framework description): the preservation claim requires that the saliency map correctly identifies regions whose details must be retained and that the forward/inverse warp pair introduces no unrecoverable artifacts. The abstract provides no information on saliency source, warp interpolation, or any round-trip reconstruction error metric, leaving the mechanism's contribution to the reported gains unverified.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and will incorporate revisions to strengthen the abstract's description of the framework.

read point-by-point responses
  1. Referee: [Abstract] Abstract (framework description): the preservation claim requires that the saliency map correctly identifies regions whose details must be retained and that the forward/inverse warp pair introduces no unrecoverable artifacts. The abstract provides no information on saliency source, warp interpolation, or any round-trip reconstruction error metric, leaving the mechanism's contribution to the reported gains unverified.

    Authors: We agree the abstract should more explicitly support the preservation claim. The full manuscript (Section 3.1) specifies that the saliency map is produced by a standard pre-trained saliency detector, forward and inverse warps employ bilinear interpolation, and round-trip reconstruction fidelity is quantified via PSNR/SSIM metrics (reported in the experiments and supplementary material). These elements verify that the warp-unwarp pair introduces only minimal recoverable artifacts and that salient regions receive the intended spatial reallocation. To address the concern directly in the abstract, we will revise it to include concise references to the saliency source, interpolation method, and reconstruction metrics while respecting length constraints. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper proposes a saliency-guided warp-unwarp preprocessing step for LDM-based I2I translation but contains no equations, fitted parameters, or self-citations that reduce the claimed structural preservation to a quantity defined by the method itself. The framework is presented as an external, model-agnostic addition whose benefits are asserted to follow from reallocation of spatial capacity; no derivation chain equates the output improvement to the input warp definition or to a self-citation. The central assumptions (saliency accuracy, warp invertibility) are external to any internal reduction and are not justified by construction within the paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; the central claim rests on the unstated premise that a computable saliency map exists whose warping improves latent encoding without side effects, plus standard assumptions about image warping invertibility.

axioms (1)
  • domain assumption Image warping operations are invertible and do not introduce irreversible information loss when applied to the input before latent encoding.
    Required for the un-warp step to recover the original structure after diffusion processing.

pith-pipeline@v0.9.1-grok · 5742 in / 1175 out tokens · 27897 ms · 2026-07-01T06:59:06.455391+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

68 extracted references · 10 canonical work pages · 5 internal anchors

  1. [1]

    Bar-Tal, O., Yariv, L., Lipman, Y., Dekel, T.: Multidiffusion: Fusing diffusion paths for controlled image generation (2023)

  2. [2]

    In: Seminal Graphics Papers: Pushing the Boundaries, Volume 2 (2023)

    Beier, T., Neely, S.: Feature-based image metamorphosis. In: Seminal Graphics Papers: Pushing the Boundaries, Volume 2 (2023)

  3. [3]

    https://cdn

    Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee,J.,Guo,Y.,etal.:Improvingimagegenerationwithbettercaptions.Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf (2023)

  4. [4]

    In: CVPR (2020)

    Bijelic, M., Gruber, T., Mannan, F., Kraus, F., Ritter, W., Dietmayer, K., Heide, F.: Seeing through fog without seeing fog: Deep multimodal sensor fusion in unseen adverse weather. In: CVPR (2020)

  5. [5]

    Demystifying MMD GANs

    Bińkowski, M., Sutherland, D.J., Arbel, M., Gretton, A.: Demystifying mmd gans. arXiv preprint arXiv:1801.01401 (2018)

  6. [6]

    In: AAAI (2025)

    Cha, J., Ren, M., Singh, K.K., Zhang, H., Hold-Geoffroy, Y., Yoon, S., Jung, H., Yoon, J.S., Baek, S.: Text2relight: Creative portrait relighting with text guidance. In: AAAI (2025)

  7. [7]

    In: ACML (2023)

    Changjie, L., Shen, Z., Zirui, W., Omar, D., Gaurav, G.: As-introvae: Adversarial similarity distance makes robust introvae. In: ACML (2023)

  8. [8]

    In: CVPR (2022)

    Chen, W.T., Huang, Z.K., Tsai, C.C., Yang, H.H., Ding, J.J., Kuo, S.Y.: Learning multiple adverse weather removal via two-stage knowledge learning and multi- contrastive regularization: Toward a unified model. In: CVPR (2022)

  9. [9]

    In: CVPR (2024)

    Cheng, T., Song, L., Ge, Y., Liu, W., Wang, X., Shan, Y.: Yolo-world: Real-time open-vocabulary object detection. In: CVPR (2024)

  10. [10]

    In: CVPR (2021)

    Choi, S., Park, S., Lee, M., Choo, J.: Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. In: CVPR (2021)

  11. [11]

    In: CVPR (2016)

    Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)

  12. [12]

    In: WACV (2025)

    Cui, A., Mahajan, J., Shah, V., Gomathinayagam, P., Liu, C., Lazebnik, S.: Street tryon: Learning in-the-wild virtual try-on from unpaired person images. In: WACV (2025)

  13. [13]

    Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping

    Dalal, D., Vashishtha, G., Mishra, U., Kim, J., Kanda, M., Ha, H., Lazebnik, S., Ji, H., Jain, U.: Constructive distortion: Improving mllms with attention-guided image warping. arXiv preprint arXiv:2510.09741 (2025)

  14. [14]

    In: CVPR (2019)

    Deng, J., Guo, J., Niannan, X., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR (2019)

  15. [15]

    In: CGF (2021)

    Einabadi, F., Guillemaut, J.Y., Hilton, A.: Deep neural models for illumination estimation and relighting: A survey. In: CGF (2021)

  16. [16]

    In: ICML (2024)

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: ICML (2024)

  17. [17]

    In: CVPR (2023)

    Ghosh, A., Reddy, N.D., Mertz, C., Narasimhan, S.G.: Learned two-plane perspec- tive prior based image resampling for efficient object detection. In: CVPR (2023)

  18. [18]

    In: ICCV (2025)

    Ghosh, A., Zheng, S., Tamburo, R., Vuong, K., Alvarez-Padilla, J., Zhu, H., Cardei, M., Dunn, N., Mertz, C., Narasimhan, S.G.: Roadwork: A dataset and benchmark for learning to recognize, observe, analyze and drive through work zones. In: ICCV (2025)

  19. [19]

    NeurIPS (2014) WarpI2I 17

    Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. NeurIPS (2014) WarpI2I 17

  20. [20]

    In: EMNLP (2021)

    Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., Choi, Y.: Clipscore: A reference- free evaluation metric for image captioning. In: EMNLP (2021)

  21. [21]

    NeurIPS (2017)

    Heusel,M.,Ramsauer,H.,Unterthiner,T.,Nessler,B.,Hochreiter,S.:Ganstrained by a two time-scale update rule converge to a local nash equilibrium. NeurIPS (2017)

  22. [22]

    NeurIPS (2020)

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS (2020)

  23. [23]

    JMLR (2022)

    Ho, J., Saharia, C., Chan, W., Fleet, D.J., Norouzi, M., Salimans, T.: Cascaded diffusion models for high fidelity image generation. JMLR (2022)

  24. [24]

    ICLR (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR (2022)

  25. [25]

    In: ECCV (2018)

    Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to- image translation. In: ECCV (2018)

  26. [26]

    NeurIPS (2015)

    Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. NeurIPS (2015)

  27. [27]

    In: ICCV (2025)

    Khan, M.A.H., Jain, Y., Bhattacharyya, S., Vineet, V.: Test-time prompt refine- ment for text-to-image models. In: ICCV (2025)

  28. [28]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dock- horn, T., English, J., English, Z., Esser, P., et al.: Flux. 1 kontext: Flow match- ing for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)

  29. [29]

    NeurIPS (2017)

    Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation net- works. NeurIPS (2017)

  30. [30]

    NeurIPS (2026)

    Liu, Y., Xiao, W., Wang, Q., Chen, J., Wang, S., Wang, Y., Wu, X., Tang, Y.: Dreamlight: Towards harmonious and consistent image relighting. NeurIPS (2026)

  31. [31]

    Midjourney, I.: Midjourney.https://www.midjourney.com(2024), accessed: 2026- 03-03

  32. [32]

    In: ECCV (2020)

    Park, T., Efros, A.A., Zhang, R., Zhu, J.Y.: Contrastive learning for unpaired image-to-image translation. In: ECCV (2020)

  33. [33]

    arXiv preprint arXiv:2403.12036 (2024)

    Parmar, G., Park, T., Narasimhan, S., Zhu, J.Y.: One-step image translation with text-to-image models. arXiv preprint arXiv:2403.12036 (2024)

  34. [34]

    In: CVPR (2022)

    Parmar, G., Zhang, R., Zhu, J.Y.: On aliased resizing and surprising subtleties in gan evaluation. In: CVPR (2022)

  35. [35]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

  36. [36]

    In: ICML (2021)

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML (2021)

  37. [37]

    In: ECCV (2018)

    Recasens, A., Kellnhofer, P., Stent, S., Matusik, W., Torralba, A.: Learning to zoom: a saliency-based sampling layer for neural networks. In: ECCV (2018)

  38. [38]

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al.: Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159 (2024)

  39. [39]

    In: CVPR (2022)

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)

  40. [40]

    NeurIPS (2022) 18 Zheng et al

    Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text- to-image diffusion models with deep language understanding. NeurIPS (2022) 18 Zheng et al

  41. [41]

    Sakaridis, C., Dai, D., Gool, L.V.: Guided curriculum model adaptation and uncertainty-awareevaluationforsemanticnighttimeimagesegmentation.In:ICCV (2019)

  42. [42]

    In: ICCV (2021)

    Sakaridis, C., Dai, D., Van Gool, L.: Acdc: The adverse conditions dataset with correspondences for semantic driving scene understanding. In: ICCV (2021)

  43. [43]

    arXiv preprint arXiv:2104.05358 (2021)

    Sasaki, H., Willcocks, C.G., Breckon, T.P.: Unit-ddpm: Unpaired image transla- tion with denoising diffusion probabilistic models. arXiv preprint arXiv:2104.05358 (2021)

  44. [44]

    arXiv preprint arXiv:2203.08382 (2022)

    Su, X., Song, J., Meng, C., Ermon, S.: Dual diffusion implicit bridges for image- to-image translation. arXiv preprint arXiv:2203.08382 (2022)

  45. [45]

    In: ICCV (2021)

    Thavamani, C., Li, M., Cebron, N., Ramanan, D.: Fovea: Foveated image magni- fication for autonomous navigation. In: ICCV (2021)

  46. [46]

    In: CVPR (2023)

    Thavamani, C., Li, M., Ferroni, F., Ramanan, D.: Learning to zoom and unzoom. In: CVPR (2023)

  47. [47]

    In: CVPR (2022)

    Tumanyan, N., Bar-Tal, O., Bagon, S., Dekel, T.: Splicing vit features for semantic appearance transfer. In: CVPR (2022)

  48. [48]

    arXiv preprint arXiv:2505.17581 (2025)

    Wang, H., Hu, Q., Guo, X.: Modem: A morton-order degradation estimation mech- anism for adverse weather image recovery. arXiv preprint arXiv:2505.17581 (2025)

  49. [49]

    arXiv preprint arXiv:2408.15374 (2024)

    Wang, T., Lin, Y.C.: Cyclegan with better cycles. arXiv preprint arXiv:2408.15374 (2024)

  50. [50]

    Wolberg, G.: Digital image warping, vol. 10662. IEEE computer society press Los Alamitos, CA (1990)

  51. [51]

    In: CVPR (2024)

    Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth anything: Un- leashing the power of large-scale unlabeled data. In: CVPR (2024)

  52. [52]

    In: ICCV (2017)

    Yi, Z., Zhang, H., Tan, P., Gong, M.: Dualgan: Unsupervised dual learning for image-to-image translation. In: ICCV (2017)

  53. [53]

    In: CVPR (2020)

    Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., Darrell, T.: Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In: CVPR (2020)

  54. [54]

    In: CVPR (2018)

    Yuan, Y., Liu, S., Zhang, J., Zhang, Y., Dong, C., Lin, L.: Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks. In: CVPR (2018)

  55. [55]

    In: CVPR (2022)

    Zhan, F., Zhang, J., Yu, Y., Wu, R., Lu, S.: Modulated contrast for versatile image synthesis. In: CVPR (2022)

  56. [56]

    In: NeurIPS (2023)

    Zhang, J., Herrmann, C., Hur, J., Cabrera, L.P., Jampani, V., Sun, D., Yang, M.H.: A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence. In: NeurIPS (2023)

  57. [57]

    In: ICLR (2025)

    Zhang, L., Rao, A., Agrawala, M.: Scaling in-the-wild training for diffusion-based illumination harmonization and editing by imposing consistent light transport. In: ICLR (2025)

  58. [58]

    In: CVPR (2018)

    Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)

  59. [59]

    ACM TOMM (2025)

    Zhang, X., Xu, Z., Tang, H., Gu, C., Chen, W., El Saddik, A.: Wakeup-darkness: When multimodal meets unsupervised low-light image enhancement. ACM TOMM (2025)

  60. [60]

    In: WACV (2025)

    Zheng, S., Ghosh, A., Narasimhan, S.: Instance-warp: Saliency guided image warp- ing for unsupervised domain adaptation. In: WACV (2025)

  61. [61]

    In: WACV (2024)

    Zheng, S., Lu, C., Narasimhan, S.G.: Tpsence: Towards artifact-free realistic rain generation for deraining and object detection in rain. In: WACV (2024)

  62. [62]

    In: ICML (2022) WarpI2I 19

    Zhou, D., Yu, Z., Xie, E., Xiao, C., Anandkumar, A., Feng, J., Alvarez, J.M.: Understanding the robustness in vision transformers. In: ICML (2022) WarpI2I 19

  63. [63]

    In: ICCV (2017) 20 Zheng et al

    Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV (2017) 20 Zheng et al. Overview This supplementary material provides additional context on key design decisions (Section A), robustness under detection error and saliency bandwidth variation (Section B), warp vs. no-warp o...

  64. [64]

    <lighting description>

    Does therightimage match the following lighting description? “<lighting description>” (Yes/No)

  65. [65]

    Do both themiddleandrightimages contain exactly one person? (Yes/No)

  66. [66]

    Same person? (Yes/No)

  67. [67]

    Same clothing? (Yes/No)

  68. [68]

    Otherwise, write No.” F.2 Evaluating Synthetic (T2I) Pairs Almost no images fail thelighting descriptiontest

    Same pose? (Yes/No) FINAL ANSWER:Yes (only if all answers are Yes). Otherwise, write No.” F.2 Evaluating Synthetic (T2I) Pairs Almost no images fail thelighting descriptiontest. About 1% fail theexactly one personcheck, 1% fail thesame personcheck, and 2% fail thesame clothing check. Overall, 96% of the generated pairs can be directly used for training, i...