Learning to Balance: Decoupled Siamese Diffusion Transformer for Reference-Based Remote Sensing Image Super-Resolution
Pith reviewed 2026-05-20 12:31 UTC · model grok-4.3
The pith
Decoupling attention paths in a Siamese diffusion transformer lets reference images supply texture without creating artifacts in remote sensing super-resolution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DS-DiT decouples low-resolution structural priors and reference texture information at the attention level inside a Siamese diffusion transformer so that each source interacts independently with the noisy latent. This separation reduces inter-source competition. A Patch-Level Weights module adaptively modulates the fusion to compensate for the limited local modeling of global attention. The architecture also enables an autoguidance strategy at inference time that exploits the prediction discrepancy between strong and weak reference conditions, improving reconstruction quality with no extra training.
What carries the argument
The decoupled attention mechanism in the Siamese diffusion transformer that separates low-resolution structural priors from reference texture information so each can condition the noisy latent on its own.
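The separation described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the token shapes, and the choice to reuse one set of Q/K/V projection weights for all token types are assumptions made for illustration. The noisy-latent tokens attend jointly with the low-resolution tokens in one path and jointly with the reference tokens in another, so neither condition appears in the other's softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(q, k, v):
    """Scaled dot-product attention over one joint token sequence."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def decoupled_attention(z, lr, ref, w_q, w_k, w_v):
    """z, lr, ref: (tokens, dim) arrays for noisy latent, LR, and reference.

    Returns the latent tokens updated by each path separately.
    """
    n = z.shape[0]

    def proj(x):
        # Assumption: one shared set of projection weights for all token types.
        return x @ w_q, x @ w_k, x @ w_v

    qz, kz, vz = proj(z)
    ql, kl, vl = proj(lr)
    qr, kr, vr = proj(ref)

    # Path A: latent + LR tokens only; reference tokens are absent.
    out_a = joint_attention(np.vstack([qz, ql]), np.vstack([kz, kl]), np.vstack([vz, vl]))
    # Path B: latent + reference tokens only; LR tokens are absent.
    out_b = joint_attention(np.vstack([qz, qr]), np.vstack([kz, kr]), np.vstack([vz, vr]))

    # Keep only the rows that belong to the noisy latent.
    return out_a[:n], out_b[:n]
```

In this sketch, changing the reference tokens leaves the first output untouched, which is the property the decoupling is meant to provide. How the two outputs are then merged is a separate design choice.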
If this is right
- The model reports higher quantitative scores than prior RefSR methods on multiple remote sensing datasets.
- Visual outputs exhibit fewer texture artifacts and recover more accurate fine details at large scaling factors.
- Autoguidance improves results during inference without requiring additional model training.
- Performance holds across different scaling factors and reference image qualities.
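The autoguidance bullet can be made concrete with a short sketch. The review does not state the exact form of the guidance used in the paper, so the linear extrapolation below, the function name `autoguide`, and the `scale` parameter are assumptions: the prediction under a weak reference condition is pushed toward, and past, the prediction under a strong reference condition.

```python
import numpy as np

def autoguide(pred_strong, pred_weak, scale):
    """Extrapolate from the weak-reference prediction through the strong one.

    scale = 0 returns the weak prediction, scale = 1 returns the strong
    prediction, and scale > 1 amplifies the discrepancy between them.
    """
    pred_strong = np.asarray(pred_strong, dtype=np.float64)
    pred_weak = np.asarray(pred_weak, dtype=np.float64)
    return pred_weak + scale * (pred_strong - pred_weak)
```

Because both predictions would come from the same network under different conditioning strengths, a step of this kind needs two forward passes per sampling step but no additional training.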
Where Pith is reading between the lines
- The same attention-level decoupling could be tested in other conditional diffusion settings where multiple inputs tend to compete, such as multi-modal image synthesis.
- Practical remote sensing pipelines might need less manual curation of reference images if the balancing mechanism generalizes.
- Applying the Patch-Level Weights idea to temporal or volumetric data could address similar source-conflict problems in video or 3D super-resolution.
Load-bearing premise
Separating the attention paths for low-resolution structure and reference texture, together with the patch-level weights module, is enough to stop the two sources from competing while still letting them contribute useful information to the reconstruction.
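One way to picture patch-level modulation is a per-patch gate that blends the two path outputs. The sketch below is hypothetical: the sigmoid gate, the linear gate parameters `w_gate` and `b_gate`, and the convex blend are illustrative assumptions, not the PLW module as the paper defines it.

```python
import numpy as np

def patch_level_fusion(z_lr, z_ref, w_gate, b_gate=0.0):
    """Blend two (patches, dim) feature maps with one weight per patch.

    w_gate: (2 * dim,) hypothetical gate weights applied to the
    concatenated features of each patch.
    """
    feats = np.concatenate([z_lr, z_ref], axis=-1)   # (patches, 2 * dim)
    logits = feats @ w_gate + b_gate                 # (patches,)
    alpha = 1.0 / (1.0 + np.exp(-logits))            # per-patch weight in (0, 1)
    fused = alpha[:, None] * z_ref + (1.0 - alpha[:, None]) * z_lr
    return fused, alpha
```

A gate near 1 lets a patch lean on reference texture, while a gate near 0 falls back to the low-resolution structure, which is the kind of local balancing the premise relies on.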
What would settle it
On a held-out remote sensing test set at 4x or 8x scaling, the method produces no improvement in PSNR, SSIM, or visual detail recovery compared with a standard non-decoupled diffusion RefSR baseline.
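For readers who want to run such a comparison, PSNR is defined as 10·log10(MAX² / MSE). The helper below is a minimal sketch; the function name and the default assumption that images are scaled to [0, 1] (`max_value=1.0`) are illustrative choices.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Computing this per test image for both the decoupled model and a non-decoupled baseline yields the paired scores that the falsification test calls for.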
Original abstract
Diffusion-based methods demonstrate significant potential for remote sensing image super-resolution at large scaling factors, particularly in reference-based super-resolution (RefSR) where high-resolution reference images provide critical fine-grained texture priors. However, existing methods often suffer from a trade-off between over-reliance on reference information, which leads to texture artifacts, and underutilization, which results in insufficient detail recovery. To address these issues, we propose DS-DiT, a Decoupled Siamese Diffusion Transformer method that decouples low-resolution and reference interactions at the attention level. By enabling low-resolution structural priors and reference texture information to interact independently with the noisy latent, the framework effectively mitigates inter-source competition. Furthermore, to compensate for the limited local modeling ability of global attention, we introduce a Patch-Level Weights (PLW) module that adaptively modulates the fusion of conditional sources. In addition, this siamese architecture facilitates an autoguidance strategy during inference, which enhances reconstruction by exploiting the prediction discrepancy between strong and weak reference conditions. This approach boosts generation quality without additional training. Experimental results across multiple datasets and scaling factors demonstrate that DS-DiT outperforms existing methods in both quantitative metrics and visual fidelity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DS-DiT, a Decoupled Siamese Diffusion Transformer for reference-based remote sensing image super-resolution. It decouples low-resolution structural priors and reference texture information so that each interacts independently with the noisy latent at the attention level, introduces a Patch-Level Weights (PLW) module to adaptively modulate conditional fusion, and employs an autoguidance strategy at inference that exploits prediction discrepancies between strong and weak reference conditions. The central claim is that these components mitigate inter-source competition while preserving useful interactions, yielding superior quantitative metrics and visual fidelity over prior methods across multiple datasets and scaling factors.
Significance. If the experimental claims hold after proper validation, the work offers a practical architectural route to balancing reference texture injection in diffusion-based RefSR for remote sensing, where large scaling factors make artifact control especially important. The siamese structure enabling training-free autoguidance is a lightweight addition worth noting. The contribution is incremental rather than foundational, as it builds on existing diffusion transformers and conditional mechanisms without introducing new theoretical derivations or parameter-free results.
major comments (3)
- [Abstract and §4] Abstract and §4: the claim that DS-DiT 'outperforms existing methods in both quantitative metrics and visual fidelity' is presented without any reported PSNR/SSIM/LPIPS values, dataset names, baseline lists, or statistical significance tests. This absence prevents evaluation of whether the reported gains are load-bearing or depend on unstated data splits or post-hoc tuning.
- [§3.2] §3.2 (Decoupled Attention): the premise that independent attention pathways for LR structural priors and reference texture 'effectively mitigates inter-source competition' lacks supporting diagnostics. No attention-map visualizations, cross-source correlation metrics, or controlled ablations isolating the decoupling (e.g., vs. standard cross-attention concatenation) are referenced, leaving open the possibility that the mechanism reduces to conventional conditioning plus the PLW module.
- [§3.3 and §4.3] §3.3 (PLW module) and §4.3 (Ablations): the assertion that PLW compensates for 'limited local modeling ability of global attention' requires explicit ablation tables showing performance drop when PLW is removed or replaced by a simple concatenation. Without these, the module's contribution to the balance claim cannot be isolated from the overall architecture.
minor comments (3)
- [§3.1] Clarify in §3.1 whether the siamese branches share weights or maintain separate parameters, and specify the exact form of the autoguidance scaling factor used at inference.
- [§4] Add a table in §4 listing all compared baselines with their original publication years and whether they were re-trained or used off-the-shelf on the same remote-sensing splits.
- [Figure 2] Ensure Figure 2 (architecture) explicitly annotates the separate key/query projections or masking used in the decoupled attention blocks.
Simulated Author's Rebuttal
We sincerely thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to improve clarity and support for our claims.
- Referee: [Abstract and §4] Abstract and §4: the claim that DS-DiT 'outperforms existing methods in both quantitative metrics and visual fidelity' is presented without any reported PSNR/SSIM/LPIPS values, dataset names, baseline lists, or statistical significance tests. This absence prevents evaluation of whether the reported gains are load-bearing or depend on unstated data splits or post-hoc tuning.
  Authors: We thank the referee for this point. Section 4 of the manuscript already contains detailed tables reporting PSNR, SSIM, and LPIPS across datasets such as AID and NWPU-RESISC45 for 4× and 8× scaling factors, with comparisons to multiple baselines. To address the abstract, we will revise it to explicitly state key quantitative gains (e.g., average PSNR improvements). We will also add statistical significance tests (e.g., paired t-tests or confidence intervals) to Section 4 in the revision to rule out concerns about data splits or tuning.
  revision: partial
- Referee: [§3.2] §3.2 (Decoupled Attention): the premise that independent attention pathways for LR structural priors and reference texture 'effectively mitigates inter-source competition' lacks supporting diagnostics. No attention-map visualizations, cross-source correlation metrics, or controlled ablations isolating the decoupling (e.g., vs. standard cross-attention concatenation) are referenced, leaving open the possibility that the mechanism reduces to conventional conditioning plus the PLW module.
  Authors: We agree that additional evidence would strengthen the decoupling claim. In the revised manuscript we will add attention-map visualizations for the separate LR and reference pathways, report cross-source correlation metrics, and include a controlled ablation that compares the decoupled siamese attention against standard concatenated cross-attention while holding all other components fixed.
  revision: yes
- Referee: [§3.3 and §4.3] §3.3 (PLW module) and §4.3 (Ablations): the assertion that PLW compensates for 'limited local modeling ability of global attention' requires explicit ablation tables showing performance drop when PLW is removed or replaced by a simple concatenation. Without these, the module's contribution to the balance claim cannot be isolated from the overall architecture.
  Authors: We appreciate the suggestion. We will expand the ablation study in Section 4.3 with new tables that quantify the performance drop when the PLW module is removed or replaced by simple concatenation. These results will isolate the module's contribution to local modulation and source balancing.
  revision: yes
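The paired t-test mentioned in the first response can be sketched with the standard library alone. The function name and return shape are illustrative assumptions; the inputs would be per-image scores (for example PSNR) of two methods evaluated on the same test images.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-image scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length sequences with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd_d == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean_d / (sd_d / math.sqrt(n))
    return t, n - 1
```

The statistic is then compared against a Student t distribution with the returned degrees of freedom. Pairing by image matters because per-image difficulty varies far more than the gap between methods.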
Circularity Check
Architectural design choices with no self-referential equations or fitted predictions
The paper introduces DS-DiT as a proposed architecture featuring decoupled attention at the Siamese level, a Patch-Level Weights module, and autoguidance during inference. These are presented as independent engineering decisions to address inter-source competition in RefSR, supported by experimental results on datasets. No equations reduce a claimed prediction or first-principles result back to fitted inputs by construction, and no uniqueness theorems or ansatzes are smuggled via self-citation. The central claims rest on empirical outperformance rather than tautological redefinitions, rendering the method self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Diffusion models conditioned on low-resolution and reference inputs can generate high-fidelity super-resolved images when inter-source competition is controlled.
invented entities (2)
- Decoupled Siamese Diffusion Transformer (DS-DiT): no independent evidence
- Patch-Level Weights (PLW) module: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "decouples low-resolution and reference interactions at the attention level... shared set of query, key, and value (Q, K, and V) matrices that are dispatched to two parallel joint attention paths"
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "Patch-Level Weights (PLW) module that adaptively modulates the fusion of conditional sources"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.