StructDiff: A Structure-Preserving and Spatially Controllable Diffusion Model for Single-Image Generation
Pith reviewed 2026-05-10 14:52 UTC · model grok-4.3
The pith
StructDiff adds adaptive receptive fields and 3D positional encoding to a diffusion model so single-image generation preserves layout while allowing control over object positions and scales.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By combining an adaptive receptive field module with 3D positional encoding inside a single-scale diffusion model, StructDiff maintains the source image's global and local distributions while enabling direct manipulation of object positions, scales, and fine details through the positional prior.
What carries the argument
Adaptive receptive field module plus 3D positional encoding, which together balance global and local statistics and supply an explicit spatial prior for controllable generation.
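The paper does not spell out the form of its 3D positional encoding, but one plausible reading is a sinusoidal encoding over x, y, and a scale axis, concatenated as conditioning channels: shifting the coordinate grids or changing the scale value is then what moves or resizes generated content. A minimal numpy sketch under that assumption (`dim_per_axis` and the frequency range are arbitrary illustrative choices, not the paper's):

```python
import numpy as np

def sinusoidal_pe(coords, dim):
    """Encode scalar coordinates into `dim` sinusoidal channels."""
    freqs = np.exp(np.linspace(0.0, np.log(1000.0), dim // 2))
    angles = coords[..., None] * freqs            # (..., dim//2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def spatial_prior(h, w, scale, dim_per_axis=8):
    """Build an (h, w, 3*dim_per_axis) spatial prior from y, x, and scale.

    Translating the y/x grids or changing `scale` alters where and how
    large the conditioned content is asked to appear.
    """
    ys, xs = np.meshgrid(np.linspace(-1, 1, h),
                         np.linspace(-1, 1, w), indexing="ij")
    pe = [sinusoidal_pe(ys, dim_per_axis),
          sinusoidal_pe(xs, dim_per_axis),
          sinusoidal_pe(np.full((h, w), float(scale)), dim_per_axis)]
    return np.concatenate(pe, axis=-1)

prior = spatial_prior(64, 64, scale=0.5)
print(prior.shape)  # (64, 64, 24)
```

In a diffusion U-Net such channels would typically be concatenated to (or added into) intermediate feature maps; how StructDiff injects them is exactly the implementation detail the referee asks for.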
If this is right
- Generated samples retain structural layout even for images dominated by large rigid objects.
- Users can directly specify positions, scales, and local details of content without retraining.
- The same model applies to text-guided synthesis, image editing, outpainting, and paint-to-image tasks.
- An LLM-based criterion offers an automated alternative to existing objective metrics and user studies.
Where Pith is reading between the lines
- The spatial control mechanism may transfer to video or 3D generation where layout consistency across frames or views is required.
- If the positional encoding proves robust, creative tools could reduce reliance on large external training sets for domain-specific image synthesis.
- Combining the LLM evaluator with existing perceptual metrics could become a standard protocol for assessing internal-statistic models.
Load-bearing premise
The adaptive receptive field module and 3D positional encoding preserve both global and local image statistics while delivering reliable spatial control without creating new artifacts or requiring per-image retuning.
What would settle it
Generate images containing large rigid objects or strict spatial constraints under user-specified position or scale commands; if the outputs show structural distortions, misplaced elements, or detail loss relative to the source at rates no better than prior methods, the central claim fails.
Original abstract
This paper introduces StructDiff, a generative framework based on a single-scale diffusion model for single-image generation. Single-image generation aims to synthesize diverse samples with similar visual content to the source image by capturing its internal statistics, without relying on external data. However, existing methods often struggle to preserve the structural layout, especially for images with large rigid objects or strict spatial constraints. Moreover, most approaches lack spatial controllability, making it difficult to guide the structure or placement of generated content. To address these challenges, StructDiff introduces an adaptive receptive field module to maintain both global and local distributions. Building on this foundation, StructDiff incorporates 3D positional encoding (PE) as a spatial prior, allowing flexible control over positions, scale, and local details of generated objects. To our knowledge, this spatial control capability represents the first exploration of PE-based manipulation in single-image generation. Furthermore, we propose a novel evaluation criterion for single-image generation based on large language models (LLMs). This criterion specifically addresses the limitations of existing objective metrics and the high labor costs associated with user studies. StructDiff also demonstrates broad applicability across downstream tasks, such as text-guided image generation, image editing, outpainting, and paint-to-image synthesis. Extensive experiments demonstrate that StructDiff outperforms existing methods in structural consistency, visual quality, and spatial controllability. The project page is available at https://butter-crab.github.io/StructDiff/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces StructDiff, a single-scale diffusion model for single-image generation. It proposes an adaptive receptive field module to preserve both global and local distributions from the source image and incorporates 3D positional encoding as a spatial prior to enable controllable manipulation of object position, scale, and local details. The work claims this is the first exploration of PE-based manipulation in single-image generation, introduces an LLM-based evaluation criterion to address limitations of existing metrics and user studies, demonstrates applicability to downstream tasks including text-guided generation, editing, outpainting, and paint-to-image, and asserts through extensive experiments that it outperforms prior methods in structural consistency, visual quality, and spatial controllability.
Significance. If the central claims hold with proper validation, the contribution would be meaningful for single-image generation by addressing structural preservation challenges with rigid objects and adding explicit spatial controllability without external data or per-image retuning. The LLM-based evaluation metric could offer a practical alternative to costly user studies. The novelty of applying 3D PE manipulation in this setting is noted as a first exploration. However, the absence of any experimental details, baselines, quantitative results, or ablations in the manuscript text prevents assessment of whether these benefits are realized or whether the adaptive receptive field and 3D PE truly maintain internal statistics without artifacts.
Major comments (3)
- [Abstract / Experiments] The headline claim that 'Extensive experiments demonstrate that StructDiff outperforms existing methods in structural consistency, visual quality, and spatial controllability' is unsupported: the manuscript provides no baselines, datasets, quantitative metrics (e.g., FID, LPIPS, or structural-similarity scores), ablation studies, or qualitative comparisons. This directly undermines the central empirical assertion.
- [Method / Experiments] §3 (Method) and §4 (Experiments): The description of the adaptive receptive field module combined with 3D positional encoding does not include equations, implementation details, or analysis showing that the combination preserves the single-image internal statistics under spatial manipulation. Without this, it is impossible to evaluate the skeptic concern that the 3D PE injection may shift the learned distribution or require image-specific tuning, violating the single-image premise.
- [Evaluation] The proposed LLM-based evaluation criterion is introduced to address limitations of objective metrics, but no validation of the criterion itself (e.g., correlation with human judgments, prompt templates, or inter-LLM agreement) is provided, leaving its reliability unverified.
Minor comments (1)
- [Abstract] The abstract states the project page URL but the manuscript does not reference any supplementary material, code, or additional qualitative results that would allow verification of the spatial control claims.
Simulated Author's Rebuttal
We are grateful to the referee for the insightful comments and the recommendation for major revision. We have carefully considered each point and provide detailed responses below, outlining the revisions we plan to make to address the concerns.
Point-by-point responses
-
Referee: [Abstract / Experiments] The headline claim that 'Extensive experiments demonstrate that StructDiff outperforms existing methods in structural consistency, visual quality, and spatial controllability' is unsupported: the manuscript provides no baselines, datasets, quantitative metrics (e.g., FID, LPIPS, or structural-similarity scores), ablation studies, or qualitative comparisons. This directly undermines the central empirical assertion.
Authors: We acknowledge that the abstract's claim requires stronger textual support. Although Section 4 presents qualitative results and downstream task examples, we agree that explicit quantitative baselines, datasets, and metrics are insufficiently detailed. In the revised manuscript, we will expand the Experiments section to include quantitative comparisons using FID, LPIPS, and structural similarity scores, along with ablation studies and clear descriptions of baselines and datasets. This will directly substantiate the empirical assertions. revision: yes
-
Referee: [Method / Experiments] §3 (Method) and §4 (Experiments): The description of the adaptive receptive field module combined with 3D positional encoding does not include equations, implementation details, or analysis showing that the combination preserves the single-image internal statistics under spatial manipulation. Without this, it is impossible to evaluate the skeptic concern that the 3D PE injection may shift the learned distribution or require image-specific tuning, violating the single-image premise.
Authors: We agree that additional rigor is needed here. In the revision, we will add the full mathematical equations for the adaptive receptive field module and its integration with 3D positional encoding in Section 3. We will also include implementation details (e.g., network architecture, training procedure) and a dedicated analysis subsection demonstrating preservation of internal statistics under manipulation, including discussion of potential distribution shifts and why image-specific tuning is not required. revision: yes
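Pending those equations, one way to picture an adaptive receptive field is selective-kernel-style branch mixing (Selective Kernel Networks appears in the paper's reference list): parallel branches with different effective kernel sizes, blended per pixel by a gate. The sketch below is purely illustrative, not StructDiff's module — box filters stand in for learned conv branches, and the deviation-based gate is a hypothetical stand-in for a learned one:

```python
import numpy as np

def box_filter(x, k):
    """Mean filter with odd kernel size k — a stand-in for a conv
    branch whose receptive field is k x k."""
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    out = np.zeros_like(x, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += xp[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / (k * k)

def adaptive_receptive_field(x, ks=(3, 7)):
    """Blend small- and large-kernel branches with a per-pixel softmax
    gate: detailed regions lean on the small kernel (local statistics),
    smooth regions on the large one (global layout)."""
    branches = [box_filter(x, k) for k in ks]
    # Gate on negative deviation from each branch (hypothetical choice):
    # the branch closest to the input gets the largest weight.
    logits = np.stack([-np.abs(x - b) for b in branches])
    w = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    return (w * np.stack(branches)).sum(axis=0)
```

In the actual model the gate would be learned end-to-end inside the diffusion U-Net; the point of the sketch is only the per-pixel balance between receptive-field sizes that the "global and local distributions" claim implies.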
-
Referee: [Evaluation] The proposed LLM-based evaluation criterion is introduced to address limitations of objective metrics, but no validation of the criterion itself (e.g., correlation with human judgments, prompt templates, or inter-LLM agreement) is provided, leaving its reliability unverified.
Authors: We appreciate the referee highlighting the need for validation of the LLM-based criterion. In the revised manuscript, we will augment the Evaluation section with: (i) quantitative correlation results between LLM scores and human judgments on a held-out set of samples, (ii) the exact prompt templates employed, and (iii) inter-LLM agreement statistics (e.g., Cohen's kappa or percentage agreement across multiple models). These additions will verify the criterion's reliability. revision: yes
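The promised validation uses standard statistics. A self-contained sketch of both checks — Spearman rank correlation between LLM scores and human ratings, and Cohen's kappa between two LLM judges (tie handling is omitted from the ranking, so scores are assumed distinct):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation (no tie correction)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra, rb = ra - ra.mean(), rb - rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra**2).sum() * (rb**2).sum()))

def cohens_kappa(x, y, n_labels):
    """Agreement between two judges beyond chance, from integer labels."""
    conf = np.zeros((n_labels, n_labels))
    for i, j in zip(x, y):
        conf[i, j] += 1
    po = np.trace(conf) / conf.sum()                       # observed
    pe = (conf.sum(0) * conf.sum(1)).sum() / conf.sum()**2  # chance
    return float((po - pe) / (1 - pe))
```

A high Spearman value against held-out human ratings and a high kappa across independent LLM judges would be the concrete numbers the referee asks for.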
Circularity Check
No circularity in claimed derivation
Full rationale
The paper describes an architectural framework that combines a single-scale diffusion model with two added modules (adaptive receptive field and 3D positional encoding) plus an LLM-based evaluation criterion. No equations, parameter-fitting steps, or uniqueness theorems are presented that reduce the central claims to inputs by construction. The method is introduced as a novel combination of existing diffusion ideas with new components, and performance claims rest on experimental comparisons rather than any self-referential derivation or fitted-input renaming. The derivation chain is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Diffusion models can capture internal statistics of a single image for generation
Reference graph
Works this paper leans on
-
[1]
SinGAN: Learning a generative model from a single natural image,
T. R. Shaham, T. Dekel, and T. Michaeli, “SinGAN: Learning a generative model from a single natural image,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4570–4580
2019
-
[2]
SinDDM: A single image denoising diffusion model,
V. Kulikov, S. Yadin, M. Kleiner, and T. Michaeli, “SinDDM: A single image denoising diffusion model,” in International Conference on Machine Learning. PMLR, 2023, pp. 17920–17930
2023
-
[3]
SinDiffusion: Learning a diffusion model from a single natural image,
W. Wang, J. Bao, W. Zhou, D. Chen, D. Chen, L. Yuan, and H. Li, “SinDiffusion: Learning a diffusion model from a single natural image,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
2025
-
[4]
R. Wu, R. Liu, C. Vondrick, and C. Zheng, “Sin3DM: Learning a diffusion model from a single 3D textured shape,” arXiv preprint arXiv:2305.15399, 2023
-
[5]
SinFusion: Training diffusion models on a single image or video,
Y. Nikankin, N. Haim, and M. Irani, “SinFusion: Training diffusion models on a single image or video,” arXiv preprint arXiv:2211.11743, 2022
-
[6]
Denoising diffusion probabilistic models,
J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020
2020
-
[7]
High-resolution image synthesis with latent diffusion models,
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695
2022
-
[8]
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
B. F. Labs, S. Batifol, A. Blattmann, F. Boesel, S. Consul, C. Diagne, T. Dockhorn, J. English, Z. English, P. Esser, S. Kulal, K. Lacey, Y. Levi, C. Li, D. Lorenz, J. Müller, D. Podell, R. Rombach, H. Saini, A. Sauer, and L. Smith, “FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space,” 2025. [Online]. Available: h...
-
[9]
Deep internal learning: Deep learning from a single input,
T. Tirer, R. Giryes, S. Y. Chun, and Y. C. Eldar, “Deep internal learning: Deep learning from a single input,” arXiv preprint arXiv:2312.07425, 2023
-
[10]
Image-to-image translation with conditional adversarial networks,
P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134
2017
-
[11]
InGAN: Capturing and retargeting the "DNA" of a natural image,
A. Shocher, S. Bagon, P. Isola, and M. Irani, “InGAN: Capturing and retargeting the "DNA" of a natural image,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4492–4501
2019
-
[12]
Improved techniques for training single-image GANs,
T. Hinz, M. Fisher, O. Wang, and S. Wermter, “Improved techniques for training single-image GANs,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 1300–1309
2021
-
[13]
Patchwise generative convnet: Training energy-based models from a single natural image for internal learning,
Z. Zheng, J. Xie, and P. Li, “Patchwise generative convnet: Training energy-based models from a single natural image for internal learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2961–2970
2021
-
[14]
Drop the GAN: In defense of patches nearest neighbors as single image generative models,
N. Granot, B. Feinstein, A. Shocher, S. Bagon, and M. Irani, “Drop the GAN: In defense of patches nearest neighbors as single image generative models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13460–13469
2022
-
[15]
Generative modeling by estimating gradients of the data distribution,
Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” Advances in Neural Information Processing Systems, vol. 32, 2019
2019
-
[16]
Diffusion models beat GANs on image synthesis,
P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image synthesis,” Advances in Neural Information Processing Systems, vol. 34, pp. 8780–8794, 2021
2021
-
[17]
Generative adversarial networks,
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020
2020
-
[18]
Pixel recurrent neural networks,
A. Van Den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in International Conference on Machine Learning. PMLR, 2016, pp. 1747–1756
2016
-
[19]
Image super-resolution via iterative refinement,
C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image super-resolution via iterative refinement,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4713–4726, 2022
2022
-
[20]
Palette: Image-to-image diffusion models,
C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, and M. Norouzi, “Palette: Image-to-image diffusion models,” in ACM SIGGRAPH 2022 Conference Proceedings, 2022, pp. 1–10
2022
-
[21]
Hierarchical Text-Conditional Image Generation with CLIP Latents
A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with CLIP latents,” arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022
-
[22]
MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing,
M. Cao, X. Wang, Z. Qi, Y. Shan, X. Qie, and Y. Zheng, “MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22560–22570
2023
-
[23]
DragonDiffusion: Enabling drag-style manipulation on diffusion models,
C. Mou, X. Wang, J. Song, Y. Shan, and J. Zhang, “DragonDiffusion: Enabling drag-style manipulation on diffusion models,” arXiv preprint arXiv:2307.02421, 2023
-
[24]
Sine: Single image editing with text-to-image diffusion models,
Z. Zhang, L. Han, A. Ghosh, D. N. Metaxas, and J. Ren, “Sine: Single image editing with text-to-image diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6027–6037
2023
-
[25]
InstructPix2Pix: Learning to follow image editing instructions,
T. Brooks, A. Holynski, and A. A. Efros, “InstructPix2Pix: Learning to follow image editing instructions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18392–18402
2023
-
[26]
MMGInpainting: Multi-modality guided image inpainting based on diffusion models,
C. Zhang, W. Yang, X. Li, and H. Han, “MMGInpainting: Multi-modality guided image inpainting based on diffusion models,” IEEE Transactions on Multimedia, vol. 26, pp. 8811–8823, 2024
2024
-
[27]
DiffFashion: Reference-based fashion design with structure-aware transfer by diffusion models,
S. Cao, W. Chai, S. Hao, Y. Zhang, H. Chen, and G. Wang, “DiffFashion: Reference-based fashion design with structure-aware transfer by diffusion models,” IEEE Transactions on Multimedia, vol. 26, pp. 3962–3975, 2023
2023
-
[28]
AnimeDiff: Customized image generation of anime characters using diffusion model,
Y. Jiang, Q. Liu, D. Chen, L. Yuan, and Y. Fu, “AnimeDiff: Customized image generation of anime characters using diffusion model,” IEEE Transactions on Multimedia, 2024
2024
-
[29]
StableIdentity: Inserting anybody into anywhere at first sight,
Q. Wang, X. Jia, X. Li, T. Li, L. Ma, Y. Zhuge, and H. Lu, “StableIdentity: Inserting anybody into anywhere at first sight,” IEEE Transactions on Multimedia, 2025
2025
-
[30]
Truncate diffusion: Efficient video editing with low-rank truncate,
B. Qin, W. Ye, C. Zhang, Q. Yu, W. Zhang, S. Tang, and Y. Zhuang, “Truncate diffusion: Efficient video editing with low-rank truncate,” IEEE Transactions on Multimedia, 2025
2025
-
[31]
Efficient and robust video virtual try-on via enhanced multi-garment alignment,
Z. He, P. Chen, G. Zheng, G. Wang, X. Luo, L. Lin, and G. Li, “Efficient and robust video virtual try-on via enhanced multi-garment alignment,” IEEE Transactions on Multimedia, 2025
2025
-
[32]
Revealing directions for text-guided 3d face editing,
Z. Chen, Y. Yan, S. Liu, Y. Cheng, W. Zhao, L. Li, M. Bi, and X. Yang, “Revealing directions for text-guided 3d face editing,” IEEE Transactions on Multimedia, 2025
2025
-
[33]
AvatarMakeup: Realistic makeup transfer for 3D animatable head avatars,
Y. Zhong, X. Zhang, L. Liu, Y. Zhao, and Y. Wei, “AvatarMakeup: Realistic makeup transfer for 3D animatable head avatars,” arXiv preprint arXiv:2507.02419, 2025
-
[34]
T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,
C. Mou, X. Wang, L. Xie, Y. Wu, J. Zhang, Z. Qi, and Y. Shan, “T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 5, 2024, pp. 4296–4304
2024
-
[35]
Uni-ControlNet: All-in-one control to text-to-image diffusion models,
S. Zhao, D. Chen, Y.-C. Chen, J. Bao, S. Hao, L. Yuan, and K.-Y. K. Wong, “Uni-ControlNet: All-in-one control to text-to-image diffusion models,” Advances in Neural Information Processing Systems, vol. 36, pp. 11127–11150, 2023
2023
-
[36]
Adding conditional control to text-to-image diffusion models,
L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836–3847
2023
-
[37]
GLIGEN: Open-set grounded text-to-image generation,
Y. Li, H. Liu, Q. Wu, F. Mu, J. Yang, J. Gao, C. Li, and Y. J. Lee, “GLIGEN: Open-set grounded text-to-image generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22511–22521
2023
-
[38]
HumanSD: A native skeleton-guided diffusion model for human image generation,
X. Ju, A. Zeng, C. Zhao, J. Wang, L. Zhang, and Q. Xu, “HumanSD: A native skeleton-guided diffusion model for human image generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15988–15998
2023
-
[39]
Diffusion self-guidance for controllable image generation,
D. Epstein, A. Jabri, B. Poole, A. Efros, and A. Holynski, “Diffusion self-guidance for controllable image generation,” Advances in Neural Information Processing Systems, vol. 36, pp. 16222–16239, 2023
2023
-
[40]
LoRA: Low-rank adaptation of large language models
E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “LoRA: Low-rank adaptation of large language models,” ICLR, vol. 1, no. 2, p. 3, 2022
2022
-
[41]
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,” arXiv preprint arXiv:2208.01618, 2022
-
[42]
DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation,
N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, “DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22500–22510
2023
-
[43]
HyperDreamBooth: Hypernetworks for fast personalization of text-to-image models,
N. Ruiz, Y. Li, V. Jampani, W. Wei, T. Hou, Y. Pritch, N. Wadhwa, M. Rubinstein, and K. Aberman, “HyperDreamBooth: Hypernetworks for fast personalization of text-to-image models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 6527–6536
2024
-
[44]
SSR-Encoder: Encoding selective subject representation for subject-driven generation,
Y. Zhang, Y. Song, J. Liu, R. Wang, J. Yu, H. Tang, H. Li, X. Tang, Y. Hu, H. Pan et al., “SSR-Encoder: Encoding selective subject representation for subject-driven generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8069–8078
2024
-
[45]
U-net: Convolutional networks for biomedical image segmentation,
O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241
2015
-
[46]
Selective kernel networks,
X. Li, W. Wang, X. Hu, and J. Yang, “Selective kernel networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 510–519
2019
-
[47]
Implicit neural representations with periodic activation functions,
V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein, “Implicit neural representations with periodic activation functions,” Advances in Neural Information Processing Systems, vol. 33, pp. 7462–7473, 2020
2020
-
[48]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763
2021
-
[49]
J. Choi, S. Kim, Y. Jeong, Y. Gwon, and S. Yoon, “ILVR: Conditioning method for denoising diffusion probabilistic models,” arXiv preprint arXiv:2108.02938, 2021
-
[50]
Very Deep Convolutional Networks for Large-Scale Image Recognition
K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014
-
[51]
MUSIQ: Multi-scale image quality transformer,
J. Ke, Q. Wang, Y. Wang, P. Milanfar, and F. Yang, “MUSIQ: Multi-scale image quality transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5148–5157
2021
-
[52]
Blind image quality assessment using a deep bilinear convolutional neural network,
W. Zhang, K. Ma, J. Yan, D. Deng, and Z. Wang, “Blind image quality assessment using a deep bilinear convolutional neural network,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 1, pp. 36–47, 2018
2018
-
[53]
The unreasonable effectiveness of deep features as a perceptual metric,
R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595
2018
-
[54]
Image quality assessment: from error visibility to structural similarity,
Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004
2004
-
[55]
Introducing 4o image generation,
OpenAI, “Introducing 4o image generation,” 2025. [Online]. Available: https://openai.com/index/introducing-4o-image-generation/
2025
-
[56]
Modification takes courage: Seamless image stitching via reference-driven inpainting,
Z. Xie, X. Lai, W. Zhao, S. Jiang, X. Liu, and W. Hou, “Modification takes courage: Seamless image stitching via reference-driven inpainting,” arXiv preprint arXiv:2411.10309, 2024
-
[57]
C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S. ming Yin, S. Bai, X. Xu, Y. Chen, Y. Chen, Z. Tang, Z. Zhang, Z. Wang, A. Yang, B. Yu, C. Cheng, D. Liu, D. Li, H. Zhang, H. Meng, H. Wei, J. Ni, K. Chen, K. Cao, L. Peng, L. Qu, M. Wu, P. Wang, S. Yu, T. Wen, W. Feng, X. Xu, Y. Wang, Y. Zhang, Y. Zhu, Y. Wu, Y. Cai, and Z. Liu, “Qwen-image technica...