arXiv: 2504.01571 · v2 · pith:LTYHSJWTnew · submitted 2025-04-02 · 💻 cs.GR · cs.AI · cs.CV · cs.LG

Pro-DG: Procedural Diffusion Guidance for Architectural Facade Generation

Pith reviewed 2026-05-22 22:09 UTC · model grok-4.3

classification 💻 cs.GR · cs.AI · cs.CV · cs.LG
keywords architectural facade generation · procedural modeling · diffusion models · ControlNet · structural editing · hierarchical alignment · inverse procedural modeling · image synthesis

The pith

Hierarchical procedural rules embedded in diffusion control maps allow structural edits to facades while preserving local appearance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a pipeline that starts with one facade image and its segmentation, then recovers the building's hierarchical layout through an inverse procedural step. This layout drives the creation of control maps inside a ControlNet-augmented Stable Diffusion model, so that procedural transformations such as floor duplication or window rearrangement can be applied. The control maps steer the generative process to keep textures and details consistent even when the overall structure changes substantially. The authors show that this approach produces more coherent results than standard inpainting techniques on both synthetic tests and real images. User studies and quantitative metrics are presented as evidence that architectural identity is better maintained.
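The structural-edit step can be pictured with a toy data structure. The sketch below is not the authors' implementation: the `Window`, `Floor`, and `Facade` classes and the `duplicate_floor` helper are hypothetical names chosen for illustration, and the recovered layout is assumed to be a plain stack of floors holding axis-aligned windows.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Window:
    x: int      # left edge in pixels, measured from the facade's left side
    width: int  # window width in pixels


@dataclass(frozen=True)
class Floor:
    height: int
    windows: Tuple[Window, ...]


@dataclass(frozen=True)
class Facade:
    width: int
    floors: Tuple[Floor, ...]  # ordered top to bottom


def duplicate_floor(facade: Facade, index: int) -> Facade:
    """Return a new facade with floor `index` repeated directly below itself."""
    floors = list(facade.floors)
    floors.insert(index + 1, floors[index])
    return replace(facade, floors=tuple(floors))


def total_height(facade: Facade) -> int:
    """Sum of all floor heights, i.e. the height of the edited layout."""
    return sum(floor.height for floor in facade.floors)


# Hypothetical two-floor facade (values are illustrative only).
upper = Floor(height=30, windows=(Window(10, 12), Window(40, 12)))
ground = Floor(height=40, windows=(Window(10, 20),))
facade = Facade(width=64, floors=(upper, ground))

# Duplicate the upper floor; the original layout is left untouched.
edited = duplicate_floor(facade, 0)
```

Keeping the layout immutable means every edit yields a new hierarchy, so the original and the edited version can both be turned into control maps and compared.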

Core claim

By integrating hierarchical alignment directly into control maps derived from procedural rules, the diffusion process can be guided to perform extensive structural modifications on facade imagery while maintaining local appearance fidelity.

What carries the argument

Hierarchical procedural control maps generated by an inverse procedural module and supplied to a ControlNet pipeline
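As a rough illustration of how a hierarchy might become a spatial control signal, the sketch below rasterizes a stack of floors into a label grid. This is an assumption-laden stand-in, not the paper's pipeline: the `rasterize` function, the label constants, and the tuple layout format are all hypothetical, and the actual control maps may take a different form (the caption of Figure 4 mentions Canny edges).

```python
# Hypothetical label values for the toy control map.
WALL, WINDOW = 1, 2


def rasterize(width, floors):
    """Turn a floor stack into a 2D label grid (list of rows).

    `floors` is ordered top to bottom. Each entry is (height, windows),
    where every window is (x, y, w, h) relative to that floor's top-left.
    """
    grid = []
    for height, windows in floors:
        # Start each floor as a solid band of wall pixels.
        rows = [[WALL] * width for _ in range(height)]
        for x, y, w, h in windows:
            # Clip each window to the floor band and the facade width.
            for r in range(max(y, 0), min(y + h, height)):
                for c in range(max(x, 0), min(x + w, width)):
                    rows[r][c] = WINDOW
        grid.extend(rows)
    return grid


# One floor with two windows, repeated to mimic a floor-duplication edit.
floor = (4, [(1, 1, 2, 2), (5, 1, 2, 2)])
grid = rasterize(8, [floor, floor])
window_pixels = sum(row.count(WINDOW) for row in grid)
```

Because the duplicated floor is rasterized from the same description, its windows land in the same columns as the original, which is the kind of alignment a hierarchy-aware control map is meant to enforce.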

If this is right

  • Floor duplication and window rearrangement become feasible while local textures remain consistent.
  • The generated images preserve architectural identity better than inpainting-based methods.
  • Quantitative benchmarks and user feedback demonstrate accurate, controllable edits.
  • The same pipeline supports multiple types of structural changes guided by the recovered hierarchy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Design software could expose these procedural controls to let non-experts create varied yet coherent building variants from one reference photo.
  • The approach might transfer to other image domains that contain repeating hierarchical structures, such as city blocks or patterned textiles.
  • If the inverse recovery step is made more robust to partial occlusions, the method could handle photographs taken under less ideal conditions.

Load-bearing premise

The inverse procedural module can reliably recover an accurate hierarchical layout and structural features from a single input image and its segmentation.

What would settle it

If blind user tests or quantitative metrics on a set of facades with varied hierarchies show no measurable improvement over inpainting baselines in structural accuracy or appearance consistency, the central claim would be falsified.
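One simple way to make "no measurable improvement" concrete for a blind pairwise user test is an exact sign test on preference counts. The sketch below is only an illustration of that idea: the function name is hypothetical, the test is not claimed to be the one used in the paper, and any counts passed to it would have to come from a real study.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test p-value under a 50/50 null hypothesis.

    `wins` counts trials where the method was preferred, `losses` where the
    baseline was preferred. Ties are assumed to have been dropped already.
    """
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double for the two-sided test, capped at 1.
    return min(1.0, 2 * tail)
```

A large p-value on a sufficiently large sample would be consistent with the "no measurable improvement" outcome described above, while a small one would count against it.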

Figures

Figures reproduced from arXiv: 2504.01571 by Aleksander Plocharski, Jan Swidzinski, Przemyslaw Musialski.

Figure 1: Pro-DG is a novel approach to guiding diffusion model …
Figure 2: The pipeline consists of two distinct elements: the Hier…
Figure 3: An example of a simplified procedural representation …
Figure 4: Example of the fully reconstructed Canny edges serve…
Figure 7: User study results showcasing if the users were partial …
Figure 9: Binary ablations. We perform an on/off analysis of two …
Figure 10: The showcase of a limitation of the method present …
Figure 11: Results of our method: Each row of five images begins with the original target image, followed by segmentation of variation 1 …
Figure 12: Showcase of how the user study looked to the user: (1) landing introductory page; (2) realism question; (3) appearance …
Figure 13: Results of our method: Each row of five images begins with the original target image, followed by segmentation of variation 1 …
Original abstract

We use hierarchical procedural rules for the generation of control maps within the stable diffusion framework to produce photo-realistic architectural facade images. Starting from a single input image and its segmentation, we apply an inverse procedural module to identify the facade's hierarchical layout. Leveraging this hierarchy and structural features, we introduce a novel ControlNet pipeline that generates new facade imagery guided by procedural transformations. Our method enables various structural edits, including floor duplication and window rearrangement, by integrating hierarchical alignment directly into control maps. This precisely guides the diffusion-based generative process, ensuring local appearance fidelity alongside extensive structural modifications. Comprehensive evaluations, including comparisons with inpainting-based approaches and synthetic benchmarks, confirm our approach's superior capability in preserving architectural identity and achieving accurate, controllable edits. Quantitative results and user feedback validate our method's effectiveness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Pro-DG, a framework that combines inverse procedural modeling with a ControlNet-augmented Stable Diffusion pipeline to generate and structurally edit architectural facade images. From a single input image and its segmentation, an inverse procedural module extracts a hierarchical layout, which is then transformed procedurally and used to condition the diffusion process via aligned control maps. The authors claim this enables edits such as floor duplication and window rearrangement while preserving local appearance, and that comparisons with inpainting baselines, synthetic benchmarks, quantitative metrics, and user studies show it to be superior.

Significance. If the inverse procedural recovery step proves accurate and generalizable, the method could offer a significant advance in controllable generation for architectural imagery by allowing procedural structural changes that go beyond pixel-level inpainting. This would be valuable for design exploration and visualization tasks. The integration of procedural hierarchies into diffusion control maps is a promising direction, though its impact depends on the reliability of the front-end recovery module, which is not quantitatively validated in the provided text.

major comments (2)
  1. [Abstract] The abstract states that 'comprehensive evaluations... confirm our approach's superior capability', yet no quantitative tables, metrics, error bars, or specific comparison details are included, preventing verification of the performance claims.
  2. [Method (Inverse Procedural Module)] The method's ability to perform structural edits like floor duplication relies on the inverse procedural module accurately recovering the hierarchical layout from a single image and segmentation. No evaluation metrics for this module's accuracy (such as IoU for floor/window detection or success rates on test cases) are referenced, which is a load-bearing assumption for the central claims.
minor comments (1)
  1. [Abstract] The term 'hierarchical alignment directly into control maps' is used without a brief definition or reference to the specific mechanism in the ControlNet pipeline.
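The second major comment asks for accuracy metrics such as IoU for floor and window detection. For reference, a minimal sketch of intersection-over-union for two axis-aligned boxes is given below. The `(x0, y0, x1, y1)` box format and the function name are assumptions made for illustration, and nothing here reflects how the authors evaluate their module.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1).

    Coordinates are assumed to satisfy x0 <= x1 and y0 <= y1.
    """
    # Corners of the overlap rectangle (may be empty).
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0
```

In a detection-style evaluation, a recovered window would typically be matched to a ground-truth window and counted as correct when this value exceeds a chosen threshold; the threshold itself is a design choice the evaluation would need to state.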

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where the manuscript's claims require stronger supporting evidence. We address each major comment below and commit to revisions that will improve the clarity and verifiability of the results.

  1. Referee: [Abstract] The abstract states that 'comprehensive evaluations... confirm our approach's superior capability', yet no quantitative tables, metrics, error bars, or specific comparison details are included, preventing verification of the performance claims.

    Authors: We agree that the abstract's summary claim would benefit from more concrete support to enable immediate verification. The current version does not embed specific metrics or tables within the abstract itself. We will revise the abstract to include brief references to key quantitative outcomes (e.g., reported improvements in perceptual metrics and user preference rates) and will ensure the Results section contains full tables with metrics, baselines, and error bars.

    Revision: yes

  2. Referee: [Method (Inverse Procedural Module)] The method's ability to perform structural edits like floor duplication relies on the inverse procedural module accurately recovering the hierarchical layout from a single image and segmentation. No evaluation metrics for this module's accuracy (such as IoU for floor/window detection or success rates on test cases) are referenced, which is a load-bearing assumption for the central claims.

    Authors: This observation is correct and highlights a gap in the current presentation. While the end-to-end edit quality provides indirect evidence, direct quantitative validation of the inverse procedural recovery step is absent. We will add a dedicated evaluation subsection reporting accuracy metrics, including IoU scores for floor and window detection as well as success rates on a test set of facade images, to substantiate the module's reliability.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: pipeline described without self-referential equations or fitted predictions

The abstract and description present a sequential pipeline: input image + segmentation -> inverse procedural module for hierarchy -> ControlNet with procedural transformations for edits. No equations, fitted parameters, or predictions are mentioned that reduce by construction to inputs. No self-citations, uniqueness theorems, or ansatzes are invoked. The central claims rest on the integration of hierarchical alignment into control maps, which is presented as an independent methodological step rather than a renaming or self-definition. This is the common case of a self-contained method description that is evaluated against external baselines, such as inpainting-based approaches.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied engineering paper in computer graphics. The abstract mentions no new mathematical axioms, free parameters, or invented physical entities; the method rests on standard Stable Diffusion and ControlNet components from prior literature.

pith-pipeline@v0.9.0 · 5673 in / 1064 out tokens · 56737 ms · 2026-05-22T22:09:50.362458+00:00 · methodology

