pith. sign in

arxiv: 2512.01242 · v3 · pith:6M6GD7VDnew · submitted 2025-12-01 · 💻 cs.CV · cs.AI· cs.CL

When Diffusion Breaks Constraints: Sequential Autoregressive Generation with RL and MCTS

Pith reviewed 2026-05-17 03:40 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CL
keywords diffusion modelsconstrained generationautoregressive generationreinforcement learningMonte Carlo tree searchtangram generationgeometric constraintsplanning and design tasks
0
0 comments X

The pith

Diffusion models fail to satisfy strict geometric constraints in planning tasks because continuous density matching cannot target low-dimensional feasible regions, while reformulating generation as sequential discrete choices with RL and M-

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why diffusion models produce invalid outputs in constrained tasks such as assembling tangram pieces into a language-described silhouette without overlaps or disconnections. Experiments on tangram generation and a simplified rectangle composition task with a learned bounding-box constraint show that diffusion struggles even with guidance or projection, as feasible solutions occupy small low-dimensional areas of the output space. The authors reformulate the problem as building outputs step by step through discrete autoregressive decisions. Reinforcement learning then trains policies to favor feasible configurations, and Monte Carlo tree search measures how much planning ahead helps when feasible regions become smaller. The combined evidence indicates a structural limitation for continuous density matching on this class of problems and points to sequential constraint-aware generation as an alternative.

Core claim

Diffusion models struggle to satisfy constraints in tasks such as tangram generation from language and rectangle composition with bounding-box constraints, consistent with difficulty generating samples near low-dimensional submanifolds. Reformulating constrained generation as discrete autoregressive sequential generation allows reinforcement learning to improve feasibility and task success, and Monte Carlo tree search to quantify the value of look-ahead when feasible regions shrink. The empirical, theoretical, and prior-work evidence points to a structural limitation of continuous density matching on this class of constrained-generation problems.

What carries the argument

Reformulation of constrained generation as discrete autoregressive sequential generation, where reinforcement learning improves feasibility and Monte Carlo tree search quantifies the value of look-ahead.

If this is right

  • Diffusion models will continue to exhibit severe constraint violations in engineering inverse design, molecular generation, multi-robot planning, and floorplan synthesis even when projection or guidance is applied.
  • Sequential autoregressive methods achieve higher rates of feasible and task-successful outputs on problems that require non-overlap, connectivity, and other geometric constraints.
  • The benefit of Monte Carlo tree search increases as the size of feasible regions shrinks in constrained tasks.
  • Locally feasible reparameterizations provide the motivation for moving from continuous density matching to discrete sequential modeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The sequential RL approach could be applied directly to molecular generation to enforce physical constraints that diffusion models routinely violate.
  • Hybrid pipelines might first use diffusion to propose creative starting points and then switch to sequential refinement to meet hard constraints in scene or floorplan synthesis.
  • Evaluating the method on multi-robot coordination tasks would test whether the tangram results generalize to problems involving dynamic spatial constraints.

Load-bearing premise

The failure modes observed in tangram generation from language and the simplified rectangle composition task are representative of the broader class of constrained planning and design tasks such as engineering inverse design, molecular generation, and multi-robot planning.

What would settle it

A controlled experiment in which diffusion models with guidance or projection achieve feasibility rates on tangram or molecular generation tasks that match or exceed those of the sequential RL-plus-MCTS method would falsify the structural-limitation claim.

Figures

Figures reproduced from arXiv: 2512.01242 by Boye Niu, David Hsu, Harold Soh, Wee Sun Lee, Zirui Zhao.

Figure 1
Figure 1. Figure 1: The Tangram assembly task. The seven pieces are placed [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Demonstration of the proposed Generative Adversarial [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 2
Figure 2. Figure 2: The demo of the actions in Tangram assembly. (a) The [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: The policy value network. We use a pre-trained Vision [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The demonstration of the dataset and the generated tan [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: One example question in the questionnaire. We ran [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
read the original abstract

Data-driven generative models excel in language and vision, but diffusion models often fail in constrained planning and design tasks, exhibiting severe constraint violations in engineering inverse design, molecular generation, multi-robot planning, and floorplan/scene synthesis even with projection or guidance. Such tasks combine hard-to-specify semantic goals with strict geometric or physical constraints (e.g., non-overlap, connectivity), yielding feasible solutions that lie on low-dimensional, small, and sometimes disconnected regions of the output space. This paper studies the failure mode through tangram generation from language, where seven fixed shapes must form a text-described silhouette while remaining connected and non-overlapping, and a simplified rectangle composition task with a learned bounding-box constraint. We find diffusion models struggle to satisfy constraints, consistent with difficulty generating samples near low-dimensional submanifolds. Motivated by locally feasible reparameterizations, we reformulate constrained generation as discrete autoregressive sequential generation. Reinforcement learning improves feasibility and task success, and Monte Carlo tree search quantifies the value of look-ahead when feasible regions shrink. Overall, the empirical, theoretical, and prior-work evidence points to a structural limitation of continuous density matching on this class of constrained-generation problems, and suggests sequential constraint-aware generation as a promising alternative.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that diffusion models exhibit structural limitations when generating samples in constrained planning and design tasks whose feasible sets are low-dimensional, small, or disconnected. It demonstrates this via language-conditioned tangram generation (seven fixed shapes forming a silhouette while satisfying connectivity and non-overlap) and a simplified rectangle composition task with a learned bounding-box constraint, then proposes reformulating the problem as discrete autoregressive sequential generation that is improved by reinforcement learning for feasibility and Monte Carlo tree search for look-ahead value.

Significance. If the central claim holds, the work supplies concrete empirical evidence and geometric motivation for why continuous density matching struggles on this class of problems and identifies sequential constraint-aware generation as a viable alternative. The use of RL and MCTS to improve feasibility on the proposed tasks is a concrete, reproducible strength that could inform hybrid methods in molecular generation, robotics, and inverse design.

major comments (2)
  1. [§4] §4 (Tangram experiments): the claim that observed constraint violations demonstrate a structural limitation of continuous density matching would be strengthened by an explicit control that varies the dimensionality of the feasible set while holding the diffusion architecture fixed; without it, the results remain consistent with the specific discrete geometry of tangram rather than the paradigm itself.
  2. [§5.2] §5.2 (Rectangle composition): the learned bounding-box constraint is a simplified proxy; the manuscript should report the fraction of samples that satisfy the constraint under standard diffusion guidance versus the sequential method, with error bars across at least three random seeds, to make the quantitative gap load-bearing for the broader conclusion.
minor comments (2)
  1. [§3] The abstract states that diffusion models fail 'even with projection or guidance' but does not cite the specific guidance or projection baselines used in the tangram and rectangle experiments; add these references in §3.
  2. [§6] Notation for the autoregressive factorization (e.g., the decomposition of the joint into sequential conditionals) should be introduced with an equation number in §6 to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the significance of our findings. The comments highlight opportunities to strengthen the evidence for a structural limitation of continuous density matching. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [§4] §4 (Tangram experiments): the claim that observed constraint violations demonstrate a structural limitation of continuous density matching would be strengthened by an explicit control that varies the dimensionality of the feasible set while holding the diffusion architecture fixed; without it, the results remain consistent with the specific discrete geometry of tangram rather than the paradigm itself.

    Authors: We agree that an explicit control varying feasible-set dimensionality while fixing the diffusion architecture would isolate the effect more cleanly. The tangram task was selected precisely because its feasible region is low-dimensional (seven rigid shapes subject to connectivity and non-overlap), but we recognize that the discrete geometry could be a confounding factor. In the revision we will add an ablation on the rectangle-composition task in which we vary the number of rectangles (and thus the intrinsic dimensionality of the feasible set) while keeping the identical diffusion backbone, training procedure, and guidance mechanism. This will provide the requested control and allow direct comparison of violation rates as dimensionality changes. revision: yes

  2. Referee: [§5.2] §5.2 (Rectangle composition): the learned bounding-box constraint is a simplified proxy; the manuscript should report the fraction of samples that satisfy the constraint under standard diffusion guidance versus the sequential method, with error bars across at least three random seeds, to make the quantitative gap load-bearing for the broader conclusion.

    Authors: We accept this recommendation. The current manuscript reports aggregate success rates but does not include per-seed fractions or error bars for the exact constraint-satisfaction metric. In the revised version we will add a table (or expanded figure caption) in §5.2 that reports the mean fraction of samples satisfying the learned bounding-box constraint for both standard diffusion guidance and the sequential RL+MCTS method, together with standard deviation across at least three independent random seeds. This change will make the quantitative gap statistically more robust and directly address the concern that the bounding-box task is only a simplified proxy. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation grounded in empirical observations and geometric arguments

full rationale

The paper motivates its claim of a structural limitation in continuous density matching by citing observed failures of diffusion models on the tangram and rectangle tasks, along with geometric arguments that feasible solutions occupy low-dimensional submanifolds. The shift to sequential autoregressive generation with RL and MCTS is presented as a direct response to these empirical patterns and reparameterization ideas rather than any quantity defined in terms of fitted parameters, self-referential equations, or load-bearing self-citations. No derivation step reduces to its inputs by construction, and the central conclusion retains independent content from the reported experiments and prior-work evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that feasible solutions occupy low-dimensional submanifolds, with no free parameters or invented entities introduced in the abstract.

axioms (1)
  • domain assumption Feasible solutions in constrained generation tasks lie on low-dimensional, small, and sometimes disconnected regions of the output space
    Invoked in the abstract to explain diffusion model failures on tasks like tangram and rectangle composition.

pith-pipeline@v0.9.0 · 5528 in / 1284 out tokens · 87410 ms · 2026-05-17T03:40:48.393355+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 2 internal anchors

  1. [1]

    Shape related constraints aware gen- eration of mechanical designs through deep convolutional gan.arXiv preprint arXiv:2010.11833, 2020

    Waad Almasri, Dimitri Bettebghor, Fakhreddine Ababsa, and Florence Danglade. Shape related constraints aware gen- eration of mechanical designs through deep convolutional gan.arXiv preprint arXiv:2010.11833, 2020. 2

  2. [2]

    Lexical entrainment without conceptual pacts? revisiting the matching task.Journal of Memory and Language, 114: 104129, 2020

    Adrian Bangerter, Eric Mayor, and Dominique Knutsen. Lexical entrainment without conceptual pacts? revisiting the matching task.Journal of Memory and Language, 114: 104129, 2020. 2

  3. [3]

    Recognition-by-components: A theory of human image understanding.Psychological Review, 94(2): 115–147, 1987

    Irving Biederman. Recognition-by-components: A theory of human image understanding.Psychological Review, 94(2): 115–147, 1987. 1

  4. [4]

    Ab- stract concepts: External influences, internal constraints, and methodological issues.Psychological Research, 86(8): 2370–2388, 2022

    Anna M Borghi, Samuel Shaki, and Martin H Fischer. Ab- stract concepts: External influences, internal constraints, and methodological issues.Psychological Research, 86(8): 2370–2388, 2022. 2

  5. [5]

    Interac- tion promotes the adaptation of referential conventions to the communicative context.Cognitive science, 43(8):e12780,

    Luc ´ıa Castillo, Kenny Smith, and Holly P Branigan. Interac- tion promotes the adaptation of referential conventions to the communicative context.Cognitive science, 43(8):e12780,

  6. [6]

    Decision transformer: Reinforce- ment learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srini- vas, and Igor Mordatch. Decision transformer: Reinforce- ment learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021. 13

  7. [7]

    Cops-ref: A new dataset and task on composi- tional referring expression comprehension

    Zhenfang Chen, Peng Wang, Lin Ma, Kwan-Yee K Wong, and Qi Wu. Cops-ref: A new dataset and task on composi- tional referring expression comprehension. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 10086–10095, 2020. 2

  8. [8]

    Constrained synthesis with projected diffusion models.Ad- vances in Neural Information Processing Systems, 37: 89307–89333, 2024

    Jacob K Christopher, Stephen Baek, and Nando Fioretto. Constrained synthesis with projected diffusion models.Ad- vances in Neural Information Processing Systems, 37: 89307–89333, 2024. 2

  9. [9]

    Referring as a collaborative process.Cognition, 22(1):1–39, 1986

    Herbert H Clark and Deanna Wilkes-Gibbs. Referring as a collaborative process.Cognition, 22(1):1–39, 1986. 2

  10. [10]

    Visual blending for concept representation: A case study on emoji generation.New Generation Comput- ing, 38(4):739–771, 2020

    Jo ˜ao M Cunha, Nuno Lourenc ¸o, Pedro Martins, and Pe- nousal Machado. Visual blending for concept representation: A case study on emoji generation.New Generation Comput- ing, 38(4):739–771, 2020. 2

  11. [11]

    Policy improvement by planning with gumbel

    Ivo Danihelka, Arthur Guez, Julian Schrittwieser, and David Silver. Policy improvement by planning with gumbel. InIn- ternational Conference on Learning Representations, 2022. 12

  12. [12]

    Diffusion models beat gans on image synthesis.Advances in neural informa- tion processing systems, 34:8780–8794, 2021

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.Advances in neural informa- tion processing systems, 34:8780–8794, 2021. 5

  13. [13]

    Development of shared information in communication despite hippocampal amnesia.Nature neuroscience, 9(1): 140–146, 2006

    Melissa C Duff, Julie Hengst, Daniel Tranel, and Neal J Co- hen. Development of shared information in communication despite hippocampal amnesia.Nature neuroscience, 9(1): 140–146, 2006. 2

  14. [14]

    Con- strained layout generation with factor graphs

    Mohammed Haroon Dupty, Yanfei Dong, Sicong Leng, Guoji Fu, Yong Liang Goh, Wei Lu, and Wee Sun Lee. Con- strained layout generation with factor graphs. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 12851–12860, 2024. 2

  15. [15]

    Clipdraw: Exploring text-to-drawing synthesis through language-image encoders.Advances in Neural Information Processing Sys- tems, 35:5207–5218, 2022

    Kevin Frans, Lisa Soros, and Olaf Witkowski. Clipdraw: Exploring text-to-drawing synthesis through language-image encoders.Advances in Neural Information Processing Sys- tems, 35:5207–5218, 2022. 2

  16. [16]

    Garey and David S

    Michael R. Garey and David S. Johnson.Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., San Francisco, CA, 1979. 1

  17. [17]

    Visual conceptual blending with large-scale language and vision models.arXiv preprint arXiv:2106.14127, 2021

    Songwei Ge and Devi Parikh. Visual conceptual blending with large-scale language and vision models.arXiv preprint arXiv:2106.14127, 2021. 2

  18. [18]

    Statistical theory of extreme valuse and some practical applications.Nat

    Emil Julius Gumbel. Statistical theory of extreme valuse and some practical applications.Nat. Bur. Standards Appl. Math. Ser. 33, 1954. 12

  19. [19]

    Denoising dif- fusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models. InAdvances in Neural Infor- mation Processing Systems (NeurIPS 33), pages 6840–6851,

  20. [20]

    Co-speech gesture mimicry in the process of collaborative referring during face-to-face dialogue.Journal of nonverbal behavior, 35:133–153, 2011

    Judith Holler and Katie Wilkin. Co-speech gesture mimicry in the process of collaborative referring during face-to-face dialogue.Journal of nonverbal behavior, 35:133–153, 2011. 2

  21. [21]

    Speakers’ expe- riences and audience design: Knowing when and knowing how to adjust utterances to addressees.Journal of Memory and Language, 47(4):589–606, 2002

    William S Horton and Richard J Gerrig. Speakers’ expe- riences and audience design: Knowing when and knowing how to adjust utterances to addressees.Journal of Memory and Language, 47(4):589–606, 2002. 2

  22. [22]

    de Lima, Silvano Martello, Fl´avio K

    Manuel Iori, Vin ´ıcius L. de Lima, Silvano Martello, Fl´avio K. Miyazawa, and Michele Monaci. Exact solution techniques for two-dimensional cutting and packing.Eu- ropean Journal of Operational Research, 289(2):399–415,

  23. [23]

    Vectorfusion: Text-to-svg by abstracting pixel-based diffusion models

    Ajay Jain, Amber Xie, and Pieter Abbeel. Vectorfusion: Text-to-svg by abstracting pixel-based diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 1911–1920, 2023. 2

  24. [24]

    Categorical Reparameterization with Gumbel-Softmax

    Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax.arXiv preprint arXiv:1611.01144, 2016. 12

  25. [25]

    Abstract visual reasoning with tangram shapes

    Anya Ji, Noriyuki Kojima, Noah Rush, Alane Suhr, Wai Keen V ong, Robert Hawkins, and Yoav Artzi. Abstract visual reasoning with tangram shapes. InProceedings of the 2022 Conference on Empirical Methods in Natural Lan- guage Processing, pages 582–601, 2022. 2

  26. [26]

    Scaling up visual and vision-language representa- tion learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR,

  27. [27]

    Almost op- timal exploration in multi-armed bandits

    Zohar Karnin, Tomer Koren, and Oren Somekh. Almost op- timal exploration in multi-armed bandits. InInternational 9 conference on machine learning, pages 1238–1246. PMLR,

  28. [28]

    Vilt: Vision- and-language transformer without convolution or region su- pervision

    Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision- and-language transformer without convolution or region su- pervision. InInternational conference on machine learning, pages 5583–5594. PMLR, 2021. 2

  29. [29]

    Stochastic beams and where to find them: The gumbel-top-k trick for sampling sequences without replacement

    Wouter Kool, Herke Van Hoof, and Max Welling. Stochastic beams and where to find them: The gumbel-top-k trick for sampling sequences without replacement. InInternational Conference on Machine Learning, pages 3499–3508. PMLR,

  30. [30]

    Changes in ref- erence phrases as a function of frequency of usage in social interaction: A preliminary study.Psychonomic Science, 1: 113–114, 1964

    Robert M Krauss and Sidney Weinheimer. Changes in ref- erence phrases as a function of frequency of usage in social interaction: A preliminary study.Psychonomic Science, 1: 113–114, 1964. 2

  31. [31]

    Differentiable vector graphics rasterization for editing and learning.ACM Transactions on Graphics (TOG), 39(6):1–15, 2020

    Tzu-Mao Li, Michal Luk ´aˇc, Micha ¨el Gharbi, and Jonathan Ragan-Kelley. Differentiable vector graphics rasterization for editing and learning.ACM Transactions on Graphics (TOG), 39(6):1–15, 2020. 2

  32. [32]

    Text-to-image genera- tion for abstract concepts

    Jiayi Liao, Xu Chen, Qiang Fu, Lun Du, Xiangnan He, Xiang Wang, Shi Han, and Dongmei Zhang. Text-to-image genera- tion for abstract concepts. InProceedings of the AAAI Con- ference on Artificial Intelligence, pages 3360–3368, 2024. 2

  33. [33]

    Mirror diffusion models for constrained and wa- termarked generation

    Guan-Horng Liu, Tianrong Chen, Evangelos Theodorou, and Molei Tao. Mirror diffusion models for constrained and wa- termarked generation. InThirty-seventh Conference on Neu- ral Information Processing Systems, 2023. 2

  34. [34]

    Opal: Multimodal image generation for news illustration

    Vivian Liu, Han Qiao, and Lydia Chilton. Opal: Multimodal image generation for news illustration. InProceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pages 1–17, 2022. 2

  35. [35]

    Object- centric learning with slot attention

    Francesco Locatello, Dirk Weissenborn, Thomas Un- terthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object- centric learning with slot attention. InAdvances in Neural In- formation Processing Systems (NeurIPS 33), pages 11525– 11538, 2020. 2

  36. [36]

    Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.Advances in neural information processing systems, 32, 2019

    Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.Advances in neural information processing systems, 32, 2019. 2

  37. [37]

    Towards layer- wise image vectorization

    Xu Ma, Yuqian Zhou, Xingqian Xu, Bin Sun, Valerii Filev, Nikita Orlov, Yun Fu, and Humphrey Shi. Towards layer- wise image vectorization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16314–16323, 2022. 2

  38. [38]

    Guglielmo Padula, Francesco Romor, Giovanni Stabile, and Gianluigi Rozza. Generative models for the deformation of industrial shapes with linear geometric constraints: Model order and parameter space reductions.Computer Methods in Applied Mechanics and Engineering, 423:116823, 2024. 2

  39. [39]

    Sketchgen: Generating constrained cad sketches.Advances in Neural Information Processing Systems, 34:5077–5088, 2021

    Wamiq Para, Shariq Bhat, Paul Guerrero, Tom Kelly, Niloy Mitra, Leonidas J Guibas, and Peter Wonka. Sketchgen: Generating constrained cad sketches.Advances in Neural Information Processing Systems, 34:5077–5088, 2021. 2

  40. [40]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the 38th International Conference on Machine Learning (ICML) 2021, pages 8748–8763, 2021. 2

  41. [41]

    Learn- ing transferable visual models from natural language super- vision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InInternational Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 2, 13

  42. [42]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, page †, 2022. † Fill in actual page numbers. 1

  43. [43]

    Styleclipdraw: Coupling content and style in text-to-drawing translation.arXiv preprint arXiv:2202.12362, 2022

    Peter Schaldenbrand, Zhixuan Liu, and Jean Oh. Styleclip- draw: Coupling content and style in text-to-drawing transla- tion.arXiv preprint arXiv:2202.12362, 2022. 2

  44. [44]

    Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020. 13

  45. [45]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Rad- ford, and Oleg Klimov. Proximal policy optimization algo- rithms.arXiv preprint arXiv:1707.06347, 2017. 2

  46. [46]

    Mastering the game of go without human knowledge.nature, 550(7676): 354–359, 2017

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lu- cas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge.nature, 550(7676): 354–359, 2017. 12

  47. [47]

    A general rein- forcement learning algorithm that masters chess, shogi, and go through self-play.Science, 362(6419):1140–1144, 2018

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Lau- rent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lill- icrap, Karen Simonyan, and Demis Hassabis. A general rein- forcement learning algorithm that masters chess, shogi, and go through self-play.Science, 362(6419):1140–1144, 2018. 2

  48. [48]

    Creative blends of visual concepts.arXiv preprint arXiv:2502.16062,

    Zhida Sun, Zhenyao Zhang, Yue Zhang, Min Lu, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. Creative blends of visual concepts.arXiv preprint arXiv:2502.16062,

  49. [49]

    Mcts based on simple regret

    David Tolpin and Solomon Shimony. Mcts based on simple regret. InProceedings of the AAAI Conference on Artificial Intelligence, pages 570–576, 2012. 12

  50. [50]

    Untersuchungen zur lehre von der gestalt

    Max Wertheimer. Untersuchungen zur lehre von der gestalt. ii.Psychologische Forschung, 4:301–350, 1923. In German. 1

  51. [51]

    Icon- shop: Text-guided vector icon synthesis with autoregressive transformers.ACM Transactions on Graphics (TOG), 42(6): 1–14, 2023

    Ronghuan Wu, Wanchao Su, Kede Ma, and Jing Liao. Icon- shop: Text-guided vector icon synthesis with autoregressive transformers.ACM Transactions on Graphics (TOG), 42(6): 1–14, 2023. 2

  52. [52]

    Vismantic: Meaning- making with images

    Ping Xiao and Simo Matias Linkola. Vismantic: Meaning- making with images. InInternational Conference on Com- putational Creativity, pages 158–165. Brigham Young Uni- versity, 2015. 2 10

  53. [53]

    Diffsketcher: Text guided vector sketch synthesis through latent diffusion models.Advances in Neu- ral Information Processing Systems, 36:15869–15889, 2023

    Ximing Xing, Chuang Wang, Haitao Zhou, Jing Zhang, Qian Yu, and Dong Xu. Diffsketcher: Text guided vector sketch synthesis through latent diffusion models.Advances in Neu- ral Information Processing Systems, 36:15869–15889, 2023. 2

  54. [54]

    John I Yellott Jr. The relationship between luce’s choice ax- iom, thurstone’s theory of comparative judgment, and the double exponential distribution.Journal of Mathematical Psychology, 15(2):109–144, 1977. 12

  55. [55]

    Monte carlo tree diffusion for system 2 planning.arXiv preprint arXiv:2502.07202, 2025

    Jaesik Yoon, Hyeonseo Cho, Doojin Baek, Yoshua Bengio, and Sungjin Ahn. Monte carlo tree diffusion for system 2 planning.arXiv preprint arXiv:2502.07202, 2025. 2

  56. [56]

    arXiv preprint arXiv:2502.05625 , year =

    Stefano Zampini, Jacob Christopher, Luca Oneto, Davide Anguita, and Ferdinando Fioretto. Training-free constrained generation with stable diffusion models.arXiv preprint arXiv:2502.05625, 2025. 2

  57. [57]

    Enforcing impre- cise constraints on generative adversarial networks for emu- lating physical systems.Communications in Computational Physics, 30(3):635–665, 2021

    Yang Zeng, Jin-Long Wu, and Heng Xiao. Enforcing impre- cise constraints on generative adversarial networks for emu- lating physical systems.Communications in Computational Physics, 30(3):635–665, 2021. 2 11 A. Preliminary: Gumbel Muzero Gumbel Muzero [11] is a variant of the AlphaZero [46] algorithm with a major modification in the MCTS algo- rithm. Inst...

  58. [58]

    Which image better matches the description below?

    is a bandit algorithm for simple regret minimisation. It aims at maximising theQ-value of the last or the finally selected action in simulations, which differs from the cu- mulative regrets that maximise theQfrom allnsimulations [49]. Algorithm 2Sequential Halving with Gumbel 1:procedureSEQHALGUMBEL(s, N, K, g, π θ, f, σ)▷ s: current state;N: simulation b...