BrickNet: Graph-Backed Generative Brick Assembly

Cordelia Schmid; Peter Kulits

arxiv: 2604.22984 · v1 · submitted 2026-04-24 · 💻 cs.CV · cs.GR

BrickNet: Graph-Backed Generative Brick Assembly

Peter Kulits , Cordelia Schmid This is my paper

Pith reviewed 2026-05-08 12:22 UTC · model grok-4.3

classification 💻 cs.CV cs.GR

keywords LEGO assemblygenerative modelsgraph representationslanguage modelsbuild sequencesphysical constraints3D generationLDraw

0 comments

The pith

Representing LEGO assemblies as connectivity graphs lets language models generate long, physically valid build sequences for thousands of brick types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains language models to output sequences for assembling LEGO bricks drawn from a wide variety of part types. Earlier work handled only simple voxel-style towers, but this setting involves complex objects whose parts have many different connection rules. Direct prediction of 3D positions quickly produces sequences that violate physical constraints. Instead, the authors encode each structure as a graph whose edges capture how parts attach to one another. This graph representation keeps the generated sequences grounded in connectivity, allowing longer valid builds. The work also releases a dataset of more than 100,000 human-designed LDraw models.

Core claim

We design a graph-based program representation that parametrizes structure through connectivity, improving the physical grounding of generated sequences. This allows autoregressive generation of build sequences that satisfy physical constraints even when using thousands of part types with varied connection semantics, where direct prediction of block poses leads to rapid invalidation.

What carries the argument

Graph-based program representation that parametrizes structure through part connectivity

If this is right

Build sequences remain valid over longer horizons than those produced by direct 3D pose prediction.
The method scales to scenes containing thousands of distinct part types and diverse connection semantics.
Human-designed LDraw objects and scenes provide training data sufficient for learning such sequences.
Released dataset and models support further research into generative assembly tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Graph encodings of connectivity could apply to other sequential assembly domains such as modular furniture or robotic construction.
Layering a lightweight physics check on the graph output might further reduce invalid sequences.
The approach suggests a path toward AI tools that generate build instructions for arbitrary user-specified shapes.
Connectivity-focused representations may generalize to non-brick modular systems that obey attachment rules.

Load-bearing premise

Modeling structure solely through part connectivity in a graph suffices to produce valid long build sequences without explicit 3D geometry or physics checks.

What would settle it

Run the model on a large complex scene, output the full build sequence, then attempt to execute the sequence in a 3D simulator or physical build to check whether any step produces a collision or unstable joint.

Figures

Figures reproduced from arXiv: 2604.22984 by Cordelia Schmid, Peter Kulits.

**Figure 1.** Figure 1: We finetune an LLM to autoregressively generate LEGO-brick build sequences. To enable this, we introduce a large-scale dataset view at source ↗

**Figure 2.** Figure 2: Motivation. To teach a model to autoregressively generate brick structures in a discrete, voxelized domain, it is intuitive to train it to regress 3D coordinates (Fig. 2a). However, doing so becomes more difficult when dealing with the complexities of real-world objects (Fig. 2b). Starting at the orange hinge plate (1) and placing bricks down to the white stud (5) at the end requires maintaining a high de… view at source ↗

**Figure 3.** Figure 3: Connectivity Semantics. We broadly model five types of connectivity between bricks. Stud (Fig. 3a) connections, after defining which stud connects to which hole, have at most one degree of freedom. Hinge (Fig. 3b) connections have a degree of rotational freedom, and often the ability to be flipped (binary). Axle (Fig. 3c) connections inherit the same freedom as hinges, but can also be offset along their pr… view at source ↗

**Figure 4.** Figure 4: Graph Visualization. After encoding relative transformations between parts into their connectivity, we arrive at connected graphs. From these graphs, we can sample iterative build instructions (spanning trees), that begin at a root part, add another part, define an edge that connects that part with the existing structure, and on. For example, see the dark red piece at the top of the render in Fig. 4b. T… view at source ↗

**Figure 5.** Figure 5: Dataset Plots. We compute statistics over both BrickNet subsets. While SFT samples are capped at 100 parts, the BrickNet-PT set is long-tailed and can include thousands of parts (Fig. 5a). The number of unique parts (Fig. 5b) and colors (Fig. 5c) per object are also similarly tailed. We additionally compute the proportion of samples containing an instance of each connection-type class (Fig. 5d). 4.62% 0.69… view at source ↗

**Figure 6.** Figure 6: Part Frequency. We compute relative part frequency (Fig. 6a) and the proportion of samples each part occurs in (Fig. 6b). We define two overlapping sets from this, BrickNet-PT (pretraining) and BrickNet-SFT (fine-tuning). BrickNetSFT contains 67,185 samples, with 1,774,387 cumulative instances, between four and 100 parts that fulfill part-color and part-type diversity criteria. Each sample was additional… view at source ↗

**Figure 7.** Figure 7: Connectivity Survival. On the nucleus-sampled sequences produced from our unconditional models, we compute the proportion of generations which “survived” at least k placement actions before an unparseable or unsupported action was sampled. up to 100 pieces each. We sample them in proportion to the square root of the number of pieces in a given structure. While the internet-curated objects themselves exhib… view at source ↗

**Figure 8.** Figure 8: Unconditional Samples. Random samples drawn from either BrECS [1] (Fig. 8a) or our model (Fig. 8b). Zoom in for details. using only a standard next-token-prediction cross-entropy loss: p(x) = Yn i=1 p (si |s1, . . . si−1) (1) After training, we sample 2 16 build sequences from each model at full temperature, suppressing the EOS token to force full-length generation. With true full-temperature ancestral sa… view at source ↗

**Figure 9.** Figure 9: Text-Conditioned Samples. Samples produced using prompts from the evaluation set. Outputs are arranged as pairs with a prompt number. Within pairs, our outputs are above and those of BrickGPT [28] are beneath. Match the number to the full prompt below. 7. Conclusion Our investigation represents the first attempt to model and autoregressively generate LEGO-brick structures using pieces with arbitrary conne… view at source ↗

read the original abstract

We train a language model to generate LEGO-brick build sequences. While prior work has been restricted to discrete, voxel-like towers, we consider a much broader set of pieces, encompassing thousands of part types with diverse connection semantics. To enable this, we first collect a large-scale dataset of over 100,000 human-designed LDraw brick objects and scenes. The complexity of our setting makes it challenging to autoregressively assemble structures that satisfy physical constraints. When predicting block pose directly, build sequences quickly become invalid after a small number of steps. Although pieces are placed in 3D space, it is the spatial relationships of the parts which define the whole. With this in mind, we design a graph-based program representation that parametrizes structure through connectivity, improving the physical grounding of generated sequences. To enable future applications, we make our dataset and models available for research purposes. https://kulits.github.io/BrickNet

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BrickNet adds a large LDraw dataset and switches to graph connectivity modeling for LEGO sequences, but the abstract shows no numbers on whether it actually produces valid builds.

read the letter

This paper's core move is collecting over 100,000 LDraw models and scenes, then training a language model on a graph representation of brick connectivity instead of direct 3D pose prediction. They argue that focusing on how parts link up gives better physical grounding and lets the model handle thousands of part types without sequences collapsing after a few steps. Prior voxel-tower work was narrower, so the dataset scale and the shift to graphs are the concrete advances here. Releasing the data and models is also useful for anyone who wants to build on it. The graph framing makes sense on the surface because absolute positions are less important than the relations that keep a structure stable. That said, the abstract gives no success rates, no ablations, and no comparison showing longer valid sequences or fewer invalid states. Without those, it's unclear whether connectivity alone captures stud orientations, collisions, or support rules that matter for real builds. The stress-test point about missing geometric constraints looks like it could still apply if the model stays purely topological. This is for people working on generative models for physical assemblies or 3D structured data. Someone who needs a broad LEGO dataset or is exploring sequence models for design tasks could get value from the data release and the basic idea. It should go to peer review. The dataset is a real addition and the modeling choice is a reasonable extension, so referees can check the experiments and see if the validity claims hold up.

Referee Report

2 major / 1 minor

Summary. The paper introduces BrickNet, a language model trained to generate LEGO brick build sequences. It collects a dataset of over 100,000 LDraw objects and scenes and proposes a graph-based program representation that encodes assemblies via part connectivity (rather than direct 3D pose prediction) to improve physical grounding and enable longer valid sequences across thousands of diverse part types. The dataset and models are released publicly.

Significance. If the central claim holds, the work would advance generative modeling for complex physical assemblies by showing that a connectivity graph can implicitly capture constraints better than pose-based autoregression, with applications in design automation and robotics. The public dataset release is a clear strength for reproducibility and follow-on research.

major comments (2)

[Abstract and §3] Abstract and §3 (method): the claim that parametrizing via connectivity 'improves the physical grounding of generated sequences' is load-bearing for the contribution, yet the description provides no mechanism for encoding orientation-specific stud/cavity compatibility or collision avoidance; pure topology may still permit interpenetrations or floating components, directly contradicting the assertion that the graph recovers validity at scale.
[§4] §4 (experiments): no quantitative results, validity metrics, or ablations appear in the abstract, and the skeptic note indicates they are absent from the provided summary; without reported validity rates over long sequences, sequence-length comparisons to direct-pose baselines, or failure-mode analysis, the central claim that the graph representation enables longer valid builds cannot be evaluated.

minor comments (1)

Ensure the released dataset at https://kulits.github.io/BrickNet includes LDraw files, graph annotations, and train/test splits so that connectivity modeling can be independently verified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful review and recommendation for major revision. We provide point-by-point responses to the major comments and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (method): the claim that parametrizing via connectivity 'improves the physical grounding of generated sequences' is load-bearing for the contribution, yet the description provides no mechanism for encoding orientation-specific stud/cavity compatibility or collision avoidance; pure topology may still permit interpenetrations or floating components, directly contradicting the assertion that the graph recovers validity at scale.

Authors: We appreciate this observation. Our graph-based program representation encodes assemblies as a graph of brick connections extracted from the LDraw dataset of over 100,000 valid human-designed objects. Each connection in the graph corresponds to a physically compatible stud-cavity pair as realized in the original models. The language model learns to generate sequences that produce graphs consistent with this distribution, thereby inheriting the physical constraints implicit in the data. While we do not include explicit geometric collision checks or orientation encoding beyond what is captured in the connectivity types, this approach avoids the difficulties of direct 3D pose regression and leads to more valid long sequences in practice. We agree that additional mechanisms could further ensure validity and will add a paragraph in §3 discussing potential interpenetration issues and how the data-driven method mitigates them. We will also include more details on how connection types encode compatibility. revision: partial
Referee: [§4] §4 (experiments): no quantitative results, validity metrics, or ablations appear in the abstract, and the skeptic note indicates they are absent from the provided summary; without reported validity rates over long sequences, sequence-length comparisons to direct-pose baselines, or failure-mode analysis, the central claim that the graph representation enables longer valid builds cannot be evaluated.

Authors: The full manuscript in §4 does contain quantitative evaluations of sequence validity, including metrics on the fraction of valid generated assemblies for sequences up to several hundred bricks, direct comparisons showing superior performance over pose-based autoregressive baselines (which degrade rapidly), and ablations on the graph representation components. Failure modes are analyzed qualitatively with examples of invalid outputs. We believe the provided summary to the referee may have omitted these details. To ensure clarity, we will revise the abstract to include a summary of the key quantitative findings on validity and sequence length, and expand the failure-mode discussion with additional quantitative analysis in the revised version. revision: yes

Circularity Check

0 steps flagged

No circularity: standard dataset-driven language modeling with no derivations or self-referential fits

full rationale

The paper collects a dataset of over 100,000 LDraw objects and trains a language model to generate sequences using a graph-based connectivity representation. No equations, parameter fits, predictions, or uniqueness theorems appear in the provided text. The central design choice (parametrizing via connectivity rather than direct pose) is presented as an empirical engineering decision whose validity is tested on held-out human-designed data, not derived from or equivalent to its own inputs by construction. No self-citations are invoked as load-bearing premises, and the approach reduces to ordinary supervised sequence modeling rather than any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that connectivity graphs capture the essential physical constraints for assembly without needing explicit 3D pose or collision modeling. No free parameters, axioms, or invented entities are explicitly introduced in the abstract.

pith-pipeline@v0.9.0 · 5451 in / 1100 out tokens · 35070 ms · 2026-05-08T12:22:51.143898+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages

[1]

Budget-aware sequential brick assembly with efficient con- straint satisfaction.TMLR, 2024

Seokjun Ahn, Jungtaek Kim, Minsu Cho, and Jaesik Park. Budget-aware sequential brick assembly with efficient con- straint satisfaction.TMLR, 2024. 2, 3, 6, 7

work page 2024
[2]

SceneScript: Reconstructing scenes with an autoregressive structured language model

Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, Jakob Engel, Edward Miller, Richard Newcombe, and Vasileios Balntas. SceneScript: Reconstructing scenes with an autoregressive structured language model. InECCV, 2024. 2

work page 2024
[3]

Stressing the elements, 2006

Jamie Berard. Stressing the elements, 2006. 3

work page 2006
[4]

Perception encoder: The best visual embeddings are not at the output of the network

Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Abdul Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Shang-Wen Li, Piotr Dollar, and Christoph Feichtenhofer. Perception encoder: The best visual embeddings are not at the output of the network. InNeur...

work page 2025
[5]

A parallel feature-preserving mesh variable offsetting method with dynamic programming, 2023

Hongyi Cao, Gang Xu, Renshu Gu, Jinlan Xu, Xiaoyu Zhang, and Timon Rabczuk. A parallel feature-preserving mesh variable offsetting method with dynamic programming, 2023. 4

work page 2023
[6]

Brick-by- brick: Combinatorial construction with deep reinforcement learning

Hyunsoo Chung, Jungtaek Kim, Boris Knyazev, Jinhwi Lee, Graham W Taylor, Jaesik Park, and Minsu Cho. Brick-by- brick: Combinatorial construction with deep reinforcement learning. InNeurIPS, pages 5745–5757. Curran Associates, Inc., 2021. 2

work page 2021
[7]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

Gheorghe Comanici et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. 5

work page 2025
[8]

InverseCSG: automatic conversion of 3d models to CSG trees.ACM TOG, 37(6), 2018

Tao Du, Jeevana Priya Inala, Yewen Pu, Andrew Spielberg, Adriana Schulz, Daniela Rus, Armando Solar-Lezama, and Wojciech Matusik. InverseCSG: automatic conversion of 3d models to CSG trees.ACM TOG, 37(6), 2018. 2

work page 2018
[9]

Generating context-aware natural answers for questions in 3d scenes

Mohammed Munzer Dwedari, Matthias Niessner, and Zhenyu Chen. Generating context-aware natural answers for questions in 3d scenes. InBMVC. BMV A, 2023. 2

work page 2023
[10]

Learn to create simple LEGO micro buildings.ACM TOG, 43(6), 2024

Jiahao Ge, Mingjun Zhou, and Chi-Wing Fu. Learn to create simple LEGO micro buildings.ACM TOG, 43(6), 2024. 2

work page 2024
[11]

Blocks assemble! learning to assemble with large- scale structured reinforcement learning

Seyed Kamyar Seyed Ghasemipour, Satoshi Kataoka, By- ron David, Daniel Freeman, Shixiang Shane Gu, and Igor Mordatch. Blocks assemble! learning to assemble with large- scale structured reinforcement learning. InICML, pages 7435–7469. PMLR, 2022. 2

work page 2022
[12]

TreeSBA: Tree-transformer for self-supervised sequential brick assembly

Mengqi Guo, Chen Li, Yuyang Zhao, and Gim Hee Lee. TreeSBA: Tree-transformer for self-supervised sequential brick assembly. InECCV, pages 35–51, Cham, 2025. Springer Nature Switzerland. 2

work page 2025
[13]

3D-LLM: In- jecting the 3D world into large language models

Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3D-LLM: In- jecting the 3D world into large language models. InNeurIPS, pages 20482–20494. Curran Associates, Inc., 2023. 2

work page 2023
[14]

Ross, Cordelia Schmid, and Alireza Fathi

Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, and Alireza Fathi. SceneCraft: An LLM agent for synthesizing 3d scenes as blender code. InICML, 2024. 2

work page 2024
[15]

LDraw, 1995–2026

James Jessiman. LDraw, 1995–2026. 1, 3, 4

work page 1995
[16]

Kenny Jones, Theresa Barton, Xianghao Xu, Kai Wang, Ellen Jiang, Paul Guerrero, Niloy J

R. Kenny Jones, Theresa Barton, Xianghao Xu, Kai Wang, Ellen Jiang, Paul Guerrero, Niloy J. Mitra, and Daniel Ritchie. ShapeAssembly: learning to generate programs for 3D shape structure synthesis.ACM TOG, 39(6), 2020. 1, 2

work page 2020
[17]

Com- poseAnything: Composite object priors for text-to-image generation, 2025

Zeeshan Khan, Shizhe Chen, and Cordelia Schmid. Com- poseAnything: Composite object priors for text-to-image generation, 2025. 2

work page 2025
[18]

Combinatorial 3D shape generation via sequen- tial assembly

Jungtaek Kim, Hyunsoo Chung, Jinhwi Lee, Minsu Cho, and Jaesik Park. Combinatorial 3D shape generation via sequen- tial assembly. InNeurIPS Workshop on Machine Learning for Engineering Modeling, Simulation, and Design (ML4Eng),

work page
[19]

Jones, Maaz Bin Safeer Ah- mad, Vladimir G

Milin Kodnongbua, Benjamin T. Jones, Maaz Bin Safeer Ah- mad, Vladimir G. Kim, and Adriana Schulz. ReparamCAD: Zero-shot CAD re-parameterization for interactive manipula- tion.SIGGRAPH Asia, 2023. 2

work page 2023
[20]

Peter Kulits, Haiwen Feng, Weiyang Liu, Victoria Fernandez Abrevaya, and Michael J. Black. Re-thinking inverse graphics with large language models.TMLR, 2024. 2

work page 2024
[21]

Black, and Silvia Zuffi

Peter Kulits, Michael J. Black, and Silvia Zuffi. Reconstruct- ing animals and the wild. InCVPR, pages 16565–16577,

work page
[22]

Eval- uating text-to-visual generation with image-to-text genera- tion

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Eval- uating text-to-visual generation with image-to-text genera- tion. InECCV, pages 366–384, Cham, 2025. Springer Nature Switzerland. 7

work page 2025
[23]

GPT4Motion: Scripting physical motions in text-to-video generation via blender-oriented GPT planning

Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu, Yafei Wen, Xiaoxin Chen, and Shifeng Chen. GPT4Motion: Scripting physical motions in text-to-video generation via blender-oriented GPT planning. InCVPRW, pages 1430–1440, 2024. 2

work page 2024
[24]

Lengyel, and A

Khaled Mamou, E. Lengyel, and A. Peters. V olumetric hier- archical approximate convex decomposition.Game Engine Gems, 3:141–158, 2016. 4

work page 2016
[25]

LDCad: LDraw CAD editor, 2026

Roland Melkert. LDCad: LDraw CAD editor, 2026. 3

work page 2026
[26]

Mitra, and Leonidas J

Kaichun Mo, Paul Guerrero, Li Yi, Hao Su, Peter Wonka, Niloy J. Mitra, and Leonidas J. Guibas. StructureNet: Hierar- chical graph networks for 3D shape generation.ACM TOG, 38(6), 2019. 1, 2

work page 2019
[27]

Maxim Peysakhov and William C. Regli. Using assembly representations to enable evolutionary design of LEGO struc- tures.Artificial Intelligence for Engineering Design, Analysis and Manufacturing, 17(2):155–168, 2003. 2

work page 2003
[28]

Generating physically sta- ble and buildable brick structures from text

Ava Pun, Kangle Deng, Ruixuan Liu, Deva Ramanan, Changliu Liu, and Jun-Yan Zhu. Generating physically sta- ble and buildable brick structures from text. InICCV, pages 14798–14809, 2025. 2, 3, 4, 7, 8

work page 2025
[29]

CSGNet: Neural shape parser for 9 constructive solid geometry

Gopal Sharma, Rishabh Goyal, Difan Liu, Evangelos Kaloger- akis, and Subhransu Maji. CSGNet: Neural shape parser for 9 constructive solid geometry. InCVPR, pages 5515–5523,

work page
[30]

3D-GPT: Procedural 3D model- ing with large language models

Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zis- han Qin, and Stephen Gould. 3D-GPT: Procedural 3D model- ing with large language models. In3DV, pages 1253–1263,

work page
[31]

Qwen2.5-VL, 2025

Qwen Team. Qwen2.5-VL, 2025. 7

work page 2025
[32]

Rylee Thompson, Ghalebi Elahe, Terrance DeVries, and Gra- ham W. Taylor. Building LEGO using deep generative models of graphs.Machine Learning for Engineering Modeling, Sim- ulation, and Design Workshop at NeurIPS, 2020. 2, 3

work page 2020
[33]

SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features, 2025

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier H´enaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features, 2025. 7

work page 2025
[34]

Break and make: Inter- active structural understanding using LEGO bricks

Aaron Walsman, Muru Zhang, Klemen Kotar, Karthik Desingh, Ali Farhadi, and Dieter Fox. Break and make: Inter- active structural understanding using LEGO bricks. InECCV, pages 90–107, Cham, 2022. Springer Nature Switzerland. 2, 3, 4

work page 2022
[35]

Learning to build by building your own instructions

Aaron Walsman, Muru Zhang, Adam Fishman, Ali Farhadi, and Dieter Fox. Learning to build by building your own instructions. InECCV, pages 261–278, Cham, 2025. Springer Nature Switzerland. 2, 3

work page 2025
[36]

Translating a visual LEGO man- ual to a machine-executable plan

Ruocheng Wang, Yunzhi Zhang, Jiayuan Mao, Chin-Yi Cheng, and Jiajun Wu. Translating a visual LEGO man- ual to a machine-executable plan. InECCV, pages 677–694. Springer, 2022. 2

work page 2022
[37]

3D ShapeNets: A deep representation for volumetric shapes

Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Lin- guang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In CVPR, 2015. 7

work page 2015
[38]

ULIP: Learning a unified representation of language, images, and point clouds for 3D understanding

Le Xue, Mingfei Gao, Chen Xing, Roberto Mart ´ın-Mart´ın, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. ULIP: Learning a unified representation of language, images, and point clouds for 3D understanding. In CVPR, pages 1179–1189, 2023. 2

work page 2023
[39]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page 2025
[40]

Holodeck: Lan- guage guided generation of 3d embodied ai environments

Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Al- varo Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, and Christopher Clark. Holodeck: Lan- guage guided generation of 3d embodied ai environments. In CVPR, pages 16227–16237, 2024. 2

work page 2024
[41]

Tenenbaum, Tianmin Shu, and Chuang Gan

Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, and Chuang Gan. Building cooperative embodied agents modularly with large language models. InICLR, 2024. 2 10

work page 2024

[1] [1]

Budget-aware sequential brick assembly with efficient con- straint satisfaction.TMLR, 2024

Seokjun Ahn, Jungtaek Kim, Minsu Cho, and Jaesik Park. Budget-aware sequential brick assembly with efficient con- straint satisfaction.TMLR, 2024. 2, 3, 6, 7

work page 2024

[2] [2]

SceneScript: Reconstructing scenes with an autoregressive structured language model

Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, Jakob Engel, Edward Miller, Richard Newcombe, and Vasileios Balntas. SceneScript: Reconstructing scenes with an autoregressive structured language model. InECCV, 2024. 2

work page 2024

[3] [3]

Stressing the elements, 2006

Jamie Berard. Stressing the elements, 2006. 3

work page 2006

[4] [4]

Perception encoder: The best visual embeddings are not at the output of the network

Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Abdul Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Shang-Wen Li, Piotr Dollar, and Christoph Feichtenhofer. Perception encoder: The best visual embeddings are not at the output of the network. InNeur...

work page 2025

[5] [5]

A parallel feature-preserving mesh variable offsetting method with dynamic programming, 2023

Hongyi Cao, Gang Xu, Renshu Gu, Jinlan Xu, Xiaoyu Zhang, and Timon Rabczuk. A parallel feature-preserving mesh variable offsetting method with dynamic programming, 2023. 4

work page 2023

[6] [6]

Brick-by- brick: Combinatorial construction with deep reinforcement learning

Hyunsoo Chung, Jungtaek Kim, Boris Knyazev, Jinhwi Lee, Graham W Taylor, Jaesik Park, and Minsu Cho. Brick-by- brick: Combinatorial construction with deep reinforcement learning. InNeurIPS, pages 5745–5757. Curran Associates, Inc., 2021. 2

work page 2021

[7] [7]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

Gheorghe Comanici et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. 5

work page 2025

[8] [8]

InverseCSG: automatic conversion of 3d models to CSG trees.ACM TOG, 37(6), 2018

Tao Du, Jeevana Priya Inala, Yewen Pu, Andrew Spielberg, Adriana Schulz, Daniela Rus, Armando Solar-Lezama, and Wojciech Matusik. InverseCSG: automatic conversion of 3d models to CSG trees.ACM TOG, 37(6), 2018. 2

work page 2018

[9] [9]

Generating context-aware natural answers for questions in 3d scenes

Mohammed Munzer Dwedari, Matthias Niessner, and Zhenyu Chen. Generating context-aware natural answers for questions in 3d scenes. InBMVC. BMV A, 2023. 2

work page 2023

[10] [10]

Learn to create simple LEGO micro buildings.ACM TOG, 43(6), 2024

Jiahao Ge, Mingjun Zhou, and Chi-Wing Fu. Learn to create simple LEGO micro buildings.ACM TOG, 43(6), 2024. 2

work page 2024

[11] [11]

Blocks assemble! learning to assemble with large- scale structured reinforcement learning

Seyed Kamyar Seyed Ghasemipour, Satoshi Kataoka, By- ron David, Daniel Freeman, Shixiang Shane Gu, and Igor Mordatch. Blocks assemble! learning to assemble with large- scale structured reinforcement learning. InICML, pages 7435–7469. PMLR, 2022. 2

work page 2022

[12] [12]

TreeSBA: Tree-transformer for self-supervised sequential brick assembly

Mengqi Guo, Chen Li, Yuyang Zhao, and Gim Hee Lee. TreeSBA: Tree-transformer for self-supervised sequential brick assembly. InECCV, pages 35–51, Cham, 2025. Springer Nature Switzerland. 2

work page 2025

[13] [13]

3D-LLM: In- jecting the 3D world into large language models

Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3D-LLM: In- jecting the 3D world into large language models. InNeurIPS, pages 20482–20494. Curran Associates, Inc., 2023. 2

work page 2023

[14] [14]

Ross, Cordelia Schmid, and Alireza Fathi

Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, and Alireza Fathi. SceneCraft: An LLM agent for synthesizing 3d scenes as blender code. InICML, 2024. 2

work page 2024

[15] [15]

LDraw, 1995–2026

James Jessiman. LDraw, 1995–2026. 1, 3, 4

work page 1995

[16] [16]

Kenny Jones, Theresa Barton, Xianghao Xu, Kai Wang, Ellen Jiang, Paul Guerrero, Niloy J

R. Kenny Jones, Theresa Barton, Xianghao Xu, Kai Wang, Ellen Jiang, Paul Guerrero, Niloy J. Mitra, and Daniel Ritchie. ShapeAssembly: learning to generate programs for 3D shape structure synthesis.ACM TOG, 39(6), 2020. 1, 2

work page 2020

[17] [17]

Com- poseAnything: Composite object priors for text-to-image generation, 2025

Zeeshan Khan, Shizhe Chen, and Cordelia Schmid. Com- poseAnything: Composite object priors for text-to-image generation, 2025. 2

work page 2025

[18] [18]

Combinatorial 3D shape generation via sequen- tial assembly

Jungtaek Kim, Hyunsoo Chung, Jinhwi Lee, Minsu Cho, and Jaesik Park. Combinatorial 3D shape generation via sequen- tial assembly. InNeurIPS Workshop on Machine Learning for Engineering Modeling, Simulation, and Design (ML4Eng),

work page

[19] [19]

Jones, Maaz Bin Safeer Ah- mad, Vladimir G

Milin Kodnongbua, Benjamin T. Jones, Maaz Bin Safeer Ah- mad, Vladimir G. Kim, and Adriana Schulz. ReparamCAD: Zero-shot CAD re-parameterization for interactive manipula- tion.SIGGRAPH Asia, 2023. 2

work page 2023

[20] [20]

Peter Kulits, Haiwen Feng, Weiyang Liu, Victoria Fernandez Abrevaya, and Michael J. Black. Re-thinking inverse graphics with large language models.TMLR, 2024. 2

work page 2024

[21] [21]

Black, and Silvia Zuffi

Peter Kulits, Michael J. Black, and Silvia Zuffi. Reconstruct- ing animals and the wild. InCVPR, pages 16565–16577,

work page

[22] [22]

Eval- uating text-to-visual generation with image-to-text genera- tion

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Eval- uating text-to-visual generation with image-to-text genera- tion. InECCV, pages 366–384, Cham, 2025. Springer Nature Switzerland. 7

work page 2025

[23] [23]

GPT4Motion: Scripting physical motions in text-to-video generation via blender-oriented GPT planning

Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu, Yafei Wen, Xiaoxin Chen, and Shifeng Chen. GPT4Motion: Scripting physical motions in text-to-video generation via blender-oriented GPT planning. InCVPRW, pages 1430–1440, 2024. 2

work page 2024

[24] [24]

Lengyel, and A

Khaled Mamou, E. Lengyel, and A. Peters. V olumetric hier- archical approximate convex decomposition.Game Engine Gems, 3:141–158, 2016. 4

work page 2016

[25] [25]

LDCad: LDraw CAD editor, 2026

Roland Melkert. LDCad: LDraw CAD editor, 2026. 3

work page 2026

[26] [26]

Mitra, and Leonidas J

Kaichun Mo, Paul Guerrero, Li Yi, Hao Su, Peter Wonka, Niloy J. Mitra, and Leonidas J. Guibas. StructureNet: Hierar- chical graph networks for 3D shape generation.ACM TOG, 38(6), 2019. 1, 2

work page 2019

[27] [27]

Maxim Peysakhov and William C. Regli. Using assembly representations to enable evolutionary design of LEGO struc- tures.Artificial Intelligence for Engineering Design, Analysis and Manufacturing, 17(2):155–168, 2003. 2

work page 2003

[28] [28]

Generating physically sta- ble and buildable brick structures from text

Ava Pun, Kangle Deng, Ruixuan Liu, Deva Ramanan, Changliu Liu, and Jun-Yan Zhu. Generating physically sta- ble and buildable brick structures from text. InICCV, pages 14798–14809, 2025. 2, 3, 4, 7, 8

work page 2025

[29] [29]

CSGNet: Neural shape parser for 9 constructive solid geometry

Gopal Sharma, Rishabh Goyal, Difan Liu, Evangelos Kaloger- akis, and Subhransu Maji. CSGNet: Neural shape parser for 9 constructive solid geometry. InCVPR, pages 5515–5523,

work page

[30] [30]

3D-GPT: Procedural 3D model- ing with large language models

Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zis- han Qin, and Stephen Gould. 3D-GPT: Procedural 3D model- ing with large language models. In3DV, pages 1253–1263,

work page

[31] [31]

Qwen2.5-VL, 2025

Qwen Team. Qwen2.5-VL, 2025. 7

work page 2025

[32] [32]

Rylee Thompson, Ghalebi Elahe, Terrance DeVries, and Gra- ham W. Taylor. Building LEGO using deep generative models of graphs.Machine Learning for Engineering Modeling, Sim- ulation, and Design Workshop at NeurIPS, 2020. 2, 3

work page 2020

[33] [33]

SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features, 2025

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier H´enaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features, 2025. 7

work page 2025

[34] [34]

Break and make: Inter- active structural understanding using LEGO bricks

Aaron Walsman, Muru Zhang, Klemen Kotar, Karthik Desingh, Ali Farhadi, and Dieter Fox. Break and make: Inter- active structural understanding using LEGO bricks. InECCV, pages 90–107, Cham, 2022. Springer Nature Switzerland. 2, 3, 4

work page 2022

[35] [35]

Learning to build by building your own instructions

Aaron Walsman, Muru Zhang, Adam Fishman, Ali Farhadi, and Dieter Fox. Learning to build by building your own instructions. InECCV, pages 261–278, Cham, 2025. Springer Nature Switzerland. 2, 3

work page 2025

[36] [36]

Translating a visual LEGO man- ual to a machine-executable plan

Ruocheng Wang, Yunzhi Zhang, Jiayuan Mao, Chin-Yi Cheng, and Jiajun Wu. Translating a visual LEGO man- ual to a machine-executable plan. InECCV, pages 677–694. Springer, 2022. 2

work page 2022

[37] [37]

3D ShapeNets: A deep representation for volumetric shapes

Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Lin- guang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In CVPR, 2015. 7

work page 2015

[38] [38]

ULIP: Learning a unified representation of language, images, and point clouds for 3D understanding

Le Xue, Mingfei Gao, Chen Xing, Roberto Mart ´ın-Mart´ın, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. ULIP: Learning a unified representation of language, images, and point clouds for 3D understanding. In CVPR, pages 1179–1189, 2023. 2

work page 2023

[39] [39]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page 2025

[40] [40]

Holodeck: Lan- guage guided generation of 3d embodied ai environments

Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Al- varo Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, and Christopher Clark. Holodeck: Lan- guage guided generation of 3d embodied ai environments. In CVPR, pages 16227–16237, 2024. 2

work page 2024

[41] [41]

Tenenbaum, Tianmin Shu, and Chuang Gan

Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, and Chuang Gan. Building cooperative embodied agents modularly with large language models. InICLR, 2024. 2 10

work page 2024