BrickNet: Graph-Backed Generative Brick Assembly
Pith reviewed 2026-05-08 12:22 UTC · model grok-4.3
The pith
Representing LEGO assemblies as connectivity graphs lets language models generate long, physically valid build sequences for thousands of brick types.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We design a graph-based program representation that parametrizes structure through connectivity, improving the physical grounding of generated sequences. This allows autoregressive generation of build sequences that satisfy physical constraints even when using thousands of part types with varied connection semantics, where direct prediction of block poses leads to rapid invalidation.
What carries the argument
Graph-based program representation that parametrizes structure through part connectivity
If this is right
- Build sequences remain valid over longer horizons than those produced by direct 3D pose prediction.
- The method scales to scenes containing thousands of distinct part types and diverse connection semantics.
- Human-designed LDraw objects and scenes provide training data sufficient for learning such sequences.
- Released dataset and models support further research into generative assembly tasks.
Where Pith is reading between the lines
- Graph encodings of connectivity could apply to other sequential assembly domains such as modular furniture or robotic construction.
- Layering a lightweight physics check on the graph output might further reduce invalid sequences.
- The approach suggests a path toward AI tools that generate build instructions for arbitrary user-specified shapes.
- Connectivity-focused representations may generalize to non-brick modular systems that obey attachment rules.
Load-bearing premise
Modeling structure solely through part connectivity in a graph suffices to produce valid long build sequences without explicit 3D geometry or physics checks.
What would settle it
Run the model on a large complex scene, output the full build sequence, then attempt to execute the sequence in a 3D simulator or physical build to check whether any step produces a collision or unstable joint.
Figures
read the original abstract
We train a language model to generate LEGO-brick build sequences. While prior work has been restricted to discrete, voxel-like towers, we consider a much broader set of pieces, encompassing thousands of part types with diverse connection semantics. To enable this, we first collect a large-scale dataset of over 100,000 human-designed LDraw brick objects and scenes. The complexity of our setting makes it challenging to autoregressively assemble structures that satisfy physical constraints. When predicting block pose directly, build sequences quickly become invalid after a small number of steps. Although pieces are placed in 3D space, it is the spatial relationships of the parts which define the whole. With this in mind, we design a graph-based program representation that parametrizes structure through connectivity, improving the physical grounding of generated sequences. To enable future applications, we make our dataset and models available for research purposes. https://kulits.github.io/BrickNet
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BrickNet, a language model trained to generate LEGO brick build sequences. It collects a dataset of over 100,000 LDraw objects and scenes and proposes a graph-based program representation that encodes assemblies via part connectivity (rather than direct 3D pose prediction) to improve physical grounding and enable longer valid sequences across thousands of diverse part types. The dataset and models are released publicly.
Significance. If the central claim holds, the work would advance generative modeling for complex physical assemblies by showing that a connectivity graph can implicitly capture constraints better than pose-based autoregression, with applications in design automation and robotics. The public dataset release is a clear strength for reproducibility and follow-on research.
major comments (2)
- [Abstract and §3] Abstract and §3 (method): the claim that parametrizing via connectivity 'improves the physical grounding of generated sequences' is load-bearing for the contribution, yet the description provides no mechanism for encoding orientation-specific stud/cavity compatibility or collision avoidance; pure topology may still permit interpenetrations or floating components, directly contradicting the assertion that the graph recovers validity at scale.
- [§4] §4 (experiments): no quantitative results, validity metrics, or ablations appear in the abstract, and the skeptic note indicates they are absent from the provided summary; without reported validity rates over long sequences, sequence-length comparisons to direct-pose baselines, or failure-mode analysis, the central claim that the graph representation enables longer valid builds cannot be evaluated.
minor comments (1)
- Ensure the released dataset at https://kulits.github.io/BrickNet includes LDraw files, graph annotations, and train/test splits so that connectivity modeling can be independently verified.
Simulated Author's Rebuttal
We thank the referee for their insightful review and recommendation for major revision. We provide point-by-point responses to the major comments and outline the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (method): the claim that parametrizing via connectivity 'improves the physical grounding of generated sequences' is load-bearing for the contribution, yet the description provides no mechanism for encoding orientation-specific stud/cavity compatibility or collision avoidance; pure topology may still permit interpenetrations or floating components, directly contradicting the assertion that the graph recovers validity at scale.
Authors: We appreciate this observation. Our graph-based program representation encodes assemblies as a graph of brick connections extracted from the LDraw dataset of over 100,000 valid human-designed objects. Each connection in the graph corresponds to a physically compatible stud-cavity pair as realized in the original models. The language model learns to generate sequences that produce graphs consistent with this distribution, thereby inheriting the physical constraints implicit in the data. While we do not include explicit geometric collision checks or orientation encoding beyond what is captured in the connectivity types, this approach avoids the difficulties of direct 3D pose regression and leads to more valid long sequences in practice. We agree that additional mechanisms could further ensure validity and will add a paragraph in §3 discussing potential interpenetration issues and how the data-driven method mitigates them. We will also include more details on how connection types encode compatibility. revision: partial
-
Referee: [§4] §4 (experiments): no quantitative results, validity metrics, or ablations appear in the abstract, and the skeptic note indicates they are absent from the provided summary; without reported validity rates over long sequences, sequence-length comparisons to direct-pose baselines, or failure-mode analysis, the central claim that the graph representation enables longer valid builds cannot be evaluated.
Authors: The full manuscript in §4 does contain quantitative evaluations of sequence validity, including metrics on the fraction of valid generated assemblies for sequences up to several hundred bricks, direct comparisons showing superior performance over pose-based autoregressive baselines (which degrade rapidly), and ablations on the graph representation components. Failure modes are analyzed qualitatively with examples of invalid outputs. We believe the provided summary to the referee may have omitted these details. To ensure clarity, we will revise the abstract to include a summary of the key quantitative findings on validity and sequence length, and expand the failure-mode discussion with additional quantitative analysis in the revised version. revision: yes
Circularity Check
No circularity: standard dataset-driven language modeling with no derivations or self-referential fits
full rationale
The paper collects a dataset of over 100,000 LDraw objects and trains a language model to generate sequences using a graph-based connectivity representation. No equations, parameter fits, predictions, or uniqueness theorems appear in the provided text. The central design choice (parametrizing via connectivity rather than direct pose) is presented as an empirical engineering decision whose validity is tested on held-out human-designed data, not derived from or equivalent to its own inputs by construction. No self-citations are invoked as load-bearing premises, and the approach reduces to ordinary supervised sequence modeling rather than any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Budget-aware sequential brick assembly with efficient con- straint satisfaction.TMLR, 2024
Seokjun Ahn, Jungtaek Kim, Minsu Cho, and Jaesik Park. Budget-aware sequential brick assembly with efficient con- straint satisfaction.TMLR, 2024. 2, 3, 6, 7
work page 2024
-
[2]
SceneScript: Reconstructing scenes with an autoregressive structured language model
Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, Jakob Engel, Edward Miller, Richard Newcombe, and Vasileios Balntas. SceneScript: Reconstructing scenes with an autoregressive structured language model. InECCV, 2024. 2
work page 2024
- [3]
-
[4]
Perception encoder: The best visual embeddings are not at the output of the network
Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Abdul Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Shang-Wen Li, Piotr Dollar, and Christoph Feichtenhofer. Perception encoder: The best visual embeddings are not at the output of the network. InNeur...
work page 2025
-
[5]
A parallel feature-preserving mesh variable offsetting method with dynamic programming, 2023
Hongyi Cao, Gang Xu, Renshu Gu, Jinlan Xu, Xiaoyu Zhang, and Timon Rabczuk. A parallel feature-preserving mesh variable offsetting method with dynamic programming, 2023. 4
work page 2023
-
[6]
Brick-by- brick: Combinatorial construction with deep reinforcement learning
Hyunsoo Chung, Jungtaek Kim, Boris Knyazev, Jinhwi Lee, Graham W Taylor, Jaesik Park, and Minsu Cho. Brick-by- brick: Combinatorial construction with deep reinforcement learning. InNeurIPS, pages 5745–5757. Curran Associates, Inc., 2021. 2
work page 2021
-
[7]
Gheorghe Comanici et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. 5
work page 2025
-
[8]
InverseCSG: automatic conversion of 3d models to CSG trees.ACM TOG, 37(6), 2018
Tao Du, Jeevana Priya Inala, Yewen Pu, Andrew Spielberg, Adriana Schulz, Daniela Rus, Armando Solar-Lezama, and Wojciech Matusik. InverseCSG: automatic conversion of 3d models to CSG trees.ACM TOG, 37(6), 2018. 2
work page 2018
-
[9]
Generating context-aware natural answers for questions in 3d scenes
Mohammed Munzer Dwedari, Matthias Niessner, and Zhenyu Chen. Generating context-aware natural answers for questions in 3d scenes. InBMVC. BMV A, 2023. 2
work page 2023
-
[10]
Learn to create simple LEGO micro buildings.ACM TOG, 43(6), 2024
Jiahao Ge, Mingjun Zhou, and Chi-Wing Fu. Learn to create simple LEGO micro buildings.ACM TOG, 43(6), 2024. 2
work page 2024
-
[11]
Blocks assemble! learning to assemble with large- scale structured reinforcement learning
Seyed Kamyar Seyed Ghasemipour, Satoshi Kataoka, By- ron David, Daniel Freeman, Shixiang Shane Gu, and Igor Mordatch. Blocks assemble! learning to assemble with large- scale structured reinforcement learning. InICML, pages 7435–7469. PMLR, 2022. 2
work page 2022
-
[12]
TreeSBA: Tree-transformer for self-supervised sequential brick assembly
Mengqi Guo, Chen Li, Yuyang Zhao, and Gim Hee Lee. TreeSBA: Tree-transformer for self-supervised sequential brick assembly. InECCV, pages 35–51, Cham, 2025. Springer Nature Switzerland. 2
work page 2025
-
[13]
3D-LLM: In- jecting the 3D world into large language models
Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3D-LLM: In- jecting the 3D world into large language models. InNeurIPS, pages 20482–20494. Curran Associates, Inc., 2023. 2
work page 2023
-
[14]
Ross, Cordelia Schmid, and Alireza Fathi
Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, and Alireza Fathi. SceneCraft: An LLM agent for synthesizing 3d scenes as blender code. InICML, 2024. 2
work page 2024
- [15]
-
[16]
Kenny Jones, Theresa Barton, Xianghao Xu, Kai Wang, Ellen Jiang, Paul Guerrero, Niloy J
R. Kenny Jones, Theresa Barton, Xianghao Xu, Kai Wang, Ellen Jiang, Paul Guerrero, Niloy J. Mitra, and Daniel Ritchie. ShapeAssembly: learning to generate programs for 3D shape structure synthesis.ACM TOG, 39(6), 2020. 1, 2
work page 2020
-
[17]
Com- poseAnything: Composite object priors for text-to-image generation, 2025
Zeeshan Khan, Shizhe Chen, and Cordelia Schmid. Com- poseAnything: Composite object priors for text-to-image generation, 2025. 2
work page 2025
-
[18]
Combinatorial 3D shape generation via sequen- tial assembly
Jungtaek Kim, Hyunsoo Chung, Jinhwi Lee, Minsu Cho, and Jaesik Park. Combinatorial 3D shape generation via sequen- tial assembly. InNeurIPS Workshop on Machine Learning for Engineering Modeling, Simulation, and Design (ML4Eng),
-
[19]
Jones, Maaz Bin Safeer Ah- mad, Vladimir G
Milin Kodnongbua, Benjamin T. Jones, Maaz Bin Safeer Ah- mad, Vladimir G. Kim, and Adriana Schulz. ReparamCAD: Zero-shot CAD re-parameterization for interactive manipula- tion.SIGGRAPH Asia, 2023. 2
work page 2023
-
[20]
Peter Kulits, Haiwen Feng, Weiyang Liu, Victoria Fernandez Abrevaya, and Michael J. Black. Re-thinking inverse graphics with large language models.TMLR, 2024. 2
work page 2024
-
[21]
Peter Kulits, Michael J. Black, and Silvia Zuffi. Reconstruct- ing animals and the wild. InCVPR, pages 16565–16577,
-
[22]
Eval- uating text-to-visual generation with image-to-text genera- tion
Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Eval- uating text-to-visual generation with image-to-text genera- tion. InECCV, pages 366–384, Cham, 2025. Springer Nature Switzerland. 7
work page 2025
-
[23]
GPT4Motion: Scripting physical motions in text-to-video generation via blender-oriented GPT planning
Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu, Yafei Wen, Xiaoxin Chen, and Shifeng Chen. GPT4Motion: Scripting physical motions in text-to-video generation via blender-oriented GPT planning. InCVPRW, pages 1430–1440, 2024. 2
work page 2024
-
[24]
Khaled Mamou, E. Lengyel, and A. Peters. V olumetric hier- archical approximate convex decomposition.Game Engine Gems, 3:141–158, 2016. 4
work page 2016
- [25]
-
[26]
Kaichun Mo, Paul Guerrero, Li Yi, Hao Su, Peter Wonka, Niloy J. Mitra, and Leonidas J. Guibas. StructureNet: Hierar- chical graph networks for 3D shape generation.ACM TOG, 38(6), 2019. 1, 2
work page 2019
-
[27]
Maxim Peysakhov and William C. Regli. Using assembly representations to enable evolutionary design of LEGO struc- tures.Artificial Intelligence for Engineering Design, Analysis and Manufacturing, 17(2):155–168, 2003. 2
work page 2003
-
[28]
Generating physically sta- ble and buildable brick structures from text
Ava Pun, Kangle Deng, Ruixuan Liu, Deva Ramanan, Changliu Liu, and Jun-Yan Zhu. Generating physically sta- ble and buildable brick structures from text. InICCV, pages 14798–14809, 2025. 2, 3, 4, 7, 8
work page 2025
-
[29]
CSGNet: Neural shape parser for 9 constructive solid geometry
Gopal Sharma, Rishabh Goyal, Difan Liu, Evangelos Kaloger- akis, and Subhransu Maji. CSGNet: Neural shape parser for 9 constructive solid geometry. InCVPR, pages 5515–5523,
-
[30]
3D-GPT: Procedural 3D model- ing with large language models
Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zis- han Qin, and Stephen Gould. 3D-GPT: Procedural 3D model- ing with large language models. In3DV, pages 1253–1263,
- [31]
-
[32]
Rylee Thompson, Ghalebi Elahe, Terrance DeVries, and Gra- ham W. Taylor. Building LEGO using deep generative models of graphs.Machine Learning for Engineering Modeling, Sim- ulation, and Design Workshop at NeurIPS, 2020. 2, 3
work page 2020
-
[33]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier H´enaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features, 2025. 7
work page 2025
-
[34]
Break and make: Inter- active structural understanding using LEGO bricks
Aaron Walsman, Muru Zhang, Klemen Kotar, Karthik Desingh, Ali Farhadi, and Dieter Fox. Break and make: Inter- active structural understanding using LEGO bricks. InECCV, pages 90–107, Cham, 2022. Springer Nature Switzerland. 2, 3, 4
work page 2022
-
[35]
Learning to build by building your own instructions
Aaron Walsman, Muru Zhang, Adam Fishman, Ali Farhadi, and Dieter Fox. Learning to build by building your own instructions. InECCV, pages 261–278, Cham, 2025. Springer Nature Switzerland. 2, 3
work page 2025
-
[36]
Translating a visual LEGO man- ual to a machine-executable plan
Ruocheng Wang, Yunzhi Zhang, Jiayuan Mao, Chin-Yi Cheng, and Jiajun Wu. Translating a visual LEGO man- ual to a machine-executable plan. InECCV, pages 677–694. Springer, 2022. 2
work page 2022
-
[37]
3D ShapeNets: A deep representation for volumetric shapes
Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Lin- guang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In CVPR, 2015. 7
work page 2015
-
[38]
ULIP: Learning a unified representation of language, images, and point clouds for 3D understanding
Le Xue, Mingfei Gao, Chen Xing, Roberto Mart ´ın-Mart´ın, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. ULIP: Learning a unified representation of language, images, and point clouds for 3D understanding. In CVPR, pages 1179–1189, 2023. 2
work page 2023
-
[39]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
work page 2025
-
[40]
Holodeck: Lan- guage guided generation of 3d embodied ai environments
Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Al- varo Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, and Christopher Clark. Holodeck: Lan- guage guided generation of 3d embodied ai environments. In CVPR, pages 16227–16237, 2024. 2
work page 2024
-
[41]
Tenenbaum, Tianmin Shu, and Chuang Gan
Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, and Chuang Gan. Building cooperative embodied agents modularly with large language models. InICLR, 2024. 2 10
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.