pith. machine review for the scientific record.

arxiv: 2605.07605 · v1 · submitted 2026-05-08 · 💻 cs.RO

Recognition: 2 theorem links · Lean Theorem

BrickCraft: Visuomotor Skill Composition with Situated Manual Guidance for Long-Horizon Interlocking Brick Assembly

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:01 UTC · model grok-4.3

classification 💻 cs.RO
keywords brick assembly · visuomotor skills · skill composition · robot manipulation · long-horizon tasks · situated guidance · compositional generalization
0 comments

The pith

BrickCraft lets robots assemble unseen interlocking brick structures by composing skills anchored to partial builds and guided by live visual projections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

BrickCraft decomposes long-horizon brick assembly into a small set of reusable primitive skills, each defined relative to a reference brick already placed in the growing structure. Situated manuals then overlay the exact placement target directly onto the robot's current camera images, supplying the spatial details needed for the learned visuomotor controllers to act. A chaining pipeline executes these grounded skills in sequence to finish extended tasks. The result is that the system reaches good performance after only a few demonstrations and transfers successfully to brick layouts it never encountered during training.
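The chained execution described above can be sketched in a few lines. Everything here — the `AssemblyStep` fields, the skill dictionary, the `render_manual` callback — is a hypothetical reconstruction of the pipeline's shape, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class AssemblyStep:
    reference_brick: str  # brick already placed in the partial structure
    skill: str            # one of the small library of primitive skills
    tau: tuple            # relative placement w.r.t. the reference brick

def execute_plan(plan: List[AssemblyStep],
                 skills: Dict[str, Callable],
                 render_manual: Callable) -> int:
    """Chain grounded skills: each step overlays its target onto the current
    observation (the situated manual), then runs the matching visuomotor
    skill; a long-horizon run halts at the first failed step."""
    completed = 0
    for step in plan:
        manual = render_manual(step)  # project assembly intent onto observation
        if not skills[step.skill](manual, step.tau):
            break
        completed += 1
    return completed
```

Because every step is anchored to a brick already in the partial build, the same handful of skills can, in principle, be re-chained for structures never seen in training.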

Core claim

BrickCraft models the assembly process using a relative formulation, where each step is anchored to a reference brick within the partial structure, thereby decomposing complex tasks into a finite set of reusable primitive skills. It bridges the gap between high-level assembly plans and physical execution through situated manuals, which provide explicit spatial guidance for learned visuomotor skills by projecting the assembly intent onto real-time robot observations. Finally, BrickCraft employs a compositional execution pipeline that chains these spatially grounded skills to accomplish long-horizon assembly tasks, acquiring proficient skills from limited demonstrations and generalizing to unseen structures.

What carries the argument

Situated manuals that project assembly intent onto real-time robot camera observations to give explicit spatial targets to the learned visuomotor primitive skills.
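As a toy illustration of what projecting intent onto observations could look like at the pixel level, the sketch below dims everything outside a task-relevant mask — loosely inspired by the background dimming strategy the paper ablates. The function name and parameters are invented for illustration; the paper's actual rendering pipeline (simulation rendering plus SAM 2 tracking) is more involved.

```python
import numpy as np

def situate(image: np.ndarray, mask: np.ndarray, dim: float = 0.3) -> np.ndarray:
    """Overlay sketch: keep pixels inside the task-relevant mask at full
    brightness and scale everything else by `dim`, so the placement target
    stands out in the robot's current camera image."""
    out = image.astype(np.float32)
    out[~mask] *= dim          # dim the background, leave the target intact
    return out.astype(image.dtype)
```

The point of such an overlay is that the visuomotor policy receives its spatial target in the same frame it observes, rather than as a separate symbolic instruction.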

If this is right

  • Long-horizon assembly tasks become feasible by linking a small library of primitive skills instead of learning each sequence from scratch.
  • High-level plans translate directly into executable motions once the manuals supply the missing spatial details.
  • Performance remains high after training on only a limited number of demonstrations.
  • The same learned skills transfer to brick arrangements that differ from all training examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The relative anchoring to existing bricks could reduce sensitivity to global coordinate drift in other sequential manipulation settings.
  • If the manual projection works under varied lighting, the same interface might support human-in-the-loop corrections during execution.

Load-bearing premise

Projecting assembly intent onto real-time robot observations through situated manuals supplies sufficient and reliable spatial guidance for successful physical execution across varied structures and lighting conditions.

What would settle it

Repeated physical failures when the robot attempts an unseen structure under altered lighting or with a slightly different partial build, even though the high-level plan and skill sequence are correct.

Figures

Figures reproduced from arXiv: 2605.07605 by Bowei Li, Changliu Liu, Chuxiong Hu, Guanxing Lu, Jichuan Yu, Ruixuan Liu, Zhenran Tang.

Figure 1. Overview of BRICKCRAFT. BRICKCRAFT transforms a digital design into a physical product through three phases: (i) Skill-Oriented Assembly Reasoning decomposes the long-horizon task into steps anchored to reference bricks and maps them to reusable primitive skills; (ii) Assembly Intent Grounding generates situated manuals to provide spatial guidance; and (iii) Compositional Visuomotor Execution chains visuom…
Figure 2. Geometric task encoding. The 4D vector τ parameterizes the relative spatial relationship between the target brick b_tgt and the reference brick b_ref. … assembly intents onto robot observations, forming situated manuals to establish spatial guidance; and (iii) Compositional Visuomotor Execution, which performs the assembly plan by composing reusable visuomotor skills under the guidance of situated manuals. Th…
Figure 3. Situated manual-guided visuomotor assembly. (a) Assembly Intent Grounding: Symbolic assembly plans are rendered into visual references in simulation and aligned with real-world observations I_ws to extract task-relevant entity masks. These masks are tracked via SAM 2 [27] and overlaid onto real-time observations to yield the situated manual. (b) Visuomotor Skill Execution: We formulate the assembly skill as…
Figure 4. Evaluation of visuomotor assembly skills. We evaluate three distinct primitive skills, testing each skill across 8 seen and 8 unseen structural configurations, with 5 independent trials per structure. (a) Demonstrations of visuomotor assembly on diverse structures; (b) Success rate comparison (40 trials per bar). BRICKCRAFT consistently outperforms the baselines, demonstrating robust generalization to unse…
Figure 5. EigenCAM [30] heatmaps for (a) the GI-DP baseline and (b) BRICKCRAFT with situated manual guidance. … Skill Performance: We evaluate the single-step assembly performance across three distinct primitive skills, as illustrated in …
Figure 8. Typical failure modes in robotic brick assembly. … In particular, the substantial progress achieved on Castle, an entirely unseen structure during training, validates the compositional generalization of BRICKCRAFT. As expected, system performance inherently scales with structural difficulty. While fully-supported architectures like the Pyramid and Stairs achieve near-perfect execution, completion rates on the House …
Figure 7. Evaluation of long-horizon assembly. (a) Diverse structural designs for evaluation. (b) Completion rate for each task. Results show the mean and standard deviation across 6 independent trials. … conduct an ablation study comparing two different visual prompting methods: our background dimming strategy and a bounding box alternative, evaluated on the skill τ = [0, 0, 1, 0]. As shown in …
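Figure 2's 4D task vector τ can be illustrated with a minimal encoder. The specific component layout — stud-grid offsets plus a quarter-turn rotation index — is an assumption made for this sketch, chosen so that τ = [0, 0, 1, 0] reads as "place one unit above the reference with no rotation".

```python
# Hypothetical geometric task encoding: τ captures the target brick's pose
# relative to the reference brick, so learned skills are indexed by relative
# placement rather than absolute workspace coordinates. The (dx, dy, dz, rot)
# layout is assumed here, not taken from the paper's specification.
def encode_tau(ref_pos, tgt_pos, quarter_turns=0):
    """Return the 4D relative placement vector τ = [dx, dy, dz, rot]."""
    dx, dy, dz = (t - r for t, r in zip(tgt_pos, ref_pos))
    return [dx, dy, dz, quarter_turns % 4]
```

Anchoring τ to the reference brick is what makes the skill library finite: the same vector recurs across many absolute positions in the growing structure.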
read the original abstract

Autonomous robotic assembly of interlocking bricks demands seamless integration of long-horizon task reasoning, spatial grounding, and fine-grained manipulation. This paper presents BrickCraft, a compositional framework designed for long-horizon and generalizable interlocking brick assembly. BrickCraft models the assembly process using a relative formulation, where each step is anchored to a reference brick within the partial structure, thereby decomposing complex tasks into a finite set of reusable primitive skills. BrickCraft bridges the gap between high-level assembly plans and physical execution through situated manuals, which provide explicit spatial guidance for learned visuomotor skills by projecting the assembly intent onto real-time robot observations. Finally, BrickCraft employs a compositional execution pipeline that chains these spatially grounded skills to accomplish long-horizon assembly tasks. Extensive experimental validations demonstrate that BrickCraft acquires proficient assembly skills from a limited set of demonstrations and exhibits strong compositional generalization to unseen structures. The project website is available at https://intelligent-control-lab.github.io/BrickCraft.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper presents BrickCraft, a compositional framework for long-horizon interlocking brick assembly. It models the process with a relative formulation that anchors each step to a reference brick in the partial structure, decomposing tasks into reusable primitive skills. Situated manuals project assembly intent onto real-time robot observations to supply spatial guidance for learned visuomotor skills. These are chained via a compositional execution pipeline. The central claim is that the system acquires proficient assembly skills from a limited set of demonstrations and exhibits strong compositional generalization to unseen structures, as shown by extensive experimental validations on physical robots.

Significance. If the reported results hold, the work is significant for robotic assembly and manipulation. The relative formulation combined with situated-manual projection offers a practical way to ground high-level plans in low-level control, enabling skill reuse and generalization with few demonstrations. This addresses key challenges in long-horizon tasks and could reduce data requirements in real-world deployment. The real-robot focus on interlocking bricks adds direct applicability.

minor comments (1)
  1. [Abstract] The claim of 'extensive experimental validations' demonstrating 'proficient skills and strong generalization' would be strengthened by including at least one or two key quantitative results (e.g., success rates, number of structures tested, or comparison metrics) rather than leaving all details to the body.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of BrickCraft and the recommendation for minor revision. The assessment correctly identifies the core contributions of the relative anchoring formulation, situated manual projection for spatial guidance, and the compositional execution pipeline, along with the emphasis on limited demonstrations and generalization to unseen structures. No specific major comments were listed in the provided report.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper defines BrickCraft via a relative formulation that decomposes tasks into reusable primitives, situated manuals for spatial projection, and a compositional pipeline. These are presented as design choices, not derived from the target performance metrics. Experimental claims of limited-demonstration proficiency and generalization to unseen structures rest on reported real-robot trials rather than any self-referential definition, fitted parameter renamed as prediction, or self-citation chain that collapses the central result. No equations or steps reduce by construction to their inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that complex assembly tasks can be reliably decomposed into a finite set of reusable primitive skills anchored to reference bricks in the partial structure; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Complex long-horizon assembly tasks can be decomposed into a finite set of reusable primitive skills using relative anchoring to a reference brick in the partial structure.
    This decomposition is presented as the core modeling choice that enables composition and generalization.

pith-pipeline@v0.9.0 · 5490 in / 1323 out tokens · 35964 ms · 2026-05-11T02:01:13.164273+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 3 internal anchors

  1. [1]

    Prompt-to-product: Generative assembly via bimanual manipulation,

    R. Liu, P. Huang, A. Pun, K. Deng, S. Aggarwal, Z. Tang, M. Liu, D. Ramanan, J.-Y. Zhu, J. Li, and C. Liu, “Prompt-to-product: Generative assembly via bimanual manipulation,” IEEE Robotics and Automation Magazine, 2026

  2. [2]

    Fabrica: Dual-arm assembly of general multi-part objects via integrated planning and learning,

    Y. Tian, J. Jacob, Y. Huang, J. Zhao, E. L. Gu, P. Ma, A. Zhang, F. Javid, B. Romero, S. Chitta, S. Sueda, H. Li, and W. Matusik, “Fabrica: Dual-arm assembly of general multi-part objects via integrated planning and learning,” in 9th Annual Conference on Robot Learning, 2025

  3. [3]

    π0: A Vision-Language-Action Flow Model for General Robot Control

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter et al., “π0: A vision-language-action flow model for general robot control,” arXiv preprint arXiv:2410.24164, 2024

  4. [4]

    π0.5: A vision-language-action model with open-world generalization,

    K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y. Galliker et al., “π0.5: A vision-language-action model with open-world generalization,” in 9th Annual Conference on Robot Learning, 2025

  5. [5]

    Manual2Skill: Learning to read manuals and acquire robotic skills for furniture assembly using vision-language models,

    C. Tie, S. Sun, J. Zhu, Y. Liu, J. Guo, Y. Hu, H. Chen, J. Chen, R. Wu, and L. Shao, “Manual2Skill: Learning to read manuals and acquire robotic skills for furniture assembly using vision-language models,” arXiv preprint arXiv:2502.10090, 2025

  6. [6]

    Query-centric diffusion policy for generalizable robotic assembly,

    Z. Xu, H. Lin, S. Liu, and D. Zhao, “Query-centric diffusion policy for generalizable robotic assembly,” arXiv preprint arXiv:2509.18686, 2025

  7. [7]

    ARCH: Hierarchical hybrid learning for long-horizon contact-rich robotic assembly,

    J. Sun, A. Curtis, Y. You, Y. Xu, M. Koehle, Q. Chen, S. Huang, L. Guibas, S. Chitta, M. Schwager et al., “ARCH: Hierarchical hybrid learning for long-horizon contact-rich robotic assembly,” arXiv preprint arXiv:2409.16451, 2024

  8. [8]

    ManualVLA: A unified VLA model for chain-of-thought manual generation and robotic manipulation,

    C. Gu, J. Liu, H. Chen, R. Huang, Q. Wuwu, Z. Liu, X. Li, Y. Li, R. Zhang, P. Jia et al., “ManualVLA: A unified VLA model for chain-of-thought manual generation and robotic manipulation,” arXiv preprint arXiv:2512.02013, 2025

  9. [9]

    Assemble them all: Physics-based planning for generalizable assembly by disassembly,

    Y. Tian, J. Xu, Y. Li, J. Luo, S. Sueda, H. Li, K. D. Willis, and W. Matusik, “Assemble them all: Physics-based planning for generalizable assembly by disassembly,” ACM Transactions on Graphics (TOG), vol. 41, no. 6, pp. 1–11, 2022

  10. [10]

    APEX-MR: Multi-robot asynchronous planning and execution for cooperative assembly,

    P. Huang, R. Liu, S. Aggarwal, C. Liu, and J. Li, “APEX-MR: Multi-robot asynchronous planning and execution for cooperative assembly,” arXiv preprint arXiv:2503.15836, 2025

  11. [11]

    From bricks to bots: Automated collision-aware sequence planning for lego reconstruction with a two-finger gripper,

    A. Barghi, N. Pasiar, M. T. Masouleh, and A. Kalhor, “From bricks to bots: Automated collision-aware sequence planning for lego reconstruction with a two-finger gripper,” in 2024 10th International Conference on Control, Instrumentation and Automation (ICCIA). IEEE, 2024, pp. 1–6

  12. [12]

    A lightweight and transferable design for robust lego manipulation,

    R. Liu, Y. Sun, and C. Liu, “A lightweight and transferable design for robust lego manipulation,” in International Symposium on Flexible Automation, vol. 87882. American Society of Mechanical Engineers, 2024, p. V001T07A004

  13. [13]

    Eye-in-finger: Smart fingers for delicate assembly and disassembly of lego,

    Z. Tang, R. Liu, and C. Liu, “Eye-in-finger: Smart fingers for delicate assembly and disassembly of lego,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 20990–20996

  14. [14]

    A learning framework for high precision industrial assembly,

    Y. Fan, J. Luo, and M. Tomizuka, “A learning framework for high precision industrial assembly,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 811–817

  15. [15]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    T. Z. Zhao, V. Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,” arXiv preprint arXiv:2304.13705, 2023

  16. [16]

    Diffusion policy: Visuomotor policy learning via action diffusion,

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025

  17. [17]

    Goal-conditioned imitation learning using score-based diffusion policies

    M. Reuss, M. Li, X. Jia, and R. Lioutikov, “Goal-conditioned imitation learning using score-based diffusion policies,” arXiv preprint arXiv:2304.02532, 2023

  18. [18]

    RT-2: Vision-language-action models transfer web knowledge to robotic control,

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid et al., “RT-2: Vision-language-action models transfer web knowledge to robotic control,” in Conference on Robot Learning. PMLR, 2023, pp. 2165–2183

  19. [19]

    CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models,

    Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn et al., “CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 1702–1713

  20. [20]

    BrickSim: A physics-based simulator for manipulating interlocking brick assemblies,

    H. Wen, R. Liu, W. Piao, S. Li, and C. Liu, “BrickSim: A physics-based simulator for manipulating interlocking brick assemblies,” arXiv preprint arXiv:2603.16853, 2026

  21. [21]

    StableLego: Stability analysis of block stacking assembly,

    R. Liu, K. Deng, Z. Wang, and C. Liu, “StableLego: Stability analysis of block stacking assembly,” IEEE Robotics and Automation Letters, vol. 9, no. 11, pp. 9383–9390, 2024

  22. [22]

    Autonomous integration and improvement of robotic assembly using skill graph representations,

    P. Yu, P. Huang, C. Chawla, G. Shi, J. Li, and C. Liu, “Autonomous integration and improvement of robotic assembly using skill graph representations,” arXiv preprint arXiv:2603.12649, 2026

  23. [23]

    Mosaic: A skill-centric algorithmic framework for long-horizon manipulation planning,

    I. Mishani, Y. Shaoul, and M. Likhachev, “Mosaic: A skill-centric algorithmic framework for long-horizon manipulation planning,” arXiv preprint arXiv:2504.16738, 2025

  24. [24]

    Physics-aware combinatorial assembly sequence planning using data-free action masking,

    R. Liu, A. Chen, W. Zhao, and C. Liu, “Physics-aware combinatorial assembly sequence planning using data-free action masking,” IEEE Robotics and Automation Letters, 2025

  25. [25]

    SkillDiffuser: Interpretable hierarchical planning via skill abstractions in diffusion-based task execution,

    Z. Liang, Y. Mu, H. Ma, M. Tomizuka, M. Ding, and P. Luo, “SkillDiffuser: Interpretable hierarchical planning via skill abstractions in diffusion-based task execution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 16467–16476

  26. [26]

    Blender: A 3D modelling and rendering package

    Blender Online Community, Blender: A 3D modelling and rendering package, Blender Foundation, Stichting Blender Foundation, Amsterdam, 2026. [Online]. Available: http://www.blender.org

  27. [27]

    SAM 2: Segment Anything in Images and Videos

    N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer, “SAM 2: Segment anything in images and videos,” arXiv preprint arXiv:2408.00714, 2024

  28. [28]

    YOLOv8: A novel object detection algorithm with enhanced performance and robustness,

    R. Varghese and M. Sambath, “YOLOv8: A novel object detection algorithm with enhanced performance and robustness,” in 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS). IEEE, 2024, pp. 1–6

  29. [29]

    Parametric image alignment using enhanced correlation coefficient maximization,

    G. D. Evangelidis and E. Z. Psarakis, “Parametric image alignment using enhanced correlation coefficient maximization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 10, pp. 1858–1865, 2008

  30. [30]

    Eigen-CAM: Class activation map using principal components,

    M. B. Muhammad and M. Yeasin, “Eigen-CAM: Class activation map using principal components,” in 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020, pp. 1–7

  31. [31]

    Diffusion policy policy optimization,

    A. Z. Ren, J. Lidard, L. L. Ankile, A. Simeonov, P. Agrawal, A. Majumdar, B. Burchfiel, H. Dai, and M. Simchowitz, “Diffusion policy policy optimization,” arXiv preprint arXiv:2409.00588, 2024