2.5-D Decomposition for LLM-Based Spatial Construction

Li-Jen Chen; Paul Whitten; Sharath Baddam

arxiv: 2605.07066 · v2 · pith:VIAW4PMPnew · submitted 2026-05-08 · 💻 cs.AI

Pith reviewed 2026-05-20 23:48 UTC · model grok-4.3

classification 💻 cs.AI

keywords 2.5-D decomposition · spatial construction · neuro-symbolic pipeline · LLM coordinate errors · column occupancy · Build What I Mean benchmark · autonomous assembly · edge hardware transfer

The pith

Decomposing spatial construction into 2D LLM planning plus deterministic vertical rules from column occupancy eliminates a major class of coordinate errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that language models generate reliable three-dimensional block structures from natural-language instructions once the task is split so that the model plans only the horizontal plane and a simple executor derives every height from which columns are already occupied. This 2.5-D decomposition removes the need for the model to output vertical coordinates directly, cutting the systematic placement mistakes that appear in full 3-D generation. On the Build What I Mean benchmark the pipeline reaches 94.6 percent mean structural accuracy across 12 independent runs with GPT-4o-mini, within 3.0 points of the 97.6 percent ceiling set by architect errors alone. A controlled ablation attributes most of the gain to the decomposition, and the same method runs without prompt changes on local hardware and transfers to the IGLU collaborative building tasks.

Core claim

The authors introduce a neuro-symbolic pipeline that performs 2.5-D decomposition: the LLM outputs only two-dimensional horizontal placements while a deterministic component computes every vertical coordinate from the current column-occupancy state. This separation eliminates an entire class of three-dimensional coordinate errors. On the Build What I Mean benchmark the method reaches 94.6 percent mean structural accuracy with GPT-4o-mini, within 3.0 points of the 97.6 percent limit imposed by architect-agent mistakes, and outperforms both full 3-D generation by GPT-4o and prior competing systems; ablation confirms the decomposition drives 50.7 percentage points of the improvement.

What carries the argument

2.5-D decomposition, in which the language model plans only the horizontal plane while a deterministic executor derives vertical placement solely from column occupancy.
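
A minimal sketch of how such an executor might work. It assumes a hypothetical plan format in which the planner emits `(x, z, color)` triples and `y` is the vertical axis; the names `execute_plan` and `MAX_HEIGHT` are illustrative and are not taken from the paper.

```python
from collections import Counter

MAX_HEIGHT = 5  # assumed number of vertical levels (y in 0..4)

def execute_plan(plan, existing=()):
    """Turn 2D placements (x, z, color) into 3D blocks (x, y, z, color).

    The planner never supplies y. It is derived from how many blocks
    already occupy column (x, z), which assumes columns have no gaps.
    """
    heights = Counter((x, z) for x, _y, z, _color in existing)
    placed = []
    for x, z, color in plan:
        y = heights[(x, z)]
        if y >= MAX_HEIGHT:
            raise ValueError(f"column ({x}, {z}) is full")
        placed.append((x, y, z, color))
        heights[(x, z)] += 1
    return placed

# Two green blocks stacked in one column, one red block beside them.
plan = [(0, 0, "green"), (0, 0, "green"), (1, 0, "red")]
blocks = execute_plan(plan)
```

Because the executor is the only component that writes `y`, a planner mistake can misplace a block horizontally but cannot place it at the wrong height within its column.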

Load-bearing premise

Vertical placement decisions in the target tasks are fully and correctly determined by column occupancy alone, with no information loss from removing the LLM from that dimension.

What would settle it

A new set of building instructions in which correct vertical placement requires information beyond column occupancy, such as stability constraints or overhang rules not encoded in occupancy, would produce a sharp drop in structural accuracy if the decomposition assumption fails.

Figures

Figures reproduced from arXiv: 2605.07066 by Li-Jen Chen, Paul Whitten, Sharath Baddam.

**Figure 1.** A benchmark round requiring T-shape recognition. Instruction: “Keeping the T shape, extend the existing green structure by adding two green blocks…” (caption truncated at source)

**Figure 2.** 2.5-D decomposition: the LLM planner generates 2D plans with… (caption truncated at source)

Original abstract

Autonomous systems that build structures from natural-language instructions need reliable spatial reasoning, yet large language models (LLMs) make systematic coordinate errors when generating three-dimensional block placements. We present a neuro-symbolic pipeline based on 2.5-D decomposition: the LLM plans in the two-dimensional horizontal plane while a deterministic executor computes all vertical placement from column occupancy, eliminating an entire class of errors. On the Build What I Mean benchmark (160 rounds), GPT-4o-mini with this pipeline achieves 94.6% mean structural accuracy across 12 independent runs, within 3.0 percentage points of the 97.6% ceiling imposed by architect-agent errors that no builder-side improvement can address. This outperforms both GPT-4o at 90.3% and the best competing system at 76.3%. A controlled ablation confirms that 2.5-D decomposition is the dominant contributor, accounting for 50.7 percentage points of accuracy. The pipeline transfers directly to edge hardware: Nemotron-3 120B running locally on an NVIDIA Jetson Thor AGX matches the cloud result at 94.5% with no prompt modifications. The underlying principle, removing deterministic dimensions from the LLM's output space, applies to any autonomous construction or assembly task where gravity or other physical constraints fix one or more degrees of freedom. A transfer experiment on 500 IGLU collaborative building tasks confirms the effect generalizes beyond the primary benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The 2.5-D split gives a clear accuracy lift by handing vertical placement to a simple occupancy rule, but the size of the gain likely depends on how many benchmark tasks actually need independent height choices.

The main thing here is that splitting spatial construction so the LLM only plans the horizontal plane while a deterministic executor sets heights from column occupancy removes a big source of LLM coordinate errors and produces a 50.7-point accuracy jump on the Build What I Mean benchmark. GPT-4o-mini reaches 94.6% mean structural accuracy across 12 runs, close to the 97.6% ceiling set by architect mistakes, and the same pipeline runs at 94.5% on edge hardware with Nemotron-3. A transfer test on 500 IGLU tasks adds some support that the pattern holds elsewhere.

The paper does well by running a controlled ablation that pins most of the improvement on the decomposition itself rather than on model size or prompting tricks, and by showing the edge result without any prompt changes. That makes the practical takeaway easy to test.

The soft spot is the vertical assumption. The executor only inspects occupancy, so it works cleanly only when the correct height for each block is a unique function of what is already in that column. If any of the 160 rounds require leaving a gap, building a cantilever, or matching a non-grounded height reference, then either the method fails on them or those cases are missing from the benchmark. The abstract does not break down task types on this point, so it is hard to tell how much the headline numbers reflect general robustness versus benchmark design. Minor gaps include missing error bars and limited dataset detail in the summary.

This paper is for people working on LLM agents for robotics or embodied construction who want a concrete way to reduce spatial errors. A reader already building similar pipelines would find the numbers and the edge transfer directly usable for comparison. The empirical claims are specific enough to deserve a serious referee even if the generality needs tightening in revision.

Referee Report

1 major / 2 minor

Summary. The paper introduces a neuro-symbolic 2.5-D decomposition pipeline for LLM-based spatial construction: the LLM generates 2D horizontal block placements while a deterministic executor derives all vertical coordinates from column occupancy alone. On the Build What I Mean benchmark (160 rounds), GPT-4o-mini with this pipeline reaches 94.6% mean structural accuracy (12 runs), within 3 pp of the 97.6% architect-agent ceiling; this outperforms GPT-4o (90.3%) and the best baseline (76.3%). An ablation attributes 50.7 pp of the gain to the decomposition. The approach transfers without modification to Nemotron-3 120B on NVIDIA Jetson Thor AGX (94.5%) and generalizes to 500 IGLU collaborative tasks.

Significance. If the central empirical claims hold, the work provides concrete evidence that removing deterministic degrees of freedom from LLM output spaces can eliminate an entire class of coordinate errors in autonomous construction. The large ablation effect, direct edge-hardware transfer, and cross-benchmark generalization are strengths that would make the result useful for any assembly task where gravity or similar constraints fix one axis. The approach is simple enough to be adopted quickly while still being grounded in the physical structure of the problem.

major comments (1)

[Benchmark description and ablation study] The headline accuracy (94.6% vs. 97.6% ceiling) and the +50.7 pp ablation gain rest on the assumption that every target structure in the 160-round Build What I Mean benchmark has a unique, deterministic vertical coordinate for each block given only the occupancy of its (x,y) column. The manuscript does not report an explicit audit of the benchmark tasks confirming the absence of structures that require non-unique height decisions (gaps, cantilevers, or matching non-grounded references). Without this verification, the reported improvement could partly reflect task selection rather than a general elimination of coordinate errors.

minor comments (2)

[Results] The abstract and results section report mean accuracy but omit per-run standard deviations or confidence intervals, making it difficult to assess the stability of the 94.6% figure across the 12 independent runs.
[Experimental setup] Dataset details for the Build What I Mean benchmark (task generation procedure, exact distribution of structure types, and how the 97.6% architect-agent ceiling was measured) are referenced but not fully specified, which limits reproducibility.
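
The statistics requested in the first minor comment could be reported along these lines. This is a sketch only: the per-run values below are hypothetical placeholders, not the paper's data, and `summarize` is an illustrative helper name.

```python
import math
import statistics

# Hypothetical per-run accuracies (percent) for 12 runs; NOT the paper's data.
runs = [94.1, 95.0, 94.8, 94.3, 94.9, 94.6, 94.2, 95.1, 94.7, 94.5, 94.4, 94.6]

def summarize(values, t_crit=2.201):
    """Return (mean, sample std, 95% CI half-width).

    t_crit=2.201 is the two-sided 95% Student-t critical value for
    11 degrees of freedom (n = 12); change it for other sample sizes.
    """
    n = len(values)
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    half_width = t_crit * std / math.sqrt(n)
    return mean, std, half_width

mean, std, half_width = summarize(runs)
print(f"{mean:.1f}% +/- {half_width:.1f} (std {std:.2f}, n={len(runs)})")
```

With only 12 runs, a Student-t interval is a more cautious choice than a normal approximation, which is why the critical value is passed in explicitly.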

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the significance of the 2.5-D decomposition approach, including its ablation results, hardware transfer, and generalization. We address the major comment below.

Referee: [Benchmark description and ablation study] The headline accuracy (94.6% vs. 97.6% ceiling) and the +50.7 pp ablation gain rest on the assumption that every target structure in the 160-round Build What I Mean benchmark has a unique, deterministic vertical coordinate for each block given only the occupancy of its (x,y) column. The manuscript does not report an explicit audit of the benchmark tasks confirming the absence of structures that require non-unique height decisions (gaps, cantilevers, or matching non-grounded references). Without this verification, the reported improvement could partly reflect task selection rather than a general elimination of coordinate errors.

Authors: We agree that an explicit audit would strengthen the presentation. The Build What I Mean benchmark consists exclusively of grounded, column-stacking structures in which every block rests on either the ground or another block in the same (x, y) column; the benchmark definition excludes gaps, cantilevers, and non-grounded references, so vertical coordinates are uniquely determined by column occupancy. We acknowledge, however, that the current manuscript does not include a dedicated verification step. In the revised version we will add a short subsection (and supporting appendix) that describes the benchmark construction rules and reports the results of a complete audit of all 160 tasks, confirming the absence of the structures mentioned by the referee. This addition will make clear that the observed gains are attributable to the decomposition rather than task selection.

revision: yes
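
An audit of this kind could be as simple as checking that every block in each target structure has support directly beneath it. The sketch below assumes target structures are available as sets of `(x, y, z)` cells with `y` as the vertical axis (following the notation quoted later on this page); the task data and the function names `ungrounded_blocks` and `audit` are hypothetical.

```python
def ungrounded_blocks(structure):
    """Return blocks with no support directly beneath them in the same column.

    `structure` is an iterable of (x, y, z) cells with y as the vertical axis;
    y == 0 means the block rests on the ground.
    """
    cells = set(structure)
    return sorted(
        (x, y, z) for x, y, z in cells
        if y > 0 and (x, y - 1, z) not in cells
    )

def audit(tasks):
    """Map task id -> offending blocks, listing only tasks that fail the check."""
    report = {}
    for task_id, structure in tasks.items():
        bad = ungrounded_blocks(structure)
        if bad:
            report[task_id] = bad
    return report

# Hypothetical task data for illustration only.
tasks = {
    "tower": [(0, 0, 0), (0, 1, 0), (0, 2, 0)],
    "bridge": [(0, 0, 0), (0, 1, 0), (1, 1, 0), (2, 0, 0), (2, 1, 0)],
}
print(audit(tasks))
```

An empty report over all 160 rounds would support the claim that heights are determined by column occupancy; any flagged task (such as the bridge span here) would mark a case the occupancy rule cannot reproduce.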

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with ablation

The paper presents a neuro-symbolic pipeline and reports direct empirical measurements (94.6% structural accuracy on 160-round Build What I Mean benchmark, 50.7 pp ablation gain, transfer to local hardware). These are observed outcomes from running the described 2.5-D decomposition on fixed tasks, not quantities derived by fitting parameters to the target metric or by self-referential equations. No load-bearing step reduces to a definition, prior self-citation, or ansatz that is equivalent to the claimed result by construction. The central claim remains an independent empirical finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claim rests on the domain assumption that construction tasks permit clean separation of horizontal planning from vertical execution without loss of required information.

axioms (1)

domain assumption: Vertical placements are fully determined by column occupancy under gravity or similar physical constraints.
This premise enables the deterministic executor and is invoked to justify removing the vertical dimension from the LLM's output space.

pith-pipeline@v0.9.0 · 5796 in / 1290 out tokens · 27889 ms · 2026-05-20T23:48:15.230421+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: “the vertical coordinate of any new block is: y*(x, z, G) = min{y ∈ {0, …, 4} | (x, y, z) ∉ dom(G)}. This reduces the LLM’s output space from |G| × |C| to |{0, …, 8}|^2 × |C|, eliminating y-coordinate errors entirely.”

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: “We observe that many construction domains exhibit a 2.5-D structure: one or more output dimensions are not free variables but deterministic functions of the others and the current state. In gravity-constrained block construction, the vertical coordinate of any new block is fully determined by the column occupancy below it.”
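
The height rule in the first quoted passage, and the size of the reduction it describes, can be checked with a few lines. This sketch assumes that `z` ranges over the same `{0, …, 8}` as `x` and that |G| counts all cells of the 9 × 5 × 9 grid; both are readings of the quoted passage rather than statements from the paper, and `y_star` is an illustrative name.

```python
X_RANGE = range(9)  # x in {0, ..., 8}, from the quoted passage
Z_RANGE = range(9)  # z assumed to share the same range as x
Y_RANGE = range(5)  # y in {0, ..., 4}, from the quoted passage

def y_star(x, z, grid):
    """Lowest y in Y_RANGE with (x, y, z) not in grid; None if the column is full."""
    return next((y for y in Y_RANGE if (x, y, z) not in grid), None)

# grid maps occupied cells (x, y, z) to a color, so its keys play the role of dom(G).
grid = {(3, 0, 4): "green", (3, 1, 4): "green"}

cells_3d = len(X_RANGE) * len(Y_RANGE) * len(Z_RANGE)  # positions a full 3-D planner chooses among
cells_2d = len(X_RANGE) * len(Z_RANGE)                 # positions left once y is removed
```

Under these assumptions the planner chooses among 81 columns instead of 405 cells per color, a fivefold reduction, and the removed choices are exactly the ones a height error would come from.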

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 4 internal anchors

[1] Y. Yamada, Y. Bao, A. K. Lampinen, J. Kasai, and I. Yildirim, “Evaluating spatial understanding of large language models,” Transactions on Machine Learning Research, 2024. [Online]. Available: https://openreview.net/forum?id=xkiflfKCw3
[2] Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung et al., “A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity,” in Proceedings of the 13th international joint conference on natural language processing and the 3rd conference of the asia-pacific chapter of the ...
[3] Y. LeCun et al., “A path towards autonomous machine intelligence version 0.9.2, 2022-06-27,” Tech. Rep. 1, 2022
[4] L. Wang, W. Xu, Y. Lan, Z. Hu, Y. Lan, R. K.-W. Lee, and E.-P. Lim, “Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models,” in Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), 2023, pp. 2609–2634
[5] T. Khot, H. Trivedi, M. Finlayson, Y. Fu, K. Richardson, P. Clark, and A. Sabharwal, “Decomposed prompting: A modular approach for solving complex tasks,” in International Conference on Learning Representations (ICLR), 2023
[6] K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli, and J. Tenenbaum, “Neural-symbolic VQA: Disentangling reasoning from vision and language understanding,” vol. 31, 2018
[7] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, K.-H. Lee, S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor, J. Quiambao, K. Rao, J. Rettinghou... Title: “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances,” arXiv, 2023
[8] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, “Code as policies: Language model programs for embodied control,” in 2023 IEEE International conference on robotics and automation (ICRA). IEEE, 2023, pp. 9493–9500
[9] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, N. Brown, T. Jackson, L. Luu, S. Levine, K. Hausman, and B. Ichter, “Inner monologue: Embodied reasoning through planning with language models,” in Proc. 6th Conf. Robot Learning (CoRL), ser. PMLR, vol. 205, 2023, pp. 1769–1782. [Online]....
[10] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, “VOYAGER: An open-ended embodied agent with large language models,” arXiv:2305.16291, 2023
[11] X. Zhu, Y. Chen, H. Tian, C. Tao, W. Su, C. Yang, G. Huang, B. Li, L. Lu, X. Wang et al., “Ghost in the Minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory,” arXiv preprint arXiv:2305.17144, 2023
[12] UvA LTL, “Build what I mean,” https://github.com/ltl-uva/build what i mean, 2026
[13] D. Marr, Vision: A computational investigation into the human representation and processing of visual information. MIT Press, 2010
[14] M. Held, On the computational geometry of pocket machining. Springer Science & Business Media, 1991, vol. 500
[15] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” vol. 35, 2022, pp. 24824–24837
[16] W. M. McKeeman, “Peephole optimization,” Commun. ACM, vol. 8, no. 7, pp. 443–444, 1965
[17] NVIDIA, “Nemotron-3-super-120b-a12b,” https://huggingface.co/nvidia/Nemotron-3-Super-120B-A12B-NVFP4, 2024
[18] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with PagedAttention,” in Proceedings of the 29th symposium on operating systems principles, 2023, pp. 611–626
[19] NVIDIA, “Jetson Thor,” https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-thor/, 2025
[20] L. R. Dice, “Measures of the amount of ecologic association between species,” Ecology, vol. 26, no. 3, pp. 297–302, 1945
[21] CdavM, “build what i mean baseline purple,” https://agentbeats.dev/CdavM/build-what-i-mean-baseline-purple, 2026, original purple agent from the BWIM benchmark
[22] D. S. S. Higuera, S. A. R. Mahecha, J. A. H. Garcia, and A. F. G. Sanchez, “Purple builder agent: BWIM competition agent,” https://github.com/hisandan/Purple-Agent-Beats-build-what-i-mean, 2026, team Manada Werewolve, AgentBeats Phase 2
[23] J. Kiseleva, Z. Li, M. Aliannejadi, S. Mohanty, M. ter Hoeve, M. Burtsev, A. Skrynnik, A. Zholus, A. Panov, K. Srinet et al., “Interactive grounded language understanding in a collaborative environment: IGLU 2021,” in NeurIPS 2021 Competitions and Demonstrations Track. PMLR, 2022, pp. 146–161
