Automating the Design of Embodied Agent Architectures

Gengze Zhou; Jian Zhou; Jin Li; Qi Wu; Shuai Fu; Sihao Lin

arXiv: 2606.30111 · v1 · pith:NWOITNOZnew · submitted 2026-06-29 · 💻 cs.RO · cs.AI · cs.LG

Jian Zhou, Sihao Lin, Jin Li, Shuai Fu, Gengze Zhou, Qi Wu

Pith reviewed 2026-06-30 05:40 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG

keywords embodied agents · architecture search · AgentCanvas · KDLoop · vision-language navigation · embodied question answering · language-conditioned manipulation · simulator rollouts

The pith

Architecture search produces directional success-rate gains for embodied agents on simulator tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Embodied agents are built by hand as compositions of perception, memory, planning, and action modules, leaving a large design space that researchers normally navigate by intuition. The paper tests whether automated architecture search can improve those designs when the agents must handle visual inputs and act in simulators. It supplies a graph-based runtime for running candidate designs and a loop that proposes, critiques, tests, and refines them through repeated simulator episodes. Across navigation, question-answering, and manipulation tasks the search finds architectures that raise success rates, although some high-scoring candidates contain leaks and must be discarded. The same experiments reveal that simulator noise and narrow search paths limit how much improvement can be extracted.

Core claim

Agent Architecture Search transfers to perceptual embodied agents when evaluated through simulator rollouts, yielding deployable architectures that deliver directional success-rate gains on vision-language navigation, embodied question answering, and language-conditioned manipulation. A 3-by-4 matrix (three AAS variants by four embodied executors) shows these gains while also exposing three constraints: rollout noise can mask optimization signals, search can become trapped in local edit basins, and episode-level credit assignment emerges only partially even with detailed logs.

What carries the argument

AgentCanvas, a typed-graph runtime that hosts embodied executors as editable node-and-wire programs with simulator-aware execution and episode-level logs, together with KDLoop, a search procedure that cycles through proposal, critique, experiment, and distillation with triggered reflection after stalls.
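
To make the typed-graph idea concrete, here is a minimal Python sketch of nodes with typed ports and a check that every wire joins ports of the same type. It is an illustration under stated assumptions only: the class names, the string type tags, and `check_wires` are hypothetical and are not taken from AgentCanvas, whose actual schema is not given in this summary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Port:
    name: str
    dtype: str  # hypothetical type tag such as "image" or "text"


@dataclass
class Node:
    name: str
    inputs: List[Port] = field(default_factory=list)
    outputs: List[Port] = field(default_factory=list)


@dataclass(frozen=True)
class Wire:
    src_node: str
    src_port: str
    dst_node: str
    dst_port: str


def find_port(ports: List[Port], name: str) -> Optional[Port]:
    """Return the port with the given name, or None if it is absent."""
    return next((p for p in ports if p.name == name), None)


def check_wires(nodes: List[Node], wires: List[Wire]) -> List[str]:
    """Return one error message per wire that is dangling or type-mismatched."""
    by_name = {n.name: n for n in nodes}
    errors = []
    for w in wires:
        src, dst = by_name.get(w.src_node), by_name.get(w.dst_node)
        if src is None or dst is None:
            errors.append(f"{w}: unknown node")
            continue
        out_port = find_port(src.outputs, w.src_port)
        in_port = find_port(dst.inputs, w.dst_port)
        if out_port is None or in_port is None:
            errors.append(f"{w}: unknown port")
        elif out_port.dtype != in_port.dtype:
            errors.append(f"{w}: {out_port.dtype} -> {in_port.dtype} mismatch")
    return errors


# Toy three-node executor (hypothetical): perception -> planner -> action.
nodes = [
    Node("perception", outputs=[Port("frame", "image"), Port("caption", "text")]),
    Node("planner", inputs=[Port("observation", "text")], outputs=[Port("plan", "text")]),
    Node("action", inputs=[Port("plan", "text")]),
]
good_wires = [
    Wire("perception", "caption", "planner", "observation"),
    Wire("planner", "plan", "action", "plan"),
]
# An edit that feeds raw images into a text port is rejected before any rollout.
bad_wires = good_wires + [Wire("perception", "frame", "planner", "observation")]

good_errors = check_wires(nodes, good_wires)
bad_errors = check_wires(nodes, bad_wires)
```

One plausible benefit of typing the ports is that a search procedure editing the graph can have malformed candidates filtered cheaply, before simulator time is spent on them.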

If this is right

Architecture search identifies variants that raise success rates on the three tested embodied tasks.
High-scoring candidates can be rejected after manual inspection for data leaks.
Rollout noise can mask true differences between architectures.
Search can become trapped inside local edit basins with limited further improvement.
Detailed episode logs produce only partial episode-level credit assignment.
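
The point about rollout noise can be illustrated with a simple calculation. The sketch below compares two success rates using a normal approximation to the binomial and reports whether the gap is smaller than its own sampling error. It assumes independent episodes with a fixed success probability per architecture; the function names and the episode counts are hypothetical, and this is not the evaluation protocol used in the paper.

```python
import math
from typing import Tuple


def success_rate_gap(succ_a: int, n_a: int, succ_b: int, n_b: int) -> Tuple[float, float]:
    """Return (SR_b - SR_a, standard error of that difference)."""
    p_a, p_b = succ_a / n_a, succ_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return p_b - p_a, se


def within_noise(succ_a: int, n_a: int, succ_b: int, n_b: int, z: float = 1.96) -> bool:
    """True when the gap is no larger than z standard errors (roughly 95% for z=1.96)."""
    gap, se = success_rate_gap(succ_a, n_a, succ_b, n_b)
    return abs(gap) <= z * se


# Illustrative numbers only: the same 6-point gap at two evaluation budgets.
small_budget = within_noise(40, 100, 46, 100)      # 100 episodes per architecture
large_budget = within_noise(400, 1000, 460, 1000)  # 1000 episodes per architecture
```

With these made-up counts, the six-point gap is indistinguishable from noise at 100 episodes per arm and distinguishable at 1000, which is one way a real improvement could be masked during search.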

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same search loop could be applied to agent designs that include learned internal modules rather than fixed components.
Reducing rollout variance through more episodes or better simulators would likely increase the reliability of the discovered gains.
Transferring the resulting architectures from simulation to physical robots would test whether the measured improvements survive domain shift.

Load-bearing premise

Simulator rollouts supply sufficiently reliable optimization signals for the search to identify better architectures despite rollout noise and local edit basins.

What would settle it

Running the same search procedure on a fresh set of embodied tasks and finding that no searched architecture improves success rate over the original hand-designed baselines on held-out episodes.

Figures

Figures reproduced from arXiv: 2606.30111 by Gengze Zhou, Jian Zhou, Jin Li, Qi Wu, Shuai Fu, Sihao Lin.

**Figure 1.** Embodied Agent Architecture Search. (1) Four seed Executors emit per-episode traces. (2) ADAS/AFlow/KDLoop edit Executor code through a shared coding-agent harness and only the proposer differs. (3) The 3×4 ∆SR matrix shows several deployable or directional improvements plus one leak-bearing apparent gain, and surfaces three constraints of embodied AAS: evaluation noise, local edit basins, and partial cre…

**Figure 2.** Top: AAS-method diagrams. Each variant differs in proposer logic and persistent memory. Bottom: the shared coding-agent harness. The outer loop invokes a proposer, implementer, and evaluator. Only the proposer is variant-specific, while implementation, evaluation, validation, and file access are shared. composes over a curated text-workflow operator library, whereas embodied AAS must edit the seed graph di…

**Figure 3.** Representative search trajectories. Each panel shows single-pass SR, best-so-far SR, the shared …

**Figure 4.** AgentCanvas as an embodied-AAS substrate. Left: an agent factors into typed nodes, typed-port wires, and shared state in a container reached by access grants (shown: the MapGPT-MP3D loop body; node colour = source nodeset; the loop is an iter out→iter in carry, not a back-edge; §A.2–§A.3). Right: the batch-eval optimisation (§A.4). Naive synchronous lock-step (top) stalls every worker on the slowest module…

**Figure 5.** The AgentCanvas UI, for a human. The typed graph an Optimizer edits as JSON is, for a researcher, a node-and-wire canvas with an inspectable state panel (shown: the SmartWay-CE graph and its run state). The visual editor and the Optimizer’s edit/evaluate interface are two views of one artifact. A.2 The Executor as an editable typed graph An Executor in AGENTCANVAS is a pure-data GraphDefinition: a JSON doc…

**Figure 6.** Full 3 × 4 per-cell search trajectories (rows: optimizers, columns: executors), same conventions as …

Abstract

Embodied agents are typically built as hand-designed compositions of perception, memory, planning, and action modules. This modularity exposes a large architectural design space, but current systems still rely on researcher intuition to choose where information is stored, how observations are processed, and how model calls are connected. Agent Architecture Search (AAS) automates such design for text-domain agents, but has not been systematically evaluated on perceptual embodied agents through simulator rollouts. We study this transfer. We introduce AgentCanvas, a typed-graph runtime that hosts embodied executors as editable node-and-wire programs with simulator-aware execution and episode-level logs, and KDLoop, a coding-agent search procedure that cycles through proposal, critique, experiment, and distillation, with triggered reflection after stalls. We evaluate three AAS variants across four embodied executors spanning vision-language navigation, embodied question answering, and language-conditioned manipulation. The resulting 3x4 matrix shows that architecture-level search can produce deployable and directional success-rate gains on embodied tasks, while one apparent high-scoring candidate is rejected as leak-bearing. At the same time, the experiments expose constraints that are muted in text-domain AAS: optimization signals can be masked by rollout noise, search can become trapped in local edit basins, and episode-level credit assignment only partially emerges even when detailed logs are available. These results characterize both the promise and the current limits of automated architecture search for embodied agents.
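
The cycle of proposal, critique, experiment, and distillation, with reflection triggered after stalls, can be pictured as the loop below. This is a schematic sketch and not KDLoop itself: the function `search_loop`, its callback signatures, the `patience` threshold, and the toy integer "architecture" are all assumptions made for illustration, since the abstract names the phases but does not specify how they are implemented.

```python
from typing import Any, Callable, List, Tuple


def search_loop(
    seed: Any,
    propose: Callable[[Any, List[str]], Any],
    critique: Callable[[Any, List[str]], bool],
    evaluate: Callable[[Any], float],
    distill: Callable[[Any, float], str],
    reflect: Callable[[List[str]], str],
    iterations: int = 10,
    patience: int = 3,
) -> Tuple[Any, float, List[str]]:
    """Greedy search over candidates with a stall counter that triggers reflection."""
    best, best_score = seed, evaluate(seed)
    notes: List[str] = []  # distilled lessons carried across iterations
    stalls = 0
    for _ in range(iterations):
        candidate = propose(best, notes)            # proposal
        if critique(candidate, notes):              # critique gates the experiment
            score = evaluate(candidate)             # experiment (a rollout batch in practice)
            notes.append(distill(candidate, score))  # distillation
            if score > best_score:
                best, best_score, stalls = candidate, score, 0
            else:
                stalls += 1
        else:
            stalls += 1
        if stalls >= patience:                      # triggered reflection after a stall
            notes.append(reflect(notes))
            stalls = 0
    return best, best_score, notes


# Toy stand-ins (hypothetical): an "architecture" is an integer and the score peaks at 3.
best, best_score, notes = search_loop(
    seed=0,
    propose=lambda current, notes: current + 1,
    critique=lambda cand, notes: cand <= 5,
    evaluate=lambda arch: -float((arch - 3) ** 2),
    distill=lambda cand, score: f"{cand}:{score}",
    reflect=lambda notes: "reflect",
)
```

In this toy run the proposer can only step by one from the current best, so after reaching the peak it keeps re-proposing the same worse neighbour and the stall counter fires. In a real setting `evaluate` would be noisy and `reflect` would be expected to change what the proposer tries next; here it only records that a stall happened.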

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Architecture search yields some directional gains on embodied tasks but rollout noise and local basins make the improvements unreliable.

This paper applies architecture search to embodied agents and reports directional success-rate gains, but the authors flag that rollout noise and local search basins make those gains hard to rely on.

The work introduces AgentCanvas, a typed-graph runtime for representing and editing embodied agent programs with simulator-aware execution and logs, plus KDLoop, an iterative search procedure that proposes edits, critiques them, runs experiments, distills results, and reflects on stalls. They test three search variants across four executors covering vision-language navigation, embodied QA, and language-conditioned manipulation, producing a 3x4 matrix.

This is new because prior AAS work stayed in text domains; the systematic simulator rollout evaluation on perceptual agents is the fresh piece. The paper does a solid job documenting the practical differences from text settings, such as noise burying signals, search getting trapped in local edits, and only partial episode-level credit assignment even with detailed logs. They also caught and discarded one high-scoring but leaky candidate.

The soft spot is that these same limitations directly weaken the central claim. If optimization signals are masked by noise and search often stalls, the reported gains rest on preliminary evidence that may include artifacts rather than clear architecture superiority. Everything stays in simulation, with no real-robot validation.

This is for embodied AI and robotics researchers who want to reduce hand-design of agent modules. A reader in that niche would find the tooling descriptions and the honest limits section useful. It deserves peer review because it maps both the potential and the current blockers in this area.

Referee Report

1 major / 1 minor

Summary. The paper introduces AgentCanvas, a typed-graph runtime for hosting embodied executors as editable node-and-wire programs with simulator-aware execution, and KDLoop, a coding-agent search procedure cycling through proposal, critique, experiment, and distillation. It evaluates three AAS variants across four embodied executors (vision-language navigation, embodied QA, language-conditioned manipulation), reporting in a 3x4 matrix that architecture-level search yields deployable directional success-rate gains, while rejecting one high-scoring candidate for a leak and noting constraints from rollout noise, local edit basins, and partial episode-level credit assignment.

Significance. If the directional gains hold under more reliable signals, the work provides new tooling (AgentCanvas, KDLoop) and the first systematic transfer of AAS to perceptual embodied agents via simulator rollouts, characterizing both promise and limits that are muted in text-domain settings.

major comments (1)

[Abstract] The claim that architecture-level search produces 'deployable and directional success-rate gains' rests on the 3x4 matrix, yet the same paragraph states that 'optimization signals can be masked by rollout noise' and 'search can become trapped in local edit basins'. These acknowledged conditions directly threaten whether the observed gains reflect architecture superiority rather than noise artifacts, so the paper needs explicit quantification of signal reliability or additional controls.

minor comments (1)

[Abstract] The abstract notes that 'episode-level credit assignment only partially emerges even when detailed logs are available' but provides no measurement details or comparison to text-domain AAS.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the abstract. We address it directly below and will revise the manuscript accordingly to better qualify our claims in light of the acknowledged experimental constraints.

Referee: [Abstract] The claim that architecture-level search produces 'deployable and directional success-rate gains' rests on the 3x4 matrix, yet the same paragraph states that 'optimization signals can be masked by rollout noise' and 'search can become trapped in local edit basins'. These acknowledged conditions directly threaten whether the observed gains reflect architecture superiority rather than noise artifacts, so the paper needs explicit quantification of signal reliability or additional controls.

Authors: We agree that the abstract must more explicitly reconcile the reported directional gains with the documented effects of rollout noise and local edit basins. The 3x4 matrix demonstrates that, across three AAS variants and four embodied executors, the search procedure consistently identifies architectures that either match or exceed the hand-designed baselines in final success rate, with one high-scoring candidate correctly rejected after leak detection. These outcomes are directional rather than absolute, and the paper already frames them as subject to simulator noise. To strengthen the presentation, we will revise the abstract to state that the gains are 'directional under the evaluated conditions' and will insert a short clause noting the observed variance in repeated rollouts (computed from the existing episode logs). This provides the requested quantification of signal reliability without requiring new experiments, while preserving the characterization of both promise and limits.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely experimental tooling and evaluation

The manuscript introduces AgentCanvas (typed-graph runtime) and KDLoop (search procedure) then reports a 3x4 matrix of simulator-based success rates for three AAS variants on four embodied executors. No equations, fitted parameters, or first-principles derivations appear; the reported directional gains are direct empirical measurements from rollouts, not quantities that reduce to their own inputs by construction. Prior AAS work is cited only as background motivation and is not used to justify uniqueness theorems or load-bearing premises within the present results. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Only the abstract is available; no free parameters, mathematical axioms, or invented physical entities are described. The two introduced software components are engineering artifacts rather than postulated entities with independent evidence.

invented entities (2)

AgentCanvas (no independent evidence)
purpose: typed-graph runtime that hosts embodied executors as editable node-and-wire programs
New software artifact introduced to enable the search experiments.

KDLoop (no independent evidence)
purpose: coding-agent search procedure that cycles through proposal, critique, experiment, and distillation
New search algorithm introduced to automate architecture design.

Reference graph

Works this paper leans on

37 extracted references · 15 canonical work pages · 9 internal anchors

[1] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683, 2018.

[2] G. Zhou, Y. Hong, and Q. Wu. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.

[3] J. Chen, B. Lin, R. Xu, Z. Chai, X. Liang, and K.-Y. Wong. Mapgpt: Map-guided prompting with adaptive path planning for vision-and-language navigation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9796–9810, 2024.

[4] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Embodied question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–10, 2018.

[5] A. Z. Ren, J. Clark, A. Dixit, M. Itkina, A. Majumdar, and D. Sadigh. Explore until confident: Efficient exploration for embodied question answering. arXiv preprint arXiv:2403.15941, 2024.

[6] S. Saxena, B. Buchanan, C. Paxton, P. Liu, B. Chen, N. Vaskevicius, L. Palmieri, J. Francis, and O. Kroemer. Grapheqa: Using 3d semantic scene graphs for real-time embodied question answering. arXiv preprint arXiv:2412.14480, 2024.

[7] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng. Code as policies: Language model programs for embodied control. In 2023 IEEE International conference on robotics and automation (ICRA), pages 9493–9500. IEEE, 2023.

[8] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023.

[9] S. Hu, C. Lu, and J. Clune. Automated design of agentic systems. In International Conference on Learning Representations, 2025.

[10] J. Zhang, J. Xiang, Z. Yu, F. Teng, X. Chen, J. Chen, M. Zhuge, X. Cheng, S. Hong, J. Wang, et al. Aflow: Automating agentic workflow generation. In International Conference on Learning Representations, 2025.

[11] Y. Shang, Y. Li, K. Zhao, L. Ma, J. Liu, F. Xu, and Y. Li. Agentsquare: Automatic llm agent search in modular design space. In International Conference on Learning Representations, 2025.

[12] G. Zhang, L. Niu, J. Fang, K. Wang, L. Bai, and X. Wang. Multi-agent architecture search via agentic supernet. arXiv preprint arXiv:2502.04180, 2025.

[13] X. Shi, Z. Li, W. Lyu, J. Xia, F. Dayoub, Y. Qiao, and Q. Wu. Smartway: Enhanced waypoint prediction and backtracking for zero-shot vision-and-language navigation. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025.

[14] J. Xu, A. Koesdwiady, S. Bei, Y. Han, B. Huang, D. Wang, Y. Chen, Z. Wang, P. Wang, P. Li, et al. Rethinking the value of multi-agent workflow: A strong single agent baseline. arXiv preprint arXiv:2601.12307, 2026.

[15] M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10740–10749, 2020.

[16] A. Brohan, Y. Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Irpan, E. Jang, R. Julian, et al. Do as i can, not as i say: Grounding language in robotic affordances. In Conference on robot learning, pages 287–318. PMLR, 2023.

[17] W. Huang, C. Wang, Y. Li, R. Zhang, and L. Fei-Fei. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652, 2024.

[18] M. Zhai, H. Liang, X. Fan, Z. Gao, C. Li, C. Sun, X. Bin, Y. Wu, and Y. Jia. Multi-step reasoning for embodied question answering via tool augmentation. arXiv preprint arXiv:2510.20310, 2025.

[19] D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. Palm-e: An embodied multimodal language model, 2023. URL https://arxiv.org/abs/2303.03378

[20] RT-1: Robotics Transformer for Real-World Control at Scale. A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, K.-H. Lee, S. Levine, Y. Lu, U. Malla, D. Manjunath, I. Mordatch, O. Nachum, C. Parada, J. Peralta, E. Perez, K. Pertsc...

[21] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.

[22] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.

[23] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky. π0: A vision-language-action flow model for general robot control, 2024. URL https://arxiv. o...

[24] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.

[25] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36:8634–8652, 2023.

[26] M. Zhuge, W. Wang, L. Kirsch, F. Faccio, D. Khizbullin, and J. Schmidhuber. Gptswarm: Language agents as optimizable graphs. In Forty-first International Conference on Machine Learning, 2024.

[27] C.-A. Cheng, A. Nie, and A. Swaminathan. Trace is the next autodiff: Generative optimization with rich feedback, execution traces, and llms. Advances in Neural Information Processing Systems, 37:71596–71642, 2024.

[28] Y. Wang, S. Liu, J. Fang, and Z. Meng. Evoagentx: An automated framework for evolving agentic workflows. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 643–655, 2025.

[29] O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, and C. Potts. Dspy: Compiling declarative language model calls into self-improving pipelines, 2023. URL https://arxiv.org/abs/2310.03714

[30] C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen. Large language models as optimizers. In International Conference on Learning Representations, 2024.

[31] M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, Z. Huang, C. Guestrin, and J. Zou. Textgrad: Automatic "differentiation" via text. arXiv preprint arXiv:2406.07496, 2024.

[32] S. Xu, J. Zhang, S. Di, Y. Luo, L. Yao, H. Liu, J. Zhu, F. Liu, and M.-L. Zhang. Robustflow: Towards robust agentic workflow generation. arXiv preprint arXiv:2509.21834, 2025.

[33] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024.

[34] X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, et al. Openhands: An open platform for ai software developers as generalist agents. In International Conference on Learning Representations, volume 2025, 2025.

[35] Anthropic. Claude code, 2025. https://www.anthropic.com/product/claude-code

[36] LangChain. LangGraph, 2024. https://github.com/langchain-ai/langgraph

[37] Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First conference on language modeling, 2024.

Appendix Contents
• Appendix A – AgentCanvas. The typed-graph Executor substrate (§A).
• Appendix B – Per-Cell Search Trajectories. Full sea...
