Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision

Alessandro Adami; Marco Todescato; Pietro Falco; Ruggero Carli; Tommaso Tubaldo

REVIEW · 2 major objections · 2 minor · 28 references

Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.

T0 review · grok-4.3

A 12B-parameter model learns to output executable Behavior Tree policies for robots from vision and language using only synthetic data.

2026-05-19 17:26 UTC pith:SXAS6Z57

load-bearing objection: The paper trains a VLM on synthetic data to output Behavior Trees instead of end-to-end policies, and claims zero-shot real-robot transfer on two arms.

arxiv 2604.02812 v2 pith:SXAS6Z57 submitted 2026-04-03 cs.RO

classification cs.RO

keywords vision-language models, behavior trees, neuro-symbolic policies, synthetic data, zero-shot transfer, robotic manipulators, structured policies, task decomposition

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that vision-language models can be trained to produce structured and interpretable robot policies rather than opaque end-to-end mappings. An automated pipeline creates synthetic multimodal data of randomized scenes paired with natural language instructions and corresponding Behavior Tree policies generated by a foundation model. This supervision lets the 12B-parameter model acquire the spatial-symbolic mappings needed for policy synthesis without manual real-world labels. If the approach holds, it decouples high-level task decomposition from low-level motor control and supports direct transfer to physical hardware.

Core claim

By decoupling structured task decomposition under constrained symbolic grammars from hardware-specific motor control, the authors show that a 12B-parameter model can learn the spatial-symbolic mappings required for executable Behavior Tree synthesis solely through in-silico supervision. An automated pipeline generates a synthetic multimodal dataset of domain-randomized scenes paired with instruction-policy examples produced by a foundation model. Real-world experiments on two heterogeneous robotic manipulators confirm that the resulting policies achieve zero-shot transfer to physical environments.

What carries the argument

The automated pipeline that generates synthetic multimodal datasets of domain-randomized scenes paired with instruction-policy examples from a foundation model to supervise Behavior Tree synthesis.
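The shape of such a pipeline can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the object palette, the coordinate ranges, the XML serialization, and the `teacher_label` function (a stand-in for the foundation-model call) are all assumptions introduced here.

```python
import random
from dataclasses import dataclass

COLORS = ["red", "green", "blue", "yellow"]   # hypothetical palette
SHAPES = ["cube", "cylinder", "sphere"]       # hypothetical object set


@dataclass
class SceneObject:
    shape: str
    color: str
    x: float  # table coordinates in metres (assumed convention)
    y: float


@dataclass
class Triple:
    scene: list          # list of SceneObject; stands in for a rendered image
    instruction: str
    behavior_tree: str   # serialized policy, here a small XML string


def random_scene(rng, n_objects=3):
    """Domain randomization: sample object identity and placement."""
    return [
        SceneObject(
            shape=rng.choice(SHAPES),
            color=rng.choice(COLORS),
            x=round(rng.uniform(-0.3, 0.3), 3),
            y=round(rng.uniform(0.2, 0.6), 3),
        )
        for _ in range(n_objects)
    ]


def teacher_label(scene, rng):
    """Placeholder for the foundation-model call that writes the label."""
    target = rng.choice(scene)
    name = f"{target.color}_{target.shape}"
    instruction = f"Pick up the {target.color} {target.shape}."
    tree = (
        "<Sequence>"
        f'<Action name="MoveTo" target="{name}"/>'
        f'<Action name="Pick" target="{name}"/>'
        "</Sequence>"
    )
    return instruction, tree


def generate_dataset(n_samples, seed=0):
    """Build scene-instruction-BT triples reproducibly from a seed."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_samples):
        scene = random_scene(rng)
        instruction, tree = teacher_label(scene, rng)
        dataset.append(Triple(scene, instruction, tree))
    return dataset


if __name__ == "__main__":
    for triple in generate_dataset(2):
        print(triple.instruction, "->", triple.behavior_tree)
```

In a real pipeline the label comes from a model rather than a template, so a generated triple can be wrong; the template here only fixes the data layout.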

Load-bearing premise

The synthetic dataset generated by the automated pipeline supplies high-fidelity supervision that is sufficient for the trained model to produce policies that work directly on physical robots.

What would settle it

Running the trained model on the two real robotic manipulators in new physical scenes and observing whether the output Behavior Trees correctly and safely execute the given instructions would confirm or refute the zero-shot transfer.

If this is right

Structured Behavior Tree policies provide interpretability, modularity, and reactive execution compared with end-to-end visuomotor policies.
The data bottleneck in robotic planning can be bypassed by procedurally synthesizing high-fidelity neuro-symbolic training data.
Policies trained in simulation transfer zero-shot to heterogeneous real-world manipulators without additional fine-tuning.
Task decomposition under symbolic grammars remains separate from hardware-specific control, enabling reuse across platforms.
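The reactive execution mentioned in the first point comes from how a Behavior Tree is ticked: control nodes re-evaluate their children from the left on every tick. The sketch below uses the standard textbook semantics of Sequence and Fallback nodes; the leaf names (`IsHolding`, `MoveTo`, `Grasp`) and the state dictionary are hypothetical and do not come from the paper.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    RUNNING = "running"


class Leaf:
    """Condition or action: wraps a function of the shared state dict."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def tick(self, state):
        # Record which leaves were visited, for inspection.
        state.setdefault("trace", []).append(self.name)
        return self.fn(state)


class Sequence:
    """Succeeds only if every child succeeds, evaluated left to right."""

    def __init__(self, *children):
        self.children = children

    def tick(self, state):
        for child in self.children:
            status = child.tick(state)
            if status is not Status.SUCCESS:
                return status
        return Status.SUCCESS


class Fallback:
    """Returns the first child result that is not FAILURE."""

    def __init__(self, *children):
        self.children = children

    def tick(self, state):
        for child in self.children:
            status = child.tick(state)
            if status is not Status.FAILURE:
                return status
        return Status.FAILURE


def is_holding(state):
    return Status.SUCCESS if state["holding"] else Status.FAILURE


def move_to(state):
    state["at_object"] = True
    return Status.SUCCESS


def grasp(state):
    if not state["at_object"]:
        return Status.FAILURE
    state["holding"] = True
    return Status.SUCCESS


# "If not already holding the object, move to it and grasp it."
tree = Fallback(
    Leaf("IsHolding", is_holding),
    Sequence(Leaf("MoveTo", move_to), Leaf("Grasp", grasp)),
)

if __name__ == "__main__":
    state = {"holding": False, "at_object": False}
    print(tree.tick(state), state["trace"])
    state["trace"] = []
    print(tree.tick(state), state["trace"])
```

On the second tick the condition already holds, so the Fallback returns without visiting the action leaves. The visited-leaf trace is also what makes a failed run inspectable, which is the interpretability argument in concrete form.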

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same synthetic supervision approach could be applied to other structured policy representations such as finite-state machines.
Scaling the domain randomization and scene variety in the pipeline might support more complex multi-step tasks.
Testing the method on additional robot types or sensor modalities would check how broadly the zero-shot transfer generalizes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Referee Report

2 major / 2 minor

Summary. The paper proposes a neuro-symbolic pipeline in which a VLM generates executable Behavior Tree (BT) policies from visual observations, language instructions, and system specifications. An automated synthetic dataset generator produces domain-randomized scene-instruction-BT triples that supervise a 12B-parameter model; the resulting policies are claimed to achieve zero-shot transfer on two heterogeneous real manipulators.

Significance. If the central claim holds, the work would be significant for bridging foundation-model perception with interpretable, modular robot control. The explicit decoupling of symbolic task decomposition from hardware-specific actuation, together with the demonstration of real-world zero-shot transfer on two distinct manipulators, constitutes a concrete strength that could reduce reliance on manual annotation and improve policy analyzability.

major comments (2)

[Experiments] § Experiments / real-robot evaluation: the abstract and results claim zero-shot success on two manipulators, yet no quantitative success rates, baseline comparisons, or systematic failure-mode analysis are reported. This absence directly undermines the load-bearing claim that the learned BTs transfer reliably to physical hardware.
[Method / Synthetic Dataset Pipeline] § Synthetic dataset pipeline: the central assumption that procedurally generated instruction-BT pairs supply high-fidelity supervision is not supported by any quantitative validation of label correctness, physical validity, or coverage of edge cases. Because the generator is itself a VLM, systematic symbolic or dynamic errors would be learned and exposed on real robots where contact, sensing, and dynamics differ from simulation.

minor comments (2)

[Method] Figure captions and text should explicitly state the exact BT grammar constraints and how they are enforced during VLM prompting to improve reproducibility.
[Training Details] Clarify whether the 12B model is fine-tuned or prompted; the current description leaves the training regime ambiguous.
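To make the first minor comment concrete, one way a grammar constraint could be stated and checked is shown below. It is a sketch under assumptions: the paper's actual grammar is not given in this review, so the XML encoding, the allowed control nodes, and the leaf vocabularies are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical grammar: two control node types and a closed leaf vocabulary.
CONTROL = {"Sequence", "Fallback"}
LEAVES = {
    "Action": {"MoveTo", "Pick", "Place"},
    "Condition": {"IsHolding", "IsAt"},
}


def validate_bt(xml_text):
    """Return a list of grammar violations; an empty list means valid."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    errors = []

    def visit(node, path):
        here = f"{path}/{node.tag}"
        if node.tag in CONTROL:
            if len(node) == 0:
                errors.append(f"{here}: control node has no children")
            for child in node:
                visit(child, here)
        elif node.tag in LEAVES:
            if len(node) > 0:
                errors.append(f"{here}: leaf node has children")
            name = node.get("name")
            if name not in LEAVES[node.tag]:
                errors.append(f"{here}: unknown {node.tag} name {name!r}")
        else:
            errors.append(f"{here}: tag not in grammar")

    visit(root, "")
    return errors


if __name__ == "__main__":
    good = '<Sequence><Condition name="IsAt"/><Action name="Pick"/></Sequence>'
    bad = '<Sequence><Parallel/><Action name="Fly"/></Sequence>'
    print(validate_bt(good))
    print(validate_bt(bad))
```

A check of this kind could run both on the teacher's labels before training and on the fine-tuned model's outputs before execution; stating which of the two the paper does would address the reproducibility concern.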

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the manuscript and indicate the specific revisions that will be incorporated.

Referee: [Experiments] § Experiments / real-robot evaluation: the abstract and results claim zero-shot success on two manipulators, yet no quantitative success rates, baseline comparisons, or systematic failure-mode analysis are reported. This absence directly undermines the load-bearing claim that the learned BTs transfer reliably to physical hardware.

Authors: We acknowledge that the current manuscript presents real-robot results primarily via qualitative demonstrations of successful task completion on the two manipulators rather than aggregated quantitative success rates or formal baseline comparisons. This choice reflected an initial emphasis on cross-platform zero-shot feasibility. We agree that quantitative metrics and failure analysis are needed to support the reliability claim. In the revised manuscript we will add tables reporting success rates over repeated trials for each task and platform, comparisons against direct VLM prompting and scripted BT baselines, and a categorized failure-mode breakdown (perception, symbolic planning, and execution errors). revision: yes
Referee: [Method / Synthetic Dataset Pipeline] § Synthetic dataset pipeline: the central assumption that procedurally generated instruction-BT pairs supply high-fidelity supervision is not supported by any quantitative validation of label correctness, physical validity, or coverage of edge cases. Because the generator is itself a VLM, systematic symbolic or dynamic errors would be learned and exposed on real robots where contact, sensing, and dynamics differ from simulation.

Authors: The referee is correct that the original submission does not provide direct quantitative metrics on the correctness or coverage of the VLM-generated BT labels. We had treated downstream real-world transfer as indirect validation. To address the concern about learned errors, the revised manuscript will include a new validation subsection reporting (i) syntactic and simulation-based physical validity rates on a held-out sample of generated triples, (ii) agreement with human annotators on a random subset, and (iii) statistics on task-type coverage and edge-case inclusion. These additions will make the fidelity of the synthetic supervision explicit. revision: yes
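Both promised additions reduce to reporting a proportion over a finite number of trials or samples, where the uncertainty matters as much as the point estimate. A minimal sketch of one common choice, the Wilson score interval, follows. The counts in the usage example are invented for illustration and are not results from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical counts: 8 successful executions out of 10 trials.
    low, high = wilson_interval(8, 10)
    print(f"success rate 0.80, 95% interval [{low:.3f}, {high:.3f}]")
```

With only ten trials per task the interval spans roughly 0.49 to 0.94, which is why the number of repetitions should be reported alongside any success rate or label-validity rate.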

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external VLM supervision and real-robot validation

The paper's central claim rests on generating synthetic multimodal data via an external foundation model (VLM) to produce instruction-policy pairs under constrained symbolic grammars, then training a 12B model on this in-silico supervision to learn spatial-symbolic mappings for Behavior Tree synthesis. This is followed by zero-shot transfer testing on two heterogeneous physical manipulators. No load-bearing step reduces to self-definition, fitted inputs renamed as predictions, or self-citation chains; the data generator is independent, the symbolic constraints are stated as external grammars, and empirical hardware results supply falsifiable external grounding. The approach is therefore checked against evidence outside its own fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that synthetic data from foundation models can substitute for real annotated data and that Behavior Trees under constrained grammars capture necessary task structure for transfer.

axioms (1)

domain assumption: Behavior Trees under constrained symbolic grammars provide a suitable modular and reactive representation for robot task policies.
Invoked to justify decoupling high-level symbolic decomposition from low-level motor control.

invented entities (1)

Automated synthetic multimodal dataset pipeline (no independent evidence)
purpose: Generate scalable instruction-policy pairs without manual annotation.
Procedurally created using domain randomization and foundation model outputs; fidelity to real scenes is unverified in the abstract.

pith-pipeline@v0.9.0 · 5781 in / 1369 out tokens · 59616 ms · 2026-05-19T17:26:22.195699+00:00 · methodology

Original abstract

Vision-Language Models (VLMs) have recently demonstrated strong capabilities in mapping multimodal observations to robot behaviors. However, most current approaches rely on end-to-end visuomotor policies that remain opaque and difficult to analyze, limiting their use in real-world robotic applications. In contrast, classical robotic systems often rely on structured policy representations that provide interpretability, modularity, and reactive execution. This work investigates how foundation models can be specialized to generate structured robot policies grounded in multimodal perception, bridging high-dimensional learning and symbolic control. We propose a neuro-symbolic approach in which a VLM synthesizes executable Behavior Tree policies from visual observations, natural language instructions, and structured system specifications. To enable scalable supervision without manual annotation, we introduce an automated pipeline that generates a synthetic multimodal dataset of domain-randomized scenes paired with instruction-policy examples produced by a foundation model. By decoupling structured task decomposition under constrained symbolic grammars from hardware-specific motor control, we demonstrate that a 12B-parameter model can learn structured spatial-symbolic mappings required for executable BT synthesis, solely through in-silico supervision. Real-world physical experiments on two heterogeneous robotic manipulators confirm that these structurally constrained policies achieve zero-shot transfer to real-world environments. The results emphasize that the data bottleneck in robotic planning can be bypassed by procedurally synthesizing high-fidelity, neuro-symbolic training data.

Figures

Figures reproduced from arXiv: 2604.02812 by Alessandro Adami, Marco Todescato, Pietro Falco, Ruggero Carli, Tommaso Tubaldo.

**Figure 1.** Overview of the proposed framework. Given synthetic observations, a large foundation model is first used to automatically generate a synthetic supervision dataset composed of task instructions and corresponding Behavior Trees from visual observations and structured system specifications. This dataset is then used to fine-tune the Pixtral-12B vision-language model for constrained symbolic generation. At inf…

**Figure 2.** Examples of synthetic tabletop scenes used in dataset generation.

**Figure 3.** Representation of the target Behavior Tree

**Figure 4.** Example of real-world images, representing scenarios coherent with…

**Figure 5.** Real-world experimental platforms used to validate hardware-agnostic…

**Figure 6.** Sequence of the task computed by the UR5 platform in which the…

Review history (2 revisions)

