
arxiv: 2604.02812 · v2 · pith:SXAS6Z57new · submitted 2026-04-03 · 💻 cs.RO

Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision

Pith reviewed 2026-05-19 17:26 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language models · behavior trees · neuro-symbolic policies · synthetic data · zero-shot transfer · robotic manipulators · structured policies · task decomposition

The pith

A 12B-parameter model learns to output executable Behavior Tree policies for robots from vision and language using only synthetic data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that vision-language models can be trained to produce structured and interpretable robot policies rather than opaque end-to-end mappings. An automated pipeline creates synthetic multimodal data of randomized scenes paired with natural language instructions and corresponding Behavior Tree policies generated by a foundation model. This supervision lets the 12B-parameter model acquire the spatial-symbolic mappings needed for policy synthesis without manual real-world labels. If the approach holds, it decouples high-level task decomposition from low-level motor control and supports direct transfer to physical hardware.

Core claim

By decoupling structured task decomposition under constrained symbolic grammars from hardware-specific motor control, the authors show that a 12B-parameter model can learn the spatial-symbolic mappings required for executable Behavior Tree synthesis solely through in-silico supervision. An automated pipeline generates a synthetic multimodal dataset of domain-randomized scenes paired with instruction-policy examples produced by a foundation model. Real-world experiments on two heterogeneous robotic manipulators confirm that the resulting policies achieve zero-shot transfer to physical environments.

What carries the argument

The automated pipeline that generates synthetic multimodal datasets of domain-randomized scenes paired with instruction-policy examples from a foundation model to supervise Behavior Tree synthesis.
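As a rough illustration of what such a pipeline produces, the Python sketch below builds one scene-instruction-policy triple. Everything in it is an assumption made for illustration: the object vocabulary, the `Pick`/`Place` action names, the dictionary layout of the tree, and the `label_with_teacher` stub that stands in for the foundation-model labeler. The scene is kept symbolic here, whereas the paper pairs rendered images with the instruction and policy.

```python
import random
from dataclasses import dataclass

# Hypothetical attribute vocabulary; the paper's actual asset set is not given here.
COLORS = ["red", "green", "blue"]
SHAPES = ["cube", "cylinder"]


@dataclass
class SceneObject:
    name: str
    x: float  # tabletop position in metres (illustrative ranges)
    y: float


def randomize_scene(rng, n_objects=3):
    """Sample a tabletop scene with unique object names and random positions."""
    names = rng.sample([f"{c}_{s}" for c in COLORS for s in SHAPES], n_objects)
    return [
        SceneObject(n, round(rng.uniform(-0.3, 0.3), 3), round(rng.uniform(0.2, 0.6), 3))
        for n in names
    ]


def label_with_teacher(scene, instruction, target, goal):
    """Stand-in for the foundation-model labeler (assumption, not the paper's model).

    Returns a fixed pick-and-place Behavior Tree as a nested dictionary.
    """
    return {
        "type": "Sequence",
        "children": [
            {"type": "Action", "name": "Pick", "args": [target]},
            {"type": "Action", "name": "Place", "args": [goal]},
        ],
    }


def make_triple(seed):
    """Build one (scene, instruction, behavior_tree) training example."""
    rng = random.Random(seed)  # seeded so each example is reproducible
    scene = randomize_scene(rng)
    target, goal = rng.sample(scene, 2)  # two distinct objects
    instruction = (
        f"Put the {target.name.replace('_', ' ')} "
        f"next to the {goal.name.replace('_', ' ')}."
    )
    tree = label_with_teacher(scene, instruction, target.name, goal.name)
    return {"scene": scene, "instruction": instruction, "behavior_tree": tree}
```

Seeding each example separately keeps the dataset reproducible and makes it easy to regenerate a single triple when a label turns out to be wrong.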

If this is right

  • Structured Behavior Tree policies provide interpretability, modularity, and reactive execution compared with end-to-end visuomotor policies.
  • The data bottleneck in robotic planning can be bypassed by procedurally synthesizing high-fidelity neuro-symbolic training data.
  • Policies trained in simulation transfer zero-shot to heterogeneous real-world manipulators without additional fine-tuning.
  • Task decomposition under symbolic grammars remains separate from hardware-specific control, enabling reuse across platforms.
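To make the terms "modular" and "reactive" concrete, here is a minimal Behavior Tree interpreter in Python. It is a generic textbook-style sketch, not the executor used in the paper; the node classes, the `world` dictionary, and the `IsHolding`/`Pick`/`Place` leaves are hypothetical names chosen for the example.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3


class Action:
    """Leaf node: wraps a callable that reads or updates the world state."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def tick(self, world):
        return self.fn(world)


class Sequence:
    """Ticks children in order; returns the first status that is not SUCCESS."""

    def __init__(self, children):
        self.children = children

    def tick(self, world):
        for child in self.children:
            status = child.tick(world)
            if status != Status.SUCCESS:
                return status
        return Status.SUCCESS


class Fallback:
    """Ticks children in order; returns the first status that is not FAILURE."""

    def __init__(self, children):
        self.children = children

    def tick(self, world):
        for child in self.children:
            status = child.tick(world)
            if status != Status.FAILURE:
                return status
        return Status.FAILURE


def is_holding(world):
    return Status.SUCCESS if world["holding"] else Status.FAILURE


def pick(world):
    world["holding"] = True
    return Status.SUCCESS


def place(world):
    if not world["holding"]:
        return Status.FAILURE
    world["holding"] = False
    world["placed"] = True
    return Status.SUCCESS


def build_tree():
    """Pick only if the object is not already held, then place it."""
    return Sequence([
        Fallback([Action("IsHolding", is_holding), Action("Pick", pick)]),
        Action("Place", place),
    ])
```

Because the tree is re-evaluated from the root on every tick, the `IsHolding` check is repeated each time, which is what lets the policy react if the object is dropped. The leaves are the only hardware-specific part; the tree structure itself can be reused across platforms.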

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same synthetic supervision approach could be applied to other structured policy representations such as finite-state machines.
  • Scaling the domain randomization and scene variety in the pipeline might support more complex multi-step tasks.
  • Testing the method on additional robot types or sensor modalities would check how broadly the zero-shot transfer generalizes.

Load-bearing premise

The synthetic dataset generated by the automated pipeline supplies high-fidelity supervision that is sufficient for the trained model to produce policies that work directly on physical robots.

What would settle it

Running the trained model on the two real robotic manipulators in new physical scenes, and observing whether the output Behavior Trees execute the given instructions correctly and safely, would confirm or refute the zero-shot transfer claim.

Figures

Figures reproduced from arXiv: 2604.02812 by Alessandro Adami, Marco Todescato, Pietro Falco, Ruggero Carli, Tommaso Tubaldo.

Figure 1: Overview of the proposed framework. Given synthetic observations, a large foundation model is first used to automatically generate a synthetic supervision dataset composed of task instructions and corresponding Behavior Trees from visual observations and structured system specifications. This dataset is then used to fine-tune the Pixtral-12B vision-language model for constrained symbolic generation. At inf…
Figure 2: Examples of synthetic tabletop scenes used in dataset generation.
Figure 3: Representation of the target Behavior Tree
Figure 4: Example of real-world images, representing scenarios coherent with…
Figure 5: Real-world experimental platforms used to validate hardware-agnostic…
Figure 6: Sequence of the task computed by the UR5 platform in which the…
Original abstract

Vision-Language Models (VLMs) have recently demonstrated strong capabilities in mapping multimodal observations to robot behaviors. However, most current approaches rely on end-to-end visuomotor policies that remain opaque and difficult to analyze, limiting their use in real-world robotic applications. In contrast, classical robotic systems often rely on structured policy representations that provide interpretability, modularity, and reactive execution. This work investigates how foundation models can be specialized to generate structured robot policies grounded in multimodal perception, bridging high-dimensional learning and symbolic control. We propose a neuro-symbolic approach in which a VLM synthesizes executable Behavior Tree policies from visual observations, natural language instructions, and structured system specifications. To enable scalable supervision without manual annotation, we introduce an automated pipeline that generates a synthetic multimodal dataset of domain-randomized scenes paired with instruction-policy examples produced by a foundation model. By decoupling structured task decomposition under constrained symbolic grammars from hardware-specific motor control, we demonstrate that a 12B-parameter model can learn structured spatial-symbolic mappings required for executable BT synthesis, solely through in-silico supervision. Real-world physical experiments on two heterogeneous robotic manipulators confirm that these structurally constrained policies achieve zero-shot transfer to real-world environments. The results emphasize that the data bottleneck in robotic planning can be bypassed by procedurally synthesizing high-fidelity, neuro-symbolic training data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a neuro-symbolic pipeline in which a VLM generates executable Behavior Tree (BT) policies from visual observations, language instructions, and system specifications. An automated synthetic dataset generator produces domain-randomized scene-instruction-BT triples that supervise a 12B-parameter model; the resulting policies are claimed to achieve zero-shot transfer on two heterogeneous real manipulators.

Significance. If the central claim holds, the work would be significant for bridging foundation-model perception with interpretable, modular robot control. The explicit decoupling of symbolic task decomposition from hardware-specific actuation, together with the demonstration of real-world zero-shot transfer on two distinct manipulators, constitutes a concrete strength that could reduce reliance on manual annotation and improve policy analyzability.

major comments (2)
  1. [Experiments] § Experiments / real-robot evaluation: the abstract and results claim zero-shot success on two manipulators, yet no quantitative success rates, baseline comparisons, or systematic failure-mode analysis are reported. This absence directly undermines the load-bearing claim that the learned BTs transfer reliably to physical hardware.
  2. [Method / Synthetic Dataset Pipeline] § Synthetic dataset pipeline: the central assumption that procedurally generated instruction-BT pairs supply high-fidelity supervision is not supported by any quantitative validation of label correctness, physical validity, or coverage of edge cases. Because the generator is itself a VLM, systematic symbolic or dynamic errors would be learned and exposed on real robots where contact, sensing, and dynamics differ from simulation.
minor comments (2)
  1. [Method] Figure captions and text should explicitly state the exact BT grammar constraints and how they are enforced during VLM prompting to improve reproducibility.
  2. [Training Details] Clarify whether the 12B model is fine-tuned or prompted; the current description leaves the training regime ambiguous.
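The first minor comment asks how grammar constraints are enforced. One common way to do this, sketched below in Python, is to check every generated tree against a whitelist of node types and action signatures before it is used as a label or sent to a robot. The grammar here (`Sequence`, `Fallback`, and three one-argument actions) is a hypothetical example; the paper's actual grammar is not stated on this page.

```python
# Hypothetical grammar: allowed control nodes and action name -> argument count.
CONTROL_NODES = {"Sequence", "Fallback"}
ACTIONS = {"Pick": 1, "Place": 1, "MoveTo": 1}


def validate(node, path="root"):
    """Return a list of grammar violations; an empty list means the tree is valid."""
    if not isinstance(node, dict):
        return [f"{path}: node must be a mapping"]

    errors = []
    kind = node.get("type")

    if kind in CONTROL_NODES:
        children = node.get("children")
        if not isinstance(children, list) or not children:
            errors.append(f"{path}: {kind} needs at least one child")
        else:
            for i, child in enumerate(children):
                errors.extend(validate(child, f"{path}.{i}"))
    elif kind == "Action":
        name = node.get("name")
        args = node.get("args", [])
        if name not in ACTIONS:
            errors.append(f"{path}: unknown action {name!r}")
        elif len(args) != ACTIONS[name]:
            errors.append(
                f"{path}: {name} expects {ACTIONS[name]} argument(s), got {len(args)}"
            )
    else:
        errors.append(f"{path}: unknown node type {kind!r}")

    return errors
```

Reporting the path of each violation, rather than a single pass/fail flag, makes it possible to count which grammar rules a generator breaks most often.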

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the manuscript and indicate the specific revisions that will be incorporated.

Point-by-point responses
  1. Referee: [Experiments] § Experiments / real-robot evaluation: the abstract and results claim zero-shot success on two manipulators, yet no quantitative success rates, baseline comparisons, or systematic failure-mode analysis are reported. This absence directly undermines the load-bearing claim that the learned BTs transfer reliably to physical hardware.

    Authors: We acknowledge that the current manuscript presents real-robot results primarily via qualitative demonstrations of successful task completion on the two manipulators rather than aggregated quantitative success rates or formal baseline comparisons. This choice reflected an initial emphasis on cross-platform zero-shot feasibility. We agree that quantitative metrics and failure analysis are needed to support the reliability claim. In the revised manuscript we will add tables reporting success rates over repeated trials for each task and platform, comparisons against direct VLM prompting and scripted BT baselines, and a categorized failure-mode breakdown (perception, symbolic planning, and execution errors). revision: yes

  2. Referee: [Method / Synthetic Dataset Pipeline] § Synthetic dataset pipeline: the central assumption that procedurally generated instruction-BT pairs supply high-fidelity supervision is not supported by any quantitative validation of label correctness, physical validity, or coverage of edge cases. Because the generator is itself a VLM, systematic symbolic or dynamic errors would be learned and exposed on real robots where contact, sensing, and dynamics differ from simulation.

    Authors: The referee is correct that the original submission does not provide direct quantitative metrics on the correctness or coverage of the VLM-generated BT labels. We had treated downstream real-world transfer as indirect validation. To address the concern about learned errors, the revised manuscript will include a new validation subsection reporting (i) syntactic and simulation-based physical validity rates on a held-out sample of generated triples, (ii) agreement with human annotators on a random subset, and (iii) statistics on task-type coverage and edge-case inclusion. These additions will make the fidelity of the synthetic supervision explicit. revision: yes
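Both responses promise rates estimated from a limited number of trials or samples. A minimal Python sketch of how such rates could be reported with uncertainty is given below, using the Wilson score interval for a binomial proportion and a simple tally of failure categories. The function names and the trial record format are assumptions for illustration; the paper does not specify its evaluation code, and no numbers here come from the paper.

```python
import math
from collections import Counter


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def summarize(trials):
    """Summarize a list of (succeeded, failure_mode) records.

    failure_mode is None for successful trials, otherwise a category label
    such as "perception", "planning", or "execution" (hypothetical labels).
    """
    n = len(trials)
    successes = sum(1 for ok, _ in trials if ok)
    low, high = wilson_interval(successes, n)
    failures = Counter(mode for ok, mode in trials if not ok)
    return {
        "success_rate": successes / n,
        "ci95": (low, high),
        "failure_modes": dict(failures),
    }
```

The Wilson interval is preferred over the plain normal approximation for the small trial counts typical of real-robot experiments, because it stays within [0, 1] and behaves sensibly when the observed rate is close to 0 or 1.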

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external VLM supervision and real-robot validation

full rationale

The paper's central claim rests on generating synthetic multimodal data via an external foundation model (VLM) to produce instruction-policy pairs under constrained symbolic grammars, then training a 12B model on this in-silico supervision to learn spatial-symbolic mappings for Behavior Tree synthesis. This is followed by zero-shot transfer testing on two heterogeneous physical manipulators. No load-bearing step reduces to self-definition, fitted inputs renamed as predictions, or self-citation chains; the data generator is independent, the symbolic constraints are stated as external grammars, and empirical hardware results supply falsifiable external grounding. The approach is therefore self-contained against benchmarks outside its own fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that synthetic data from foundation models can substitute for real annotated data and that Behavior Trees under constrained grammars capture necessary task structure for transfer.

axioms (1)
  • Domain assumption: Behavior Trees under constrained symbolic grammars provide a suitable modular and reactive representation for robot task policies.
    Invoked to justify decoupling high-level symbolic decomposition from low-level motor control.
invented entities (1)
  • Automated synthetic multimodal dataset pipeline (no independent evidence)
    Purpose: generate scalable instruction-policy pairs without manual annotation.
    Procedurally created using domain randomization and foundation-model outputs; fidelity to real scenes is unverified in the abstract.


