arxiv: 2605.07764 · v1 · submitted 2026-05-08 · 💻 cs.RO

CommandSwarm: Safety-Aware Natural Language-to-Behavior-Tree Generation for Robotic Swarms

Pith reviewed 2026-05-11 02:50 UTC · model grok-4.3

classification 💻 cs.RO
keywords natural language interfaces · behavior trees · robotic swarms · LoRA adaptation · safety filtering · large language models · swarm control · parser validation

The pith

A safety pipeline around LoRA-adapted LLMs lifts valid behavior-tree generation for robot swarms from zero to 72 percent syntactic acceptance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how natural-language commands can be turned into executable behavior trees for groups of robots without producing unsafe or malformed outputs. It combines translation, safety filters, constrained prompts, and a parser that checks against a fixed list of allowed swarm actions. When Falcon3-Instruct-10B is adapted with LoRA on synthetic examples, zero-shot BLEU rises from 0.267 to 0.663 and parser-accepted trees jump from 0 percent to 72 percent. Few-shot prompting helps some models but the adaptation delivers the largest reliable gains. The work demonstrates that generation quality by itself is not enough and that explicit validation steps remain essential for practical use.

Core claim

CommandSwarm integrates multilingual translation, command-level safety filtering, constrained prompting, a LoRA-adapted 10B LLM, and deterministic parser validation to produce XML behavior trees from speech or text. On representative swarm scenarios, LoRA adaptation of Falcon3-Instruct-10B raises zero-shot BLEU from 0.267 to 0.663, ROUGE-L from 0.366 to 0.692, and parser-accepted syntactic validity from 0 percent to 72 percent. Without adaptation, Falcon3-Instruct-10B and Mistral-7B-v3 reach BLEU above 0.60 with few-shot prompts alone.

What carries the argument

The safety-aware language-to-behavior-tree pipeline that chains translation, safety filtering, constrained LLM prompting, and whitelist-based parser validation.
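The whitelist check at the end of that chain can be sketched in a few lines. This is a minimal illustration only: the tag names, the `name` attribute, and the primitive names below are assumptions made for the example, not the paper's actual schema or whitelist.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary; the paper's real whitelist is not reproduced here.
CONTROL_NODES = {"BehaviorTree", "Sequence", "Selector"}
ALLOWED_PRIMITIVES = {"form_line", "avoid_obstacle", "move_to_target"}


def validate_bt(xml_text):
    """Return (accepted, reason) for a candidate XML behavior tree."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Malformed output never reaches the robots.
        return False, f"malformed XML: {exc}"
    if root.tag != "BehaviorTree":
        return False, f"unexpected root <{root.tag}>"
    for node in root.iter():  # iter() also visits the root itself
        if node.tag in CONTROL_NODES:
            continue
        if node.tag in ("Action", "Condition"):
            name = node.get("name")
            if name not in ALLOWED_PRIMITIVES:
                return False, f"primitive not in whitelist: {name!r}"
            continue
        return False, f"unknown tag <{node.tag}>"
    return True, "ok"


good = ('<BehaviorTree><Sequence><Action name="form_line"/>'
        '<Action name="move_to_target"/></Sequence></BehaviorTree>')
print(validate_bt(good))
print(validate_bt('<BehaviorTree><Action name="self_destruct"/></BehaviorTree>'))
```

Because the check is deterministic, a tree either passes or fails regardless of how fluent the generated text looks, which is why the summary treats parser acceptance as a gate separate from BLEU or ROUGE-L.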

If this is right

  • Compact quantized LLMs can produce useful swarm behavior trees when placed inside a validated pipeline.
  • Parser acceptance and safety filtering stay necessary even after adaptation improves generation scores.
  • Few-shot prompting raises baseline quality for several models but adaptation yields stronger zero-shot results.
  • Multilingual front-end models such as SeamlessM4T v2-large balance quality and speed for non-English commands.
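To make the zero-shot versus few-shot distinction concrete, the sketch below builds a constrained prompt with a configurable number of worked examples. The prompt wording, the `build_prompt` helper, and the example primitives are hypothetical; the paper's actual prompt template is not shown on this page.

```python
def build_prompt(command, primitives, examples=(), shots=0):
    """Assemble a constrained prompt with `shots` in-context examples.

    `examples` is a sequence of (instruction, tree) pairs; only the first
    `shots` pairs are included, so shots=0 yields a zero-shot prompt.
    """
    lines = [
        "Translate the command into an XML behavior tree.",
        "Use only these primitives: " + ", ".join(sorted(primitives)) + ".",
    ]
    for instruction, tree in list(examples)[:shots]:
        lines += [f"Command: {instruction}", f"Tree: {tree}"]
    lines += [f"Command: {command}", "Tree:"]
    return "\n".join(lines)


# Hypothetical primitives and demonstration pairs, for illustration only.
PRIMITIVES = {"form_line", "move_to_target"}
EXAMPLES = [
    ("line up", '<BehaviorTree><Action name="form_line"/></BehaviorTree>'),
    ("go to the goal", '<BehaviorTree><Action name="move_to_target"/></BehaviorTree>'),
]

print(build_prompt("line up, then go to the goal", PRIMITIVES, EXAMPLES, shots=2))
```

Listing the allowed primitives in the prompt narrows what the model is asked to emit, but it does not guarantee compliance, so a downstream parser check is still needed.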

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Non-expert users could direct complex multi-robot tasks without programming if the synthetic data generalizes to live operations.
  • The same pipeline structure could be applied to other robot control languages beyond behavior trees.
  • Adding execution feedback loops might let the system learn new safe primitives over time without expanding the whitelist manually.

Load-bearing premise

The 2,063 synthetic instruction-BT examples and the fixed whitelist of swarm primitives represent the commands and safety limits that real users will actually need.

What would settle it

Run the full pipeline with non-expert operators giving varied spoken commands to physical robot swarms and measure whether any unsafe or unsupported behaviors are executed.

Figures

Figures reproduced from arXiv: 2605.07764 by Amjad Yousef Majid, Mohammed Majid.

Figure 1: CommandSwarm system overview. User speech or text is translated into English, filtered for safety, converted by an LLM into an XML BT, and …

Figure 2: Distribution of behavior names in the synthetic instruction–BT …

Figure 3: Translation latency and quality. Top row: Whisper-medium versus SeamlessM4T v2-large for speech translation. Bottom row: EuroLLM-1.7B versus …

Figure 4: Stage-one comparison of eleven 4-bit quantized LLMs under zero-shot, one-shot, and two-shot prompting. Left: BLEU. Middle: ROUGE-L. Right: …

Figure 5: Stage-two evaluation of the strongest three LLMs on 50 held-out behavior-tree examples. Left: BLEU. Middle: ROUGE-L. Right: syntactic …

Figure 6: Prompt-engineered Falcon3 versus LoRA-adapted Falcon3-FT on 50 held-out examples. Left: BLEU. Middle: ROUGE-L. Right: syntactic correctness.
Original abstract

Natural-language interfaces can make swarm robotics more accessible to non-expert operators, but they must translate ambiguous user intent into executable swarm behaviors without unsupported actions, malformed programs, or unsafe plans. This paper presents CommandSwarm, a safety-aware language-to-behavior-tree pipeline for generating XML behavior trees (BTs) from speech or text commands. The system combines multilingual translation, command-level safety filtering, constrained prompting, a LoRA-adapted large language model (LLM), and deterministic parser validation against a whitelist of executable swarm primitives. We evaluate eleven open 6.7B--14B parameter LLMs, all using 4-bit quantization, on representative swarm-control scenarios under zero-shot, one-shot, and two-shot prompting. Falcon3-Instruct-10B and Mistral-7B-v3 are the strongest prompt-engineered candidates, reaching BLEU scores above 0.60 and high syntactic validity in few-shot settings. LoRA adaptation of Falcon3-Instruct-10B on a 2,063-example synthetic instruction--BT corpus improves zero-shot BLEU from 0.267 to 0.663, ROUGE-L from 0.366 to 0.692, and parser-accepted syntactic validity from 0% to 72%. Translation experiments further show that SeamlessM4T v2-large and EuroLLM-9B provide the best quality-latency trade-offs for the multilingual front end. The results indicate that compact, quantized, domain-adapted LLMs can generate useful swarm BTs when embedded in a validated systems pipeline. They also show that parser acceptance and safety filtering remain necessary execution gates; generation quality alone is not sufficient for autonomous deployment.
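The abstract's point that safety filtering and parser acceptance act as execution gates can be sketched as a pipeline in which either gate can stop a command before anything is executed. All stage functions here are illustrative stubs supplied by the caller; none of them reproduces the paper's translation model, safety filter, LLM, or parser.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PipelineResult:
    accepted: bool
    decided_by: str              # stage that made the final decision
    tree: Optional[str] = None   # set only when the tree is accepted


def run_pipeline(command: str,
                 translate: Callable[[str], str],
                 is_safe: Callable[[str], bool],
                 generate: Callable[[str], str],
                 validate: Callable[[str], bool]) -> PipelineResult:
    english = translate(command)
    if not is_safe(english):          # gate 1: command-level safety filter
        return PipelineResult(False, "safety_filter")
    candidate = generate(english)     # LLM call in a real system
    if not validate(candidate):       # gate 2: deterministic parser check
        return PipelineResult(False, "parser")
    return PipelineResult(True, "parser", candidate)


# Stub stages, purely hypothetical stand-ins for the real components.
def demo_translate(text): return text.strip().lower()
def demo_is_safe(text): return "attack" not in text
def demo_generate(text): return "<BehaviorTree/>" if "line" in text else "not xml"
def demo_validate(tree): return tree.startswith("<BehaviorTree")


stages = (demo_translate, demo_is_safe, demo_generate, demo_validate)
print(run_pipeline("Form a LINE", *stages))
print(run_pipeline("attack the operator", *stages))
```

Passing the stages in as callables keeps the gating logic independent of any particular model, which matches the abstract's framing of the LLM as one component inside a validated systems pipeline.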

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents CommandSwarm, a safety-aware natural language to behavior tree (BT) generation pipeline for robotic swarms. It integrates multilingual translation, command-level safety filtering, constrained prompting, LoRA adaptation of LLMs on a synthetic 2,063-example corpus, and deterministic parser validation against a whitelist of primitives. Evaluations across eleven 6.7B-14B LLMs under zero-, one-, and two-shot prompting show Falcon3-Instruct-10B and Mistral-7B-v3 as strong baselines, with LoRA adaptation yielding substantial gains in BLEU (0.267 to 0.663), ROUGE-L (0.366 to 0.692), and parser-accepted validity (0% to 72%) for the adapted model.

Significance. If the synthetic corpus adequately represents real user intents and the system performs well in physical deployments, this could be a significant contribution to human-swarm interaction by enabling non-experts to command complex swarm behaviors safely. The strength lies in the end-to-end validated pipeline rather than generation alone, and the systematic comparison of multiple models and prompting methods offers valuable insights for the field. The use of open, quantized models also supports reproducibility and deployment on resource-constrained platforms.

major comments (1)
  1. [Abstract and Results] The central quantitative claims rely on metrics computed against held-out synthetic instruction-BT pairs from the same distribution as the fine-tuning data. This setup does not address whether the generated BTs would be executable or safe in real robotic swarms, as no closed-loop experiments or human validation studies are reported.
minor comments (2)
  1. [Abstract] No error bars, standard deviations, or details on the number of evaluation runs are provided for the BLEU, ROUGE-L, and validity rates.
  2. The paper would benefit from including a few concrete examples of input commands and corresponding generated BTs in the main text or appendix to illustrate the output quality.
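As an illustration of the first minor comment, the sketch below computes a Wilson score interval for a validity rate. It assumes the 72% figure was measured on the 50 held-out examples mentioned in the figure captions, that is 36 accepted trees out of 50 in a single run; the page does not confirm that count, so treat the inputs as hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


# Assumed counts: 36 of 50 trees accepted (72%).
low, high = wilson_interval(36, 50)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under those assumptions the interval spans roughly 0.58 to 0.83, which shows how much uncertainty a single 50-example evaluation leaves around the headline rate.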

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive feedback and positive assessment of the pipeline's potential significance. We address the major comment below, clarifying the scope of our synthetic evaluation while acknowledging its limitations for real-world claims.

Point-by-point responses
  1. Referee: [Abstract and Results] The central quantitative claims rely on metrics computed against held-out synthetic instruction-BT pairs from the same distribution as the fine-tuning data. This setup does not address whether the generated BTs would be executable or safe in real robotic swarms, as no closed-loop experiments or human validation studies are reported.

    Authors: We agree that the reported metrics (BLEU, ROUGE-L, and parser-accepted validity) are computed on held-out synthetic data drawn from the same distribution as the 2,063-example LoRA training corpus, and that the manuscript contains no closed-loop robotic experiments or human validation studies. Our contribution focuses on the integrated pipeline—multilingual translation, command-level safety filtering, constrained prompting, LoRA adaptation, and deterministic parser validation against executable primitives—rather than end-to-end physical deployment. The abstract and results already note that “parser acceptance and safety filtering remain necessary execution gates; generation quality alone is not sufficient for autonomous deployment.” We have revised the abstract, results discussion, and conclusion to more explicitly frame the synthetic metrics as evidence of generation quality within the controlled domain, to state that real executability and safety require the downstream filters and parser, and to outline future physical validation as necessary next steps. revision: partial

standing simulated objections not resolved
  • We cannot add closed-loop experiments or human validation studies in the current revision, as these require physical swarm hardware, real-time execution environments, and user studies that are outside the scope and resources of this work.

Circularity Check

0 steps flagged

No significant circularity; empirical metrics measured directly on held-out synthetic data

Full rationale

The paper reports standard machine-learning evaluation results: BLEU/ROUGE-L scores and parser validity percentages computed on held-out examples from the same 2,063-example synthetic corpus used for LoRA fine-tuning. These quantities are obtained via independent, off-the-shelf metrics and a deterministic whitelist parser; they are not algebraically or definitionally forced by the fine-tuning procedure itself. No equations appear in the provided text, no self-definitional loops exist, and no load-bearing self-citations or ansatz smuggling are invoked to justify the central claims. The evaluation pipeline (translation, safety filter, parser) supplies external checks that remain logically independent of the reported generation scores. This is a conventional empirical setup whose results stand or fall on the representativeness of the synthetic data rather than on any internal reduction to the inputs.
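To illustrate why such metrics are independent of the fine-tuning procedure, here is a minimal ROUGE-L sketch: the score depends only on a reference sequence and a candidate sequence. It assumes whitespace tokenization and the balanced F1 form; the paper's exact metric configuration is not stated on this page.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy strings, not taken from the paper's corpus.
print(rouge_l_f1("a b c d", "a c d"))
```

Nothing in the computation refers to model weights or training data, so a high score can only come from overlap between the generated tree and the held-out reference.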

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions about behavior-tree expressiveness for swarm tasks and the adequacy of a fixed primitive whitelist; no new physical entities or free parameters beyond conventional LLM training are introduced.

axioms (1)
  • domain assumption: Behavior trees are a suitable formalism for representing safe, executable swarm behaviors from natural language.
    Invoked in the pipeline design and parser validation step.
