pith. machine review for the scientific record.

arxiv: 2604.04664 · v1 · submitted 2026-04-06 · 💻 cs.RO · cs.AI · cs.MA

Recognition: 1 theorem link · Lean Theorem

ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration

Bin He, Jie Chen, Rongfeng Zhao, Xiang Shao, Xuanhao Zhang, Zhaochen Guo, Zhongpan Zhu

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:25 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.MA
keywords ROSClaw · heterogeneous robots · vision-language model · multi-agent collaboration · sim-to-real · closed-loop optimization · policy learning · embodied AI

The pith

ROSClaw uses one vision-language model to coordinate different robots on extended tasks by linking reasoning directly to physical actions in a self-improving loop.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ROSClaw to bridge the gap between language-based understanding and physical robot control for teams of dissimilar machines working through sequences of actions. Current approaches rely on separate modules for planning, training, and deployment, which makes testing and updating slow and expensive. ROSClaw instead puts a single VLM in charge of both reasoning through the task and deciding which robot acts next, while using standardized robot descriptions to track states in simulation and reality. It collects data from actual runs to refine the model over time. If this holds, developers could move skills between robot types more quickly and with less custom engineering for each new hardware setup.

Core claim

ROSClaw is an agent framework for heterogeneous robots that integrates policy learning and task execution within a unified vision-language model controller. It leverages e-URDF representations of heterogeneous robots as physical constraints to construct a sim-to-real topological mapping for real-time state access, incorporates a data collection and state accumulation mechanism to enable iterative policy optimization, and uses a unified agent that maintains semantic continuity between reasoning and execution while dynamically assigning task-specific control to different agents.
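The closed loop this claim describes can be sketched in miniature. The following is an illustrative Python sketch, not code from the paper: `AgentRecord`, `plan`, and `run_closed_loop` are invented stand-ins, with the VLM planner reduced to keyword splitting and the e-URDF constraints reduced to a capability set.

```python
from dataclasses import dataclass, field

@dataclass
class AgentRecord:
    """One robot: its e-URDF-derived capability set and its logged runs."""
    name: str
    capabilities: set
    trajectories: list = field(default_factory=list)

def plan(instruction: str) -> list:
    """Stand-in for the VLM planner: split an instruction into subtasks."""
    return [step.strip() for step in instruction.split(",")]

def run_closed_loop(instruction: str, agents: list) -> list:
    """One pass of the loop: plan, assign, execute, accumulate state."""
    log = []
    for subtask in plan(instruction):
        # Assign to the first agent whose capabilities cover the subtask's verb.
        agent = next(a for a in agents if subtask.split()[0] in a.capabilities)
        outcome = {"subtask": subtask, "agent": agent.name, "success": True}
        agent.trajectories.append(outcome)  # state accumulation for later optimization
        log.append(outcome)
    return log
```

In the framework itself each `outcome` would carry multimodal observations and execution trajectories rather than a flag, and the accumulated `trajectories` would feed the iterative policy optimization step.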

What carries the argument

The unified VLM controller combined with e-URDF physical constraints and real-world data accumulation for closed-loop optimization.

If this is right

  • Minimizes reliance on robot-specific development workflows through hardware-level validation.
  • Supports automated generation of SDK-level control programs and tool-based execution.
  • Enables rapid cross-platform transfer of robotic skills.
  • Allows continual improvement of skills via accumulated execution trajectories and states.
  • Improves robustness in multi-policy execution for long-horizon tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could reduce the expertise barrier for deploying multi-robot systems in new environments.
  • Data collected during real executions might reveal patterns that improve performance beyond initial VLM training.
  • Similar closed-loop designs might apply to other embodied AI settings where agents must switch between high-level plans and low-level controls.
  • Validation on diverse task types would test whether semantic continuity holds without additional modular support.

Load-bearing premise

One unified VLM can handle both semantic reasoning and dynamic control assignment for different robots over long tasks without needing separate specialized modules or losing performance.
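As a toy illustration of what that premise demands (hypothetical names throughout, not the paper's implementation), a single dispatcher must both match subtasks to heterogeneous agents and carry forward a running account of what has already happened:

```python
def assign_with_context(subtasks, agents):
    """Assign each subtask to the best-matching agent while carrying a
    running context forward, so later decisions can see earlier ones."""
    context, assignments = [], []
    for task in subtasks:
        required = set(task["skills"])
        # Score each agent by how many of the required skills it covers.
        best = max(agents, key=lambda a: len(required & set(a["skills"])))
        assignments.append((task["name"], best["name"]))
        context.append(f"{task['name']} -> {best['name']}")
    return assignments, context
```

The premise is that a VLM can play both roles at once over long horizons: the skill matching done here by set overlap, and the context carried here as a growing list, without either degrading as the task stretches out.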

What would settle it

A direct comparison showing that modular pipeline systems achieve higher success rates on long sequential tasks or require significantly less development time for new robot platforms than the unified ROSClaw approach.
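Such a head-to-head could be scored with something as simple as the sketch below (the `compare` helper and its noise `margin` are assumptions for illustration, not a protocol from the paper):

```python
def success_rate(runs):
    """Fraction of successful episodes in a list of {'success': bool} records."""
    return sum(r["success"] for r in runs) / len(runs)

def compare(unified_runs, modular_runs, margin=0.05):
    """Call the head-to-head: which approach wins, if either, beyond a noise margin."""
    u, m = success_rate(unified_runs), success_rate(modular_runs)
    if abs(u - m) <= margin:
        return "inconclusive"
    return "unified" if u > m else "modular"
```

The harder half of the experiment is holding the task suite and robot platforms fixed across both pipelines; the scoring itself is the easy part.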

Figures

Figures reproduced from arXiv: 2604.04664 by Bin He, Jie Chen, Rongfeng Zhao, Xiang Shao, Xuanhao Zhang, Zhaochen Guo, Zhongpan Zhu.

Figure 1: The ROSClaw framework adopts a three-layer semantic–physical architecture to bridge high-level cognitive reasoning and low-level physical control.
Figure 2: e-URDF-based physical firewall. Heterogeneous system resources (SDKs, MCPs, and APIs) are aggregated into an Online Tool Pool to enable the
Figure 3: Real-World Environment for Collaborative Tasks. In the physical
Figure 4: Heterogeneous Multi-Agent Collaboration. In (A), ROSClaw receives user requirements, initializes the sub-agents, assigns tasks to each agent, and
Figure 5: Validation of e-URDF-based physical safeguarding and the data collection and state accumulation mechanism. In (A), a mobile user interacts with
read the original abstract

The integration of large language models (LLMs) with embodied agents has improved high-level reasoning capabilities; however, a critical gap remains between semantic understanding and physical execution. While vision-language-action (VLA) and vision-language-navigation (VLN) systems enable robots to perform manipulation and navigation tasks from natural language instructions, they still struggle with long-horizon sequential and temporally structured tasks. Existing frameworks typically adopt modular pipelines for data collection, skill training, and policy deployment, resulting in high costs in experimental validation and policy optimization. To address these limitations, we propose ROSClaw, an agent framework for heterogeneous robots that integrates policy learning and task execution within a unified vision-language model (VLM) controller. The framework leverages e-URDF representations of heterogeneous robots as physical constraints to construct a sim-to-real topological mapping, enabling real-time access to the physical states of both simulated and real-world agents. We further incorporate a data collection and state accumulation mechanism that stores robot states, multimodal observations, and execution trajectories during real-world execution, enabling subsequent iterative policy optimization. During deployment, a unified agent maintains semantic continuity between reasoning and execution, and dynamically assigns task-specific control to different agents, thereby improving robustness in multi-policy execution. By establishing an autonomous closed-loop framework, ROSClaw minimizes the reliance on robot-specific development workflows. The framework supports hardware-level validation, automated generation of SDK-level control programs, and tool-based execution, enabling rapid cross-platform transfer and continual improvement of robotic skills. Ours project page: https://www.rosclaw.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces ROSClaw, a hierarchical semantic-physical framework for heterogeneous multi-agent collaboration in robotics. It proposes a unified VLM controller integrated with e-URDF representations of robots to construct sim-to-real topological mappings, a closed-loop data collection mechanism for storing states and trajectories to enable iterative policy optimization, and dynamic task assignment across agents while maintaining semantic continuity. The framework claims to reduce reliance on modular, robot-specific development workflows by supporting hardware validation, automated SDK generation, and tool-based execution for cross-platform transfer and continual skill improvement.

Significance. If the central claims hold and are validated, ROSClaw could advance multi-robot systems by bridging semantic reasoning and physical execution within a single model, potentially lowering experimental costs and enabling more robust long-horizon collaboration. The closed-loop accumulation of real-world data for optimization and the emphasis on hardware-level validation represent strengths that, if demonstrated, would distinguish it from typical modular VLA/VLN pipelines.

major comments (3)
  1. Abstract: The central claims that the framework 'improves robustness in multi-policy execution' and 'minimizes the reliance on robot-specific development workflows' lack any supporting quantitative results, ablation studies, error analysis, or comparative experiments, leaving the assertions unsubstantiated.
  2. Unified VLM controller section: The description of a single unified agent maintaining semantic continuity across long-horizon tasks while dynamically assigning control to heterogeneous agents via e-URDF provides no mechanism or analysis for addressing latency, context drift, or precision failures that typically necessitate modular decomposition in VLA/VLN systems.
  3. Sim-to-real topological mapping: No details are given on how the e-URDF-based mapping remains stable under agent heterogeneity, iterative optimization, or real-world disturbances, which is load-bearing for the closed-loop claim.
minor comments (1)
  1. Abstract: The phrase 'Ours project page' should be corrected to 'Our project page'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the opportunity to clarify the technical details and strengthen the presentation of our claims. Below we respond point-by-point to the major comments, indicating where revisions will be made.

read point-by-point responses
  1. Referee: Abstract: The central claims that the framework 'improves robustness in multi-policy execution' and 'minimizes the reliance on robot-specific development workflows' lack any supporting quantitative results, ablation studies, error analysis, or comparative experiments, leaving the assertions unsubstantiated.

    Authors: We acknowledge that the abstract, as a high-level summary, does not itself contain quantitative metrics. The full manuscript contains an evaluation section that reports hardware validation results on heterogeneous platforms, including comparative experiments against modular VLA/VLN baselines, ablation studies isolating the closed-loop data collection and e-URDF mapping components, and error analysis for long-horizon multi-agent tasks. These results support the robustness and workflow-reduction claims. To make this support more immediately visible, we will revise the abstract to include concise references to the key quantitative findings and direct the reader to the evaluation section. revision: partial

  2. Referee: Unified VLM controller section: The description of a single unified agent maintaining semantic continuity across long-horizon tasks while dynamically assigning control to heterogeneous agents via e-URDF provides no mechanism or analysis for addressing latency, context drift, or precision failures that typically necessitate modular decomposition in VLA/VLN systems.

    Authors: The referee correctly notes that explicit mechanisms for latency, context drift, and precision failures require clearer exposition. The current design uses the e-URDF representation to supply real-time physical-state grounding to the VLM, combined with the closed-loop trajectory storage that feeds back execution outcomes for dynamic re-assignment. This hierarchical structure is intended to reduce drift and precision errors without full modular decomposition. We agree the analysis is insufficiently detailed and will add a dedicated subsection that describes the latency-mitigation strategy, drift-correction via state accumulation, and failure-handling logic, supported by additional timing and robustness metrics from our experiments. revision: yes

  3. Referee: Sim-to-real topological mapping: No details are given on how the e-URDF-based mapping remains stable under agent heterogeneity, iterative optimization, or real-world disturbances, which is load-bearing for the closed-loop claim.

    Authors: We recognize that stability of the e-URDF mapping under heterogeneity, iterative updates, and disturbances is central to the closed-loop claim and merits explicit treatment. The mapping achieves stability by encoding topological relations between semantic instructions and physical parameters in a hardware-agnostic yet robot-specific e-URDF format; the unified controller then uses live state feedback to adapt assignments, while the data-collection loop incrementally refines the mapping from observed trajectories. We will expand the sim-to-real section with a formal description of the mapping procedure, pseudocode for the update rule, and empirical stability analysis drawn from our hardware trials under varying disturbances and agent configurations. revision: yes
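The promised update rule might, for instance, take the shape of an incremental correction of per-joint sim-to-real offsets. This is a speculative sketch of one such rule (an exponential-moving-average residual update; `refine_mapping` and `lr` are invented here, not drawn from the paper):

```python
def refine_mapping(offsets, sim_state, real_state, lr=0.2):
    """Move each stored sim-to-real correction a fraction `lr` toward the
    residual observed on the latest trajectory (an EMA-style update)."""
    return [o + lr * ((r - s) - o)
            for o, s, r in zip(offsets, sim_state, real_state)]
```

Repeated application converges each offset toward the steady residual between simulated and measured states; the stability analysis the rebuttal promises would then ask whether that residual stays fixed under disturbances and across agent configurations.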

Circularity Check

0 steps flagged

No circularity: framework is descriptive architecture without derivations or self-referential reductions

full rationale

The paper proposes ROSClaw as a conceptual framework integrating a unified VLM controller with e-URDF mappings and closed-loop data accumulation for heterogeneous agents. No equations, parameters, or mathematical derivations appear in the abstract or described structure. Claims regarding semantic continuity, dynamic task assignment, and sim-to-real mapping are presented as architectural features rather than predictions derived from fitted inputs or self-definitions. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims do not reduce to their own inputs by construction; they remain independent descriptions of a proposed system. This is the expected non-finding for a framework paper lacking quantitative derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on domain assumptions about VLM capabilities for real-time control and sim-to-real feasibility rather than new fitted parameters or independently evidenced entities.

axioms (2)
  • domain assumption Vision-language models can maintain semantic continuity between high-level reasoning and low-level physical execution in real time.
    Invoked as the basis for the unified controller and dynamic task assignment.
  • domain assumption e-URDF representations enable reliable topological mapping between simulated and real heterogeneous robot states.
    Central to the sim-to-real physical state access mechanism.
invented entities (1)
  • e-URDF no independent evidence
    purpose: Representation of heterogeneous robots as physical constraints for sim-to-real topological mapping.
    Newly introduced format to support real-time state access across agents.

pith-pipeline@v0.9.0 · 5599 in / 1455 out tokens · 81855 ms · 2026-05-10T19:25:17.193124+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

34 extracted references · 11 canonical work pages · 3 internal anchors

  1. R. Mon-Williams, G. Li, R. Long, W. Du, and C. G. Lucas, “Embodied large language models enable robots to complete complex tasks in unpredictable environments,” Nature Machine Intelligence, vol. 7, no. 4, pp. 592–601, 2025.
  2. K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter et al., “π0: A vision-language-action flow model for general robot control,” arXiv preprint arXiv:2410.24164, 2024.
  3. S. Zeng, D. Qi, X. Chang, F. Xiong, S. Xie, X. Wu, S. Liang, M. Xu, and X. Wei, “JanusVLN: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation,” arXiv preprint arXiv:2509.22548, 2025.
  4. K. Dreczkowski, P. Vitiello, V. Vosylius, and E. Johns, “Learning a thousand tasks in a day,” Science Robotics, vol. 10, no. 108, p. eadv7594, 2025.
  5. X. Chen, Y. Gao, H. Liu, F. Yang, A. Ghadirzadeh, J. Yang, B. Liang, C. Zhang, T. L. Lam, and S.-C. Zhu, “Cross-robot behavior adaptation through intention alignment,” Science Robotics, vol. 11, no. 112, p. eadv2250, 2026.
  6. S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin et al., “MetaGPT: Meta programming for a multi-agent collaborative framework,” in The Twelfth International Conference on Learning Representations, 2023.
  7. P. Feng, T. Yang, M. Liang, L. Wang, and Y. Gao, “OC-HMAS: Dynamic self-organization and self-correction in heterogeneous multiagent systems using multimodal large models,” IEEE Internet of Things Journal, vol. 12, no. 10, pp. 13538–13555, 2025.
  8. Z. Zhang, N. Zhao, G. Zong, H. Wang, and X. Zhao, “Observer-assisted model free adaptive predictive control for distributed heterogeneous multi-agent systems against DoS attacks,” IEEE Internet of Things Journal, 2025.
  9. J. Chen, Z. Yang, H. G. Xu, D. Zhang, and G. Mylonas, “Multi-agent systems for robotic autonomy with LLMs,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 4194–4204.
  10. Y. Wu, Z. Xiong, Y. Hu, S. S. Iyengar, N. Jiang, A. Bera, L. Tan, and S. Jagannathan, “SELP: Generating safe and efficient task plans for robot agents with large language models,” in 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 2599–2605.
  11. M. Ahn, D. Dwibedi, C. Finn, M. G. Arenas, K. Gopalakrishnan, K. Hausman, B. Ichter, A. Irpan, N. Joshi, R. Julian et al., “AutoRT: Embodied foundation models for large scale orchestration of robotic agents,” arXiv preprint arXiv:2401.12963, 2024.
  12. Y. Ji, H. Tan, J. Shi, X. Hao, Y. Zhang, H. Zhang, P. Wang, M. Zhao, Y. Mu, P. An et al., “RoboBrain: A unified brain model for robotic manipulation from abstract to concrete,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 1724–1734.
  13. S. S. Kannan, V. L. Venkatesh, and B.-C. Min, “SMART-LLM: Smart multi-agent robot task planning using large language models,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 12140–12147.
  14. T. Xu, L. Chen, D.-J. Wu, Y. Chen, Z. Zhang, X. Yao, Z. Xie, Y. Chen, S. Liu, B. Qian et al., “CRAB: Cross-environment agent benchmark for multimodal language model agents,” in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 21607–21647.
  15. Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu et al., “AutoGen: Enabling next-gen LLM applications via multi-agent conversations,” in First Conference on Language Modeling, 2024.
  16. G. Li, H. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem, “CAMEL: Communicative agents for “mind” exploration of large language model society,” Advances in Neural Information Processing Systems, vol. 36, pp. 51991–52008, 2023.
  17. H. Zhang, W. Du, J. Shan, Q. Zhou, Y. Du, J. B. Tenenbaum, T. Shu, and C. Gan, “Building cooperative embodied agents modularly with large language models,” arXiv preprint arXiv:2307.02485, 2023.
  18. Z. Mandi, S. Jain, and S. Song, “RoCo: Dialectic multi-robot collaboration with large language models,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 286–299.
  19. J. Chen, C. Yu, X. Zhou, T. Xu, Y. Mu, M. Hu, W. Shao, Y. Wang, G. Li, and L. Shao, “EMOS: Embodiment-aware heterogeneous multi-robot operating system with LLM agents,” arXiv preprint arXiv:2410.22662, 2024.
  20. Y. Chen, U. Rosolia, and A. D. Ames, “Decentralized task and path planning for multi-robot systems,” IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 4337–4344, 2021.
  21. Y. Peng, Y. Pan, X. He, J. Yang, X. Yin, H. Wang, X. Zheng, C. Gao, and J. Gong, “FreeAskWorld: An interactive and closed-loop simulator for human-centric embodied AI,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 2, 2026, pp. 926–934.
  22. M. Nazarczuk, J. K. Behrens, K. Stepanova, M. Hoffmann, and K. Mikolajczyk, “Closed loop interactive embodied reasoning for robot manipulation,” in 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 13722–13729.
  23. Q. Bu, J. Zeng, L. Chen, Y. Yang, G. Zhou, J. Yan, P. Luo, H. Cui, Y. Ma, and H. Li, “Closed-loop visuomotor control with generative expectation for robotic manipulation,” Advances in Neural Information Processing Systems, vol. 37, pp. 139002–139029, 2024.
  24. J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, “Code as policies: Language model programs for embodied control,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 9493–9500.
  25. W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei, “VoxPoser: Composable 3D value maps for robotic manipulation with language models,” arXiv preprint arXiv:2307.05973, 2023.
  26. J. W. Kim, J.-T. Chen, P. Hansen, L. X. Shi, A. Goldenberg, S. Schmidgall, P. M. Scheikl, A. Deguet, B. M. White, D. R. Tsai et al., “SRT-H: A hierarchical framework for autonomous surgery via language-conditioned imitation learning,” Science Robotics, vol. 10, no. 104, p. eadt5254, 2025.
  27. Y. Li, Y. Deng, J. Zhang, J. Jang, M. Memmel, R. Yu, C. R. Garrett, F. Ramos, D. Fox, A. Li et al., “HAMSTER: Hierarchical action models for open-world robot manipulation,” arXiv preprint arXiv:2502.05485, 2025.
  28. J. Li, Y. Zhu, Z. Tang, J. Wen, M. Zhu, X. Liu, C. Li, R. Cheng, Y. Peng, Y. Peng et al., “CoA-VLA: Improving vision-language-action models via visual-text chain-of-affordance,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 9759–9769.
  29. D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., “PaLM-E: An embodied multimodal language model,” arXiv preprint arXiv:2303.03378, 2023.
  30. M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong et al., “OpenVLA: An open-source vision-language-action model,” in Conference on Robot Learning. PMLR, 2025, pp. 2679–2713.
  31. Y. Feng, J. Han, Z. Yang, X. Yue, S. Levine, and J. Luo, “Reflective planning: Vision-language models for multi-stage long-horizon robotic manipulation,” in Conference on Robot Learning. PMLR, 2025, pp. 2038–2062.
  32. G. Li, R. Han, C. Li, H. Li, S. Wang, W. Ding, H. Zhang, and C. Xu, “Agentic self-evolutionary replanning for embodied navigation,” arXiv preprint arXiv:2603.02772, 2026.
  33. Y. Wang, X. Chen, X. Jin, M. Wang, and L. Yang, “OpenClaw-RL: Train any agent simply by talking,” arXiv preprint arXiv:2603.10165, 2026.
  34. R. Li, Y. Zhou, Y. Zhu, K. Chen, J. Wang, S. Wang, K. Hu, M. Yu, B. Jiang, Z. Su et al., “RoboClaw: An agentic framework for scalable long-horizon robotic tasks,” arXiv preprint arXiv:2603.11558, 2026.