ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration
Pith reviewed 2026-05-10 19:25 UTC · model grok-4.3
The pith
ROSClaw uses one vision-language model to coordinate different robots on extended tasks by linking reasoning directly to physical actions in a self-improving loop.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ROSClaw is an agent framework for heterogeneous robots that integrates policy learning and task execution within a unified vision-language model controller. It leverages e-URDF representations of heterogeneous robots as physical constraints to construct a sim-to-real topological mapping for real-time state access, incorporates a data collection and state accumulation mechanism to enable iterative policy optimization, and uses a unified agent that maintains semantic continuity between reasoning and execution while dynamically assigning task-specific control to different agents.
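A minimal Python sketch of the closed loop this claim describes may help fix ideas: one controller object reasons over live state, assigns subtasks, and folds execution records back into optimization. Every interface here (the controller's `reason`/`optimize` methods, the agents' `read_state`/`execute`/`observe`, the record fields) is a hypothetical stand-in; the paper publishes no API.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class StateRecord:
    """One entry in the state-accumulation store: robot state,
    multimodal observation, and the executed trajectory segment."""
    robot_id: str
    state: dict
    observation: dict
    trajectory: list

@dataclass
class ClosedLoop:
    controller: Any            # unified VLM controller (hypothetical interface)
    agents: dict               # heterogeneous robot agents, keyed by id
    store: list = field(default_factory=list)

    def step(self, instruction: str) -> None:
        # 1. Reason over the instruction plus live physical state exposed
        #    through the e-URDF-based sim-to-real topological mapping.
        states = {rid: agent.read_state() for rid, agent in self.agents.items()}
        plan = self.controller.reason(instruction, states)  # {robot_id: subtask}
        # 2. Dynamically assign task-specific control while the single
        #    controller keeps the shared semantic context.
        for rid, subtask in plan.items():
            trajectory = self.agents[rid].execute(subtask)
            self.store.append(StateRecord(rid, states[rid],
                                          self.agents[rid].observe(), trajectory))
        # 3. Feed accumulated records back for iterative policy optimization.
        self.controller.optimize(self.store)
```

The only structural commitment in the sketch is that a single object owns all three phases, which is exactly the premise the referee probes below.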
What carries the argument
The unified VLM controller combined with e-URDF physical constraints and real-world data accumulation for closed-loop optimization.
If this is right
- Minimizes reliance on robot-specific development workflows via its autonomous closed loop, with hardware-level validation.
- Supports automated generation of SDK-level control programs and tool-based execution (see the sketch after this list).
- Enables rapid cross-platform transfer of robotic skills.
- Allows continual improvement of skills via accumulated execution trajectories and states.
- Improves robustness in multi-policy execution for long-horizon tasks.
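The SDK-generation bullet above is easiest to picture as a tool registry: the controller emits structured calls, and thin wrappers translate them into vendor SDK commands. The sketch below illustrates that reading only; the tool names, parameter schema, and print-based stand-ins for real SDK calls are all invented.

```python
# Hypothetical tool registry for "tool-based execution" of generated
# SDK-level control programs. Nothing here is taken from the paper.
TOOLS = {}

def tool(name):
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("move_to")
def move_to(x: float, y: float, z: float) -> str:
    # A real wrapper would call the robot vendor's SDK here.
    return f"moving end effector to ({x}, {y}, {z})"

@tool("grasp")
def grasp(width_m: float) -> str:
    return f"closing gripper to {width_m} m"

def dispatch(call: dict) -> str:
    """Execute one controller-emitted tool call (e.g. parsed from JSON)."""
    return TOOLS[call["name"]](**call["args"])

print(dispatch({"name": "move_to", "args": {"x": 0.4, "y": 0.0, "z": 0.2}}))
# -> moving end effector to (0.4, 0.0, 0.2)
```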
Where Pith is reading between the lines
- The framework could reduce the expertise barrier for deploying multi-robot systems in new environments.
- Data collected during real executions might reveal patterns that improve performance beyond initial VLM training.
- Similar closed-loop designs might apply to other embodied AI settings where agents must switch between high-level plans and low-level controls.
- Validation on diverse task types would test whether semantic continuity holds without additional modular support.
Load-bearing premise
One unified VLM can handle both semantic reasoning and dynamic control assignment for different robots over long tasks without needing separate specialized modules or losing performance.
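As a toy illustration of what dynamic control assignment could mean under e-URDF constraints, the sketch below filters candidate robots by physical capabilities that such a description might expose. The capability fields (`payload_kg`, `dof`, `mobile_base`) and the greedy matching rule are assumptions; the paper defines neither the e-URDF schema nor the assignment policy.

```python
def assign(subtasks, agents):
    """Greedily map each subtask to the first agent whose e-URDF-derived
    capabilities satisfy the subtask's physical requirements."""
    assignment = {}
    for task in subtasks:
        for rid, caps in agents.items():
            if (caps["payload_kg"] >= task["payload_kg"]
                    and caps["dof"] >= task["min_dof"]
                    and (not task["needs_mobile_base"] or caps["mobile_base"])):
                assignment[task["name"]] = rid
                break
        else:
            raise ValueError(f"no feasible agent for {task['name']}")
    return assignment

agents = {
    "arm_a":  {"payload_kg": 3.0, "dof": 7,  "mobile_base": False},
    "quad_b": {"payload_kg": 1.0, "dof": 12, "mobile_base": True},
}
subtasks = [
    {"name": "pick_box", "payload_kg": 2.0, "min_dof": 6, "needs_mobile_base": False},
    {"name": "deliver",  "payload_kg": 0.5, "min_dof": 0, "needs_mobile_base": True},
]
print(assign(subtasks, agents))  # {'pick_box': 'arm_a', 'deliver': 'quad_b'}
```

Whether one VLM can make such assignments reliably over long horizons, with constraints far richer than these, is precisely what remains untested.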
What would settle it
A direct comparison showing that modular pipeline systems achieve higher success rates on long sequential tasks or require significantly less development time for new robot platforms than the unified ROSClaw approach.
Original abstract
The integration of large language models (LLMs) with embodied agents has improved high-level reasoning capabilities; however, a critical gap remains between semantic understanding and physical execution. While vision-language-action (VLA) and vision-language-navigation (VLN) systems enable robots to perform manipulation and navigation tasks from natural language instructions, they still struggle with long-horizon sequential and temporally structured tasks. Existing frameworks typically adopt modular pipelines for data collection, skill training, and policy deployment, resulting in high costs in experimental validation and policy optimization. To address these limitations, we propose ROSClaw, an agent framework for heterogeneous robots that integrates policy learning and task execution within a unified vision-language model (VLM) controller. The framework leverages e-URDF representations of heterogeneous robots as physical constraints to construct a sim-to-real topological mapping, enabling real-time access to the physical states of both simulated and real-world agents. We further incorporate a data collection and state accumulation mechanism that stores robot states, multimodal observations, and execution trajectories during real-world execution, enabling subsequent iterative policy optimization. During deployment, a unified agent maintains semantic continuity between reasoning and execution, and dynamically assigns task-specific control to different agents, thereby improving robustness in multi-policy execution. By establishing an autonomous closed-loop framework, ROSClaw minimizes the reliance on robot-specific development workflows. The framework supports hardware-level validation, automated generation of SDK-level control programs, and tool-based execution, enabling rapid cross-platform transfer and continual improvement of robotic skills. Ours project page: https://www.rosclaw.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ROSClaw, a hierarchical semantic-physical framework for heterogeneous multi-agent collaboration in robotics. It proposes a unified VLM controller integrated with e-URDF representations of robots to construct sim-to-real topological mappings, a closed-loop data collection mechanism for storing states and trajectories to enable iterative policy optimization, and dynamic task assignment across agents while maintaining semantic continuity. The framework claims to reduce reliance on modular, robot-specific development workflows by supporting hardware validation, automated SDK generation, and tool-based execution for cross-platform transfer and continual skill improvement.
Significance. If the central claims hold and are validated, ROSClaw could advance multi-robot systems by bridging semantic reasoning and physical execution within a single model, potentially lowering experimental costs and enabling more robust long-horizon collaboration. The closed-loop accumulation of real-world data for optimization and the emphasis on hardware-level validation represent strengths that, if demonstrated, would distinguish it from typical modular VLA/VLN pipelines.
Major comments (3)
- Abstract: The central claims that the framework 'improves robustness in multi-policy execution' and 'minimizes the reliance on robot-specific development workflows' lack any supporting quantitative results, ablation studies, error analysis, or comparative experiments, leaving the assertions unsubstantiated.
- Unified VLM controller section: The description of a single unified agent maintaining semantic continuity across long-horizon tasks while dynamically assigning control to heterogeneous agents via e-URDF provides no mechanism or analysis for addressing latency, context drift, or precision failures that typically necessitate modular decomposition in VLA/VLN systems.
- Sim-to-real topological mapping: No details are given on how the e-URDF-based mapping remains stable under agent heterogeneity, iterative optimization, or real-world disturbances, which is load-bearing for the closed-loop claim.
Minor comments (1)
- Abstract: The phrase 'Ours project page' should be corrected to 'Our project page'.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the opportunity to clarify the technical details and strengthen the presentation of our claims. Below we respond point-by-point to the major comments, indicating where revisions will be made.
Point-by-point responses
- Referee: Abstract: The central claims that the framework 'improves robustness in multi-policy execution' and 'minimizes the reliance on robot-specific development workflows' lack any supporting quantitative results, ablation studies, error analysis, or comparative experiments, leaving the assertions unsubstantiated.
Authors: We acknowledge that the abstract, as a high-level summary, does not itself contain quantitative metrics. The full manuscript contains an evaluation section that reports hardware validation results on heterogeneous platforms, including comparative experiments against modular VLA/VLN baselines, ablation studies isolating the closed-loop data collection and e-URDF mapping components, and error analysis for long-horizon multi-agent tasks. These results support the robustness and workflow-reduction claims. To make this support more immediately visible, we will revise the abstract to include concise references to the key quantitative findings and direct the reader to the evaluation section. revision: partial
- Referee: Unified VLM controller section: The description of a single unified agent maintaining semantic continuity across long-horizon tasks while dynamically assigning control to heterogeneous agents via e-URDF provides no mechanism or analysis for addressing latency, context drift, or precision failures that typically necessitate modular decomposition in VLA/VLN systems.
Authors: The referee correctly notes that explicit mechanisms for latency, context drift, and precision failures require clearer exposition. The current design uses the e-URDF representation to supply real-time physical-state grounding to the VLM, combined with the closed-loop trajectory storage that feeds back execution outcomes for dynamic re-assignment. This hierarchical structure is intended to reduce drift and precision errors without full modular decomposition. We agree the analysis is insufficiently detailed and will add a dedicated subsection that describes the latency-mitigation strategy, drift-correction via state accumulation, and failure-handling logic, supported by additional timing and robustness metrics from our experiments. revision: yes
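Reading this response charitably, drift-correction via state accumulation reduces to a compare-and-replan loop over live state. The sketch below assumes that reading; `replan` is a hypothetical hook back into the controller, and the segment format and tolerance are invented.

```python
import math
from collections import deque

def execute_with_correction(segments, agent, replan, tol=0.05):
    """Execute plan segments; when the observed state drifts from the
    planned target beyond tol, rebuild the remaining plan from live state."""
    queue = deque(segments)
    while queue:
        seg = queue.popleft()          # e.g. {"action": ..., "target": (x, y, z)}
        agent.execute(seg["action"])
        observed = agent.read_state()  # live pose via the topological mapping
        if math.dist(seg["target"], observed) > tol:
            # Drift detected: replan from state feedback rather than fall
            # back to modular decomposition, per the authors' stated intent.
            queue = deque(replan(observed))
```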
- Referee: Sim-to-real topological mapping: No details are given on how the e-URDF-based mapping remains stable under agent heterogeneity, iterative optimization, or real-world disturbances, which is load-bearing for the closed-loop claim.
Authors: We recognize that stability of the e-URDF mapping under heterogeneity, iterative updates, and disturbances is central to the closed-loop claim and merits explicit treatment. The mapping achieves stability by encoding topological relations between semantic instructions and physical parameters in a hardware-agnostic yet robot-specific e-URDF format; the unified controller then uses live state feedback to adapt assignments, while the data-collection loop incrementally refines the mapping from observed trajectories. We will expand the sim-to-real section with a formal description of the mapping procedure, pseudocode for the update rule, and empirical stability analysis drawn from our hardware trials under varying disturbances and agent configurations. revision: yes
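Since the promised pseudocode is not reproduced here, one plausible shape for the update rule is sketched below, under the strong simplifying assumption that mapping refinement reduces to per-joint residual offsets averaged over an executed trajectory; the joint names and step size `alpha` are invented.

```python
def update_mapping(offsets, sim_traj, real_traj, alpha=0.1):
    """Refine per-joint sim-to-real offsets from one executed trajectory:
    offsets <- offsets + alpha * mean(real - sim), per joint."""
    n = len(sim_traj)
    for joint in offsets:
        residual = sum(real[joint] - sim[joint]
                       for sim, real in zip(sim_traj, real_traj)) / n
        offsets[joint] += alpha * residual
    return offsets

offsets = {"shoulder": 0.0, "elbow": 0.0}
sim  = [{"shoulder": 0.10, "elbow": 0.20}, {"shoulder": 0.30, "elbow": 0.40}]
real = [{"shoulder": 0.12, "elbow": 0.19}, {"shoulder": 0.33, "elbow": 0.38}]
print(update_mapping(offsets, sim, real))
# -> approximately {'shoulder': 0.0025, 'elbow': -0.0015}
```

A small step size keeps the mapping stable under iterative updates, which is the property the referee asks the authors to demonstrate empirically.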
Circularity Check
No circularity: framework is descriptive architecture without derivations or self-referential reductions
Full rationale
The paper proposes ROSClaw as a conceptual framework integrating a unified VLM controller with e-URDF mappings and closed-loop data accumulation for heterogeneous agents. No equations, parameters, or mathematical derivations appear in the abstract or described structure. Claims regarding semantic continuity, dynamic task assignment, and sim-to-real mapping are presented as architectural features rather than predictions derived from fitted inputs or self-definitions. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims do not reduce to their own inputs by construction; they remain independent descriptions of a proposed system. This is the expected non-finding for a framework paper lacking quantitative derivations.
Axiom & Free-Parameter Ledger
Axioms (2)
- domain assumption Vision-language models can maintain semantic continuity between high-level reasoning and low-level physical execution in real time.
- domain assumption e-URDF representations enable reliable topological mapping between simulated and real heterogeneous robot states.
Invented entities (1)
- e-URDF: no independent evidence