arxiv: 2604.10096 · v2 · submitted 2026-04-11 · 💻 cs.CV

ABot-Claw: A Foundation for Persistent, Cooperative, and Self-Evolving Robotic Agents

Dongjie Huo, Haoyun Liu, Guoqing Liu, Dekang Qi, Zhiming Sun, Maoguo Gao, Jianxin He, Yandan Yang, Xinyuan Chang, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu

Pith reviewed 2026-05-10 16:19 UTC · model grok-4.3

classification 💻 cs.CV

keywords: ABot-Claw, embodied robotics, OpenClaw, multimodal memory, closed-loop feedback, robotic agents, self-evolving systems, cooperative robots

The pith

ABot-Claw extends OpenClaw into a decoupled three-layer system that closes the loop from natural language intent to physical robot execution across heterogeneous machines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the persistent gap between high-level AI planning and reliable low-level control when robots must operate for long periods in unpredictable settings. ABot-Claw adds a unified embodiment interface, visual-centric multimodal memory, and critic-based feedback with a generalist reward model on top of the OpenClaw runtime. These pieces sit in a shared service layer between the core runtime and individual robot hardware. A sympathetic reader would care because the design aims to turn one-shot robotic responses into persistent, cooperative behavior that improves over time without constant human oversight.

Core claim

ABot-Claw integrates a unified embodiment interface with capability-driven scheduling for heterogeneous robot coordination, a visual-centric cross-embodiment multimodal memory for persistent context retention and grounded retrieval, and a critic-based closed-loop feedback mechanism with a generalist reward model for online progress evaluation, local correction, and replanning. Its decoupled architecture spans the OpenClaw layer, shared service layer, and robot embodiment layer to enable real-world interaction, close the loop from natural language intent to physical action, and support progressively self-evolving robotic agents in open, dynamic environments.

What carries the argument

The decoupled three-layer architecture (OpenClaw runtime, shared service layer, robot embodiment layer) together with capability-driven scheduling, visual-centric multimodal memory, and critic-based closed-loop feedback.
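
The paper gives no pseudocode for capability-driven scheduling, so the following is only an illustrative sketch of the idea: each robot advertises a set of capabilities, each task states what it requires, and the scheduler picks an idle robot whose capabilities cover the task. The names `Robot`, `Task`, and `schedule`, the capability strings, and the tightest-fit tie-breaking rule are all assumptions made for this example, not part of ABot-Claw.

```python
from dataclasses import dataclass


@dataclass
class Robot:
    """Hypothetical embodiment-layer entry: a robot and the capabilities it advertises."""
    name: str
    capabilities: frozenset
    busy: bool = False


@dataclass
class Task:
    """Hypothetical task from the planning layer, with the capabilities it requires."""
    description: str
    requires: frozenset


def schedule(task, robots):
    """Return the idle robot that covers the task with the fewest spare capabilities.

    Choosing the tightest fit (an assumption of this sketch) keeps more general
    robots free for later tasks. Returns None when no idle robot qualifies.
    """
    candidates = [r for r in robots if not r.busy and task.requires <= r.capabilities]
    if not candidates:
        return None
    return min(candidates, key=lambda r: (len(r.capabilities - task.requires), r.name))


robots = [
    Robot("arm_a", frozenset({"grasp", "place"})),
    Robot("mobile_b", frozenset({"navigate", "grasp", "place"})),
    Robot("quadruped_c", frozenset({"navigate", "inspect"})),
]

fetch = Task("bring the cup to the table", frozenset({"navigate", "grasp"}))
chosen = schedule(fetch, robots)
print(chosen.name if chosen else "no robot available")  # prints mobile_b
```

The point of the sketch is that the planner never names a specific machine: adding a new robot only means registering its capability set, which is one plausible reading of what a unified embodiment interface would buy.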

If this is right

Heterogeneous robots receive tasks through a single capability-driven scheduler rather than custom per-machine code.
Visual-centric memory supplies grounded context across long sessions and different robot bodies.
The critic and generalist reward model enable local corrections and replanning without restarting entire plans.
Agents accumulate experience that supports progressive self-evolution during ongoing operation.
The system sustains execution in open environments where conditions change unpredictably.
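
The local-correction and replanning behavior listed above can be sketched as a small control loop. This is a hedged illustration only: the paper does not specify the loop, and `run_with_critic`, the score threshold, the retry and replan limits, and the toy `execute`, `critic`, and `replan` functions are hypothetical stand-ins (a real critic would be the generalist reward model scoring observations).

```python
def run_with_critic(plan, execute, critic, replan,
                    threshold=0.5, max_retries=1, max_replans=2):
    """Run plan steps in order, scoring each attempt with a critic.

    A low score triggers a local retry of the same step. When retries are
    exhausted, only the remainder of the plan is replaced by `replan`;
    steps that already succeeded are not repeated.
    """
    log = []
    replans = 0
    queue = list(plan)
    while queue:
        step = queue.pop(0)
        for attempt in range(max_retries + 1):
            observation = execute(step)
            score = critic(step, observation)
            log.append((step, attempt, score))
            if score >= threshold:
                break  # step judged successful, move on
        else:
            # Every attempt scored below the threshold.
            if replans >= max_replans:
                return False, log
            replans += 1
            queue = list(replan(step, queue))
    return True, log


# Toy stand-ins (assumptions): one grasp strategy always fails.
def execute(step):
    return {"step": step, "ok": step != "grasp cup from top"}


def critic(step, observation):
    return 1.0 if observation["ok"] else 0.0


def replan(failed_step, remaining):
    return ["grasp cup from side"] + remaining


ok, log = run_with_critic(
    ["move to cup", "grasp cup from top", "place cup"], execute, critic, replan
)
print(ok, [step for step, _, _ in log])
```

In this toy run the failing grasp is retried once, then swapped for an alternative, and the already completed "move to cup" step is never re-executed.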

Load-bearing premise

The three proposed components can be implemented and integrated on heterogeneous physical robots without introducing prohibitive latency, instability, or hardware-specific failures.

What would settle it

Deploy the full ABot-Claw stack on at least two dissimilar robots, run a multi-step collaborative task in a changing workspace for several hours, and check whether coordination, memory retrieval, and error correction remain stable without external resets or excessive delays.
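
One way such a trial could be scored is sketched below. It assumes a run log with one record per closed-loop cycle; the `LoopRecord` fields, the 500 ms latency ceiling, and the two-hour minimum duration are illustrative assumptions, since the paper states no thresholds.

```python
from dataclasses import dataclass


@dataclass
class LoopRecord:
    """One closed-loop cycle from a hypothetical run log."""
    t_seconds: float      # time since the start of the run
    latency_ms: float     # intent-to-action latency for this cycle
    external_reset: bool  # True if a human had to reset the system


def summarize(records, max_latency_ms=500.0, min_duration_s=2 * 3600):
    """Summarize a run and report whether it meets the assumed stability criteria."""
    duration = records[-1].t_seconds - records[0].t_seconds if records else 0.0
    resets = sum(r.external_reset for r in records)
    worst = max((r.latency_ms for r in records), default=0.0)
    return {
        "duration_s": duration,
        "external_resets": resets,
        "worst_latency_ms": worst,
        "passed": duration >= min_duration_s and resets == 0 and worst <= max_latency_ms,
    }


run = [
    LoopRecord(0.0, 120.0, False),
    LoopRecord(3600.0, 180.0, False),
    LoopRecord(7200.0, 240.0, False),
]
report = summarize(run)
print(report["passed"], report["worst_latency_ms"])  # prints True 240.0
```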

Original abstract

Current embodied intelligent systems still face a substantial gap between high-level reasoning and low-level physical execution in open-world environments. Although Vision-Language-Action (VLA) models provide strong perception and intuitive responses, their open-loop nature limits long-horizon performance. Agents incorporating System 2 cognitive mechanisms improve planning, but usually operate in closed sandboxes with predefined toolkits and limited real-system control. OpenClaw provides a localized runtime with full system privileges, but lacks the embodied control architecture required for long-duration, multi-robot execution. We therefore propose ABot-Claw, an embodied extension of OpenClaw that integrates: 1) a unified embodiment interface with capability-driven scheduling for heterogeneous robot coordination; 2) a visual-centric cross-embodiment multimodal memory for persistent context retention and grounded retrieval; and 3) a critic-based closed-loop feedback mechanism with a generalist reward model for online progress evaluation, local correction, and replanning. With a decoupled architecture spanning the OpenClaw layer, shared service layer, and robot embodiment layer, ABot-Claw enables real-world interaction, closes the loop from natural language intent to physical action, and supports progressively self-evolving robotic agents in open, dynamic environments.
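
To make the second component concrete, here is a minimal sketch of a memory shared across robots and queried by embedding similarity. The abstract does not describe the memory's structure, so `MemoryEntry`, `SharedMemory`, the three-dimensional toy embeddings, and the cosine-similarity ranking are all assumptions; a real system would presumably obtain embeddings from a vision-language encoder.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    """Hypothetical record: which robot saw what, when, plus a visual embedding."""
    robot: str
    timestamp: float
    caption: str
    embedding: tuple


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SharedMemory:
    """Service-layer store that any robot can write to and query."""

    def __init__(self):
        self.entries = []

    def write(self, entry):
        self.entries.append(entry)

    def retrieve(self, query_embedding, k=1):
        """Return the k entries most similar to the query embedding."""
        ranked = sorted(
            self.entries,
            key=lambda e: cosine(query_embedding, e.embedding),
            reverse=True,
        )
        return ranked[:k]


memory = SharedMemory()
memory.write(MemoryEntry("mobile_b", 10.0, "red cup on kitchen counter", (0.9, 0.1, 0.0)))
memory.write(MemoryEntry("arm_a", 12.0, "toolbox under workbench", (0.0, 0.2, 0.9)))

# A different robot can query what mobile_b observed earlier.
hit = memory.retrieve((1.0, 0.0, 0.0), k=1)[0]
print(hit.robot, hit.caption)  # prints mobile_b red cup on kitchen counter
```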

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ABot-Claw is a clean but untested architecture sketch for linking language goals to persistent multi-robot execution.

The main takeaway is that this paper describes a layered system called ABot-Claw to let agents follow natural language instructions across multiple robots for extended periods. It adds capability-driven scheduling, visual-centric memory, and a critic loop on top of OpenClaw and VLA models, with the goal of closing the open-loop gap and allowing self-evolution in real environments. The decoupled structure across runtime, services, and hardware layers is the clearest part of the proposal and makes sense for handling different robot types without forcing everything into one tight bundle. The three components are laid out plainly enough that someone could use them as a starting blueprint for coordination, grounded recall, and online correction. The paper does a reasonable job naming the practical problems in current embodied setups, like limited long-horizon control and lack of persistence.

That said, everything rests on descriptions alone. No experiments, no simulations, no latency numbers, and no integration details appear in the text. Claims that the system enables real-world interaction or supports self-evolving agents stay conditional on the pieces working together without conflicts, excessive delays, or unstable corrections from the generalist reward model. Hardware heterogeneity and scale issues for the memory are not analyzed.

This is the kind of work that could interest people building multi-robot systems or long-term autonomy stacks who want high-level structure ideas rather than finished results. It shows coherent thinking about the needed pieces even if the evidence is thin, so it deserves a serious referee who can ask for concrete validation steps. I would send it to peer review with the clear expectation that reviewers will require at least prototype data or analysis of the integration risks.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ABot-Claw as an embodied extension of OpenClaw for robotic agents operating in open-world environments. It integrates three components within a decoupled architecture (OpenClaw layer, shared service layer, robot embodiment layer): (1) a unified embodiment interface with capability-driven scheduling for heterogeneous robot coordination, (2) a visual-centric cross-embodiment multimodal memory for persistent context retention and grounded retrieval, and (3) a critic-based closed-loop feedback mechanism using a generalist reward model for online progress evaluation, local correction, and replanning. The central claim is that this design enables real-world interaction, closes the loop from natural language intent to physical action, and supports progressively self-evolving robotic agents.

Significance. If the proposed components can be implemented and validated on physical hardware, the work could meaningfully advance embodied AI by bridging the gap between high-level VLA reasoning and reliable long-horizon physical execution while adding persistence and self-correction. The emphasis on heterogeneous coordination and visual-centric memory offers a concrete architectural direction beyond sandboxed System-2 agents, though the absence of any empirical grounding makes the significance prospective rather than demonstrated.

major comments (2)

[Abstract] The assertions that the architecture 'enables real-world interaction, closes the loop from natural language intent to physical action, and supports progressively self-evolving robotic agents' rest entirely on high-level component descriptions without any reported experiments, ablation studies, latency bounds, stability analysis, or hardware-interface details. This renders the 'enables' and 'supports' claims unsubstantiated.
[Abstract] The load-bearing assumption that capability-driven scheduling, visual-centric memory, and the generalist reward model can be integrated on heterogeneous physical robots without prohibitive latency, instability, or hardware-specific failures receives no quantitative analysis or pseudocode, directly undermining the central claim of a functional closed-loop system.

minor comments (1)

[Abstract] The term 'generalist reward model' is introduced without reference to its training procedure, data sources, or relation to existing reward models in the VLA or LLM literature, reducing clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the prospective significance of ABot-Claw in bridging high-level reasoning with physical execution. The manuscript presents a conceptual architecture rather than an implemented system with empirical results. We address each major comment below and indicate where revisions will be made to better align claims with the current scope.

Referee: [Abstract] The assertions that the architecture 'enables real-world interaction, closes the loop from natural language intent to physical action, and supports progressively self-evolving robotic agents' rest entirely on high-level component descriptions without any reported experiments, ablation studies, latency bounds, stability analysis, or hardware-interface details. This renders the 'enables' and 'supports' claims unsubstantiated.

Authors: We agree that the abstract employs forward-looking language that exceeds what is demonstrated in the current manuscript. The full text details the three components and decoupled layers to explain how they target the identified gaps in VLA models and sandboxed agents, but no experiments or quantitative evaluations are reported. This is a foundational design paper. In revision we will replace 'enables' and 'supports' with 'is designed to enable' and 'aims to support' and add an explicit limitations section outlining the need for future hardware validation.
revision: partial

Referee: [Abstract] The load-bearing assumption that capability-driven scheduling, visual-centric memory, and the generalist reward model can be integrated on heterogeneous physical robots without prohibitive latency, instability, or hardware-specific failures receives no quantitative analysis or pseudocode, directly undermining the central claim of a functional closed-loop system.

Authors: The manuscript argues for integration feasibility through the capability-driven scheduler and shared service layer, with component interactions described in the architecture section. No pseudocode or latency bounds are provided because the focus remains on high-level principles rather than low-level implementation. We accept that this leaves practicality unproven. We will add high-level pseudocode for scheduling and critic-driven replanning plus a qualitative discussion of latency considerations arising from the decoupled design; full quantitative hardware analysis requires a prototype and is noted as future work.
revision: partial

Circularity Check

0 steps flagged

No significant circularity in the architectural proposal.

The manuscript describes a high-level system architecture (OpenClaw layer + shared services + embodiment layer) together with three design elements (capability-driven scheduling, visual-centric memory, critic-based feedback) whose purpose is stated to be enabling real-world interaction and self-evolution. No equations, fitted parameters, or quantitative derivations appear in the provided text. Claims are presented as consequences of the proposed integration rather than as outputs derived from prior results by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no known empirical patterns are renamed as novel results. The derivation chain therefore remains self-contained at the conceptual level.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that the described interfaces and mechanisms can be realized on physical hardware; no free parameters, formal axioms, or new physical entities are explicitly introduced beyond the named software layers.

axioms (2)

domain assumption: Vision-Language-Action models provide strong perception and intuitive responses
Stated as background in the abstract without further justification.
domain assumption: Agents incorporating System 2 cognitive mechanisms improve planning
Presented as established fact before describing limitations of current implementations.

invented entities (1)

ABot-Claw architecture (no independent evidence)
purpose: Unified embodiment interface, visual-centric memory, and critic feedback for long-horizon robotic execution
The paper introduces this named system as the primary contribution; no independent falsifiable prediction (e.g., specific performance metric on a public benchmark) is supplied.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?
cs.AI 2026-05 unverdicted novelty 7.0

VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.
