USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots

Jian Wang; Junwen Gu; Laien Luo; Lianyi Yu; Luoyang Sun; Pengxuan Si; Shuang Qiu; Yukai Feng; Zhengxing Wu; Zhentao Zhang

arxiv: 2510.07869 · v4 · pith:LLTACTSWnew · submitted 2025-10-09 · 💻 cs.RO

Junwen Gu, Zhiheng Wu, Pengxuan Si, Shuang Qiu, Zhentao Zhang, Yukai Feng, Luoyang Sun, Laien Luo, Lianyi Yu, Jian Wang, Zhengxing Wu

Pith reviewed 2026-05-25 08:17 UTC · model grok-4.3

classification 💻 cs.RO

keywords: underwater robotics, vision-language-action, simulation dataset, multi-task learning, robotic navigation, mobile manipulation

The pith

A simulation dataset and VLA model enable general underwater robots to follow language instructions for navigation and manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to move underwater robotics from task-specific methods toward general-purpose agents that execute varied tasks from language instructions. It constructs the USIM simulation dataset with over 905K frames across 2275 trajectories of BlueROV2 interactions and introduces the U0 model, which adds a convolution-attention-based perception module that treats target pose estimation as an auxiliary task. Evaluation uses both offline action prediction and online task success metrics, showing clear gains over prior baselines. A reader would care because underwater settings are data-scarce and hostile, so scalable simulated training could support multi-task embodied agents where real-world collection is expensive or dangerous.

Core claim

We propose USIM, a simulation-based dataset comprising over 905K frames from 2275 trajectories totaling about 25 hours of BlueROV2 interactions, and U0, a vision-language-action model with a convolution-attention-based perception module that incorporates target pose estimation as an auxiliary task; the model achieves a mean action prediction error of 0.0359 and an overall online success rate of 43.1% across tasks from obstacle-avoidance navigation to 3D mobile manipulation, a 5.5% improvement over baselines.
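The dataset figures in the core claim can be sanity-checked with a little arithmetic. The sketch below is not from the paper; it only divides the quoted totals, and because "over 905K" and "about 25 hours" are approximate, the results are rough. The function name `dataset_stats` is made up for illustration.

```python
# Back-of-envelope statistics from the totals quoted in the core claim.
# The inputs are approximate ("over 905K", "about 25 hours"), so the
# outputs are rough estimates, not values reported by the authors.
FRAMES = 905_000
TRAJECTORIES = 2275
HOURS = 25.0


def dataset_stats(frames, trajectories, hours):
    """Return average frame rate and per-trajectory size implied by the totals."""
    seconds = hours * 3600.0
    return {
        "frames_per_second": frames / seconds,
        "frames_per_trajectory": frames / trajectories,
        "seconds_per_trajectory": seconds / trajectories,
    }


stats = dataset_stats(FRAMES, TRAJECTORIES, HOURS)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

Under these totals the recording rate comes out at roughly 10 frames per second, with about 400 frames (around 40 seconds) per trajectory on average. The Figure 1 caption quotes 561K frames over approximately 15.6 hours, which implies about the same frame rate.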

What carries the argument

U0 vision-language-action model with convolution-attention-based perception (CAP) module that uses target pose estimation as auxiliary task to bolster spatial awareness.

If this is right

The USIM dataset significantly empowers existing VLA models to adapt to underwater scenarios.
U0 reduces the offline mean action prediction error to 0.0359.
U0 achieves an overall online success rate of 43.1%, a 5.5-percentage-point improvement over existing competitive baselines (reported as below 37.6%).
Navigation tasks reach success rates as high as 87.5%.
These results validate the feasibility of general-purpose intelligence in underwater robotics and provide a foundation for scalable dataset synthesis.
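The page does not say how the offline "mean action prediction error" is computed. As a minimal sketch, assuming it is a mean absolute error over normalized action dimensions (an assumption, not the paper's stated definition), the metric could look like this. The function name and the toy action vectors are hypothetical.

```python
def mean_action_error(predicted, target):
    """Mean absolute error over all frames and action dimensions.

    Assumption: actions are already normalized to a common scale, so
    thruster commands and joint angles can be averaged together.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target need the same number of frames")
    total, count = 0.0, 0
    for pred_frame, true_frame in zip(predicted, target):
        if len(pred_frame) != len(true_frame):
            raise ValueError("frames need the same number of action dimensions")
        for p, t in zip(pred_frame, true_frame):
            total += abs(p - t)
            count += 1
    if count == 0:
        raise ValueError("no action values to compare")
    return total / count


# Hypothetical normalized actions: two thruster commands and one joint angle.
target = [[0.10, -0.20, 0.50], [0.00, 0.30, 0.40]]
predicted = [[0.12, -0.25, 0.50], [0.02, 0.30, 0.37]]
error = mean_action_error(predicted, target)
print(f"mean action error: {error:.4f}")
```

If the paper instead uses a squared error or a per-dimension weighting, the 0.0359 figure would not be comparable to a number produced this way.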

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If sim-to-real transfer succeeds, the same synthesis pipeline could cut the cost of collecting real underwater trajectories for training.
The CAP auxiliary-task design might transfer to other VLA settings where spatial reasoning is weak, such as aerial or cluttered indoor robots.
An automated evaluation pipeline built for simulation could be reused to benchmark other aquatic or marine robotics methods without manual labeling.
Extending the dataset generation to include more varied water conditions would test whether the reported gains hold under greater environmental diversity.

Load-bearing premise

The simulation-based USIM dataset and automated evaluation pipeline sufficiently capture real underwater dynamics and task success without introducing artifacts that inflate reported performance gains.

What would settle it

Deploying the trained U0 model on a physical BlueROV2 in real ocean conditions and measuring whether the online success rate remains at or above 43.1%.

Figures

Figures reproduced from arXiv: 2510.07869 by Jian Wang, Junwen Gu, Laien Luo, Lianyi Yu, Luoyang Sun, Pengxuan Si, Shuang Qiu, Yukai Feng, Zhengxing Wu, Zhentao Zhang, Zhiheng Wu.

**Figure 1.** Overall Framework. Diverse underwater scenarios and a BlueROV2 robot equipped with a manipulator and gripper are first constructed using the Stonefish simulator. Data collection and control are implemented via ROS, resulting in the USIM dataset of 561K frames (approximately 15.6 hours) covering 20 tasks. Based on USIM, U0 is developed with a dual-system architecture, incorporating multimodal sensor fusion …

**Figure 3.** The effect of changes in the …

**Figure 4.** Example task trajectories of USIM dataset, including pipeline inspection, obstacle avoidance navigation, shipwreck …

**Figure 5.** Task distribution of the USIM dataset.

… signals consist of thruster pulse-width modulation (PWM) signals and manipulator joint angles. To enable large-scale data acquisition, we developed an automated parallel data collection pipeline with task-specific execution logic. At the control level, a PID controller ensures accurate ROV pose tracking, whereas grasping tasks leverage MoveIt for manipulator planning …
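The excerpt above mentions a PID controller for ROV pose tracking without giving details. The following is a generic single-axis sketch, not the authors' controller: the gains, the toy depth dynamics (linear drag plus a constant buoyancy offset), and the names `PID` and `simulate_depth_hold` are all assumptions chosen for illustration.

```python
class PID:
    """Textbook PID controller acting on a scalar error."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        # Skip the derivative on the first call to avoid a startup spike.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate_depth_hold(target=2.0, steps=1000, dt=0.02, buoyancy=0.3, drag=1.0):
    """Drive a toy 1-D depth model to `target`; returns the final depth.

    The plant is invented for this example: unit mass, linear drag, and a
    constant buoyancy term that the integral action has to cancel.
    """
    pid = PID(kp=4.0, ki=2.0, kd=2.0)
    depth, velocity = 0.0, 0.0
    for _ in range(steps):
        command = pid.step(target - depth, dt)
        accel = command - drag * velocity - buoyancy
        velocity += accel * dt
        depth += velocity * dt
    return depth


final_depth = simulate_depth_hold()
print(f"final depth: {final_depth:.3f} m")
```

A real BlueROV2 controller would track a full pose and map the commands to thruster PWM values; this sketch only shows the feedback structure on one axis.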

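**Figure 6.** CAP module architecture.

… computation process is formulated as:

$$\mathrm{Token} = \mathrm{VLM}(\mathrm{Img}_{\text{left}}, \mathrm{Img}_{\text{right}}), \tag{3}$$

$$F = \mathrm{Conv}(\mathrm{Token}, \mathrm{MASK}), \tag{4}$$

$$\mathrm{Att} = \mathrm{Conv}(F), \tag{5}$$

$$T = \mathrm{MLP}(\mathrm{pool}(F \cdot \mathrm{Att})), \tag{6}$$

where $\mathrm{Img}_{\text{left}}$ and $\mathrm{Img}_{\text{right}}$ denote stereo images from the binocular vision sensor, $\mathrm{VLM}(\cdot)$ represents the Vision-Language model, and $\mathrm{Token}$ denotes the extracted features. $\mathrm{Conv}(\cdot)$ refers to convolutional operations, while $\mathrm{MASK}$ avoids…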
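To make the data flow of Equations (3) to (6) concrete, here is a small pure-Python sketch. It is a simplification under several assumptions, not the paper's implementation: the VLM of Eq. (3) is replaced by random stand-in tokens, the convolutions are 1-D over the token axis with one kernel shared by all channels, the pooling is a mean, and MASK is treated as a 0/1 flag per token (the source text is cut off before it says what MASK avoids). All function names, sizes, and weights are hypothetical.

```python
import random


def conv1d(tokens, kernel, mask=None):
    """1-D convolution along the token axis with zero padding.

    tokens: list of N feature vectors (lists of C floats).
    kernel: odd-length list of weights shared by every channel.
    mask:   optional list of N 0/1 flags; masked tokens are skipped on
            input and zeroed on output.
    """
    n, c, half = len(tokens), len(tokens[0]), len(kernel) // 2
    keep = mask if mask is not None else [1] * n
    out = []
    for i in range(n):
        row = [0.0] * c
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n and keep[j]:
                for ch in range(c):
                    row[ch] += w * tokens[j][ch]
        out.append([v * keep[i] for v in row])
    return out


def pool(features):
    """Mean over the token axis, giving one value per channel."""
    return [sum(col) / len(features) for col in zip(*features)]


def mlp(x, w1, w2):
    """Two-layer perceptron with a ReLU in between."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def cap_head(tokens, mask, k_feat, k_att, w1, w2):
    f = conv1d(tokens, k_feat, mask)                       # Eq. (4)
    att = conv1d(f, k_att)                                 # Eq. (5)
    weighted = [[a * b for a, b in zip(fr, ar)]            # F · Att
                for fr, ar in zip(f, att)]
    return mlp(pool(weighted), w1, w2)                     # Eq. (6)


random.seed(0)
N_TOKENS, CHANNELS, HIDDEN, POSE_DIM = 6, 4, 8, 3  # illustrative sizes only
tokens = [[random.uniform(-1, 1) for _ in range(CHANNELS)]
          for _ in range(N_TOKENS)]                # stand-in for Eq. (3)
mask = [1, 1, 1, 1, 0, 0]
k_feat, k_att = [0.25, 0.5, 0.25], [-0.5, 1.0, -0.5]
w1 = [[random.uniform(-1, 1) for _ in range(CHANNELS)] for _ in range(HIDDEN)]
w2 = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(POSE_DIM)]

pose = cap_head(tokens, mask, k_feat, k_att, w1, w2)
print(pose)
```

In this sketch the output `pose` plays the role of $T$, read here as the auxiliary target pose estimate. Because masked tokens are skipped on input and zeroed on output, changing their values leaves `pose` unchanged.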

**Figure 7.** Comparison of online success rates across multiple tasks.

Original abstract

Underwater environments pose unique challenges for robotic navigation and manipulation. While existing research has primarily focused on task-specific methods, studies on general-purpose intelligence for multi-task execution remain scarce. To address this gap, we propose a unified framework for general-purpose underwater robots that integrates perception and action driven by language instructions. First, we develop a data synthesis pipeline to construct USIM, a simulation-based dataset which comprises over 905K frames from 2275 trajectories, totaling approximately 25 hours of BlueROV2 interactions. Furthermore, we propose U0, a vision-language-action (VLA) model capable of executing various tasks from obstacle-avoidance navigation to three-dimensional mobile manipulation. The model features a convolution-attention-based perception (CAP) module, which incorporates target pose estimation as an auxiliary task to explicitly bolster the model's spatial awareness. For evaluation, we establish a systematic assessment framework and an automated pipeline encompassing both offline metrics and online task execution. Experimental results demonstrate that the USIM dataset significantly empowers existing VLA models to adapt to underwater scenarios. Notably, our U0 model achieves state-of-the-art performance: it reduces the offline mean action prediction error to 0.0359 and achieves an overall online success rate of 43.1%, marking a 5.5% improvement over existing competitive baselines (below 37.6%), with navigation tasks reaching as high as 87.5%. These results validate the feasibility of general-purpose intelligence in underwater robotics, providing a foundation for scalable dataset synthesis and aquatic embodied agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

New sim dataset and VLA model for underwater robots, but every number stays inside the simulator with no real-robot tests.

The main things to know are that the paper builds a simulation dataset of BlueROV2 trajectories and trains a VLA model on it that reports modest gains over baselines, all inside that same simulator. The dataset and the specific model tweaks are new for this domain. The USIM pipeline generates 905K frames from 2275 trajectories, roughly 25 hours of data, which is a practical way to get volume where real underwater collection is expensive. The U0 model adds a convolution-attention perception module and treats target pose estimation as an auxiliary task to strengthen spatial reasoning. They also define an offline action prediction metric and an online task success pipeline covering navigation and 3D manipulation from language instructions. Those pieces give a concrete starting point for underwater VLA work that prior papers did not supply. The reported numbers are 0.0359 mean action error and 43.1% overall success rate, 5.5 points above the baselines they tested, with navigation at 87.5%. That shows the data and the auxiliary task can move the needle inside the sim. The central weakness is the complete absence of real-robot results or any sim-to-real transfer experiments. The stress-test note is right on this: the hydrodynamics, sensor noise, lighting, and automated success criteria are all simulator-internal, so the claim that this validates general-purpose underwater intelligence rests on untested assumptions about transfer. Details on baseline re-implementations, data splits, and how success thresholds were chosen are also thin in the abstract, which makes it hard to judge whether the 5.5% edge is stable. This paper is useful for researchers who need a first underwater VLA benchmark or dataset to build on, especially if the data is released. It is not yet ready for anyone who needs evidence that policies will work outside the sim. 
It deserves peer review because the dataset contribution is tangible and the experimental setup is systematic enough to be worth referee scrutiny on the sim-to-real gap and metric robustness.

Referee Report

3 major / 2 minor

Summary. The paper presents USIM, a simulation-based dataset comprising over 905K frames from 2275 BlueROV2 trajectories (approximately 25 hours), and U0, a vision-language-action model featuring a convolution-attention-based perception (CAP) module with target pose estimation as an auxiliary task. It claims that USIM enables existing VLA models to adapt to underwater scenarios and that U0 achieves state-of-the-art results in simulation: offline mean action prediction error of 0.0359 and online success rate of 43.1% (5.5% above baselines below 37.6%), with navigation tasks up to 87.5%, thereby validating the feasibility of general-purpose underwater intelligence.

Significance. If the USIM simulator and automated evaluation pipeline prove faithful to real underwater dynamics, the work would provide a substantial public resource (large-scale multi-task trajectories) and a tailored VLA architecture with an explicit spatial-awareness auxiliary task, filling a gap in general-purpose rather than task-specific underwater robotics. The scale of the dataset and the systematic offline-plus-online evaluation framework are concrete strengths that could support follow-on research in aquatic embodied agents.

major comments (3)

[Abstract] Abstract: The headline performance numbers (0.0359 offline error, 43.1% online success, +5.5% over baselines) are stated without any description of baseline implementations, data splits, error bars, or controls for simulation-specific artifacts; this information is required to substantiate the central empirical claim.
[Abstract] Abstract and evaluation framework: All quantitative results, including the automated online success metric, are obtained exclusively inside the USIM simulator; no real-robot experiments, sim-to-real fine-tuning, or ablation of success-threshold choices are reported. This directly undermines the claim that the results “validate the feasibility of general-purpose intelligence in underwater robotics.”
[Evaluation] Evaluation section: The automated pipeline for online task execution is described at a high level but lacks concrete specification of how task-success criteria are defined and whether they align with human judgment or physical deployment, leaving open the possibility that reported gains are inflated by simulator-internal loopholes.
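The first major comment asks for error bars. One simple way to see why they matter is a Wilson score interval on a success rate. The sketch below uses a made-up trial count, since the page does not report how many online episodes were run; only the formula is standard, and the function name is chosen for illustration.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# HYPOTHETICAL episode counts, chosen only to land near the quoted rates
# (about 43% for U0 and about 37.5% for a baseline). Not from the paper.
TRIALS = 200
u0_low, u0_high = wilson_interval(86, TRIALS)
base_low, base_high = wilson_interval(75, TRIALS)

print(f"U0:       [{u0_low:.3f}, {u0_high:.3f}]")
print(f"baseline: [{base_low:.3f}, {base_high:.3f}]")
print("intervals overlap:", u0_low < base_high)
```

With this assumed count of 200 episodes per model the two intervals overlap, so whether a gap of about 5.5 points is stable depends on the actual number of episodes and on run-to-run variance, neither of which is given here.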

minor comments (2)

[Dataset construction] The manuscript would benefit from an explicit statement of the total number of tasks, their distribution across the 2275 trajectories, and the precise definition of the “overall online success rate.”
[Model architecture] Notation for the CAP module and auxiliary loss should be introduced with a clear equation or diagram reference to improve reproducibility.
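The first minor comment asks for a precise definition of the "overall online success rate". The sketch below shows why the definition matters: pooling all episodes (micro average) and averaging per-task rates (macro average) can give different numbers. The task names echo ones mentioned on this page, but the episode counts are invented for illustration.

```python
def micro_success_rate(results):
    """Pool every episode, then divide: larger tasks carry more weight."""
    successes = sum(s for s, _ in results.values())
    episodes = sum(n for _, n in results.values())
    return successes / episodes


def macro_success_rate(results):
    """Average the per-task rates: every task carries equal weight."""
    return sum(s / n for s, n in results.values()) / len(results)


# HYPOTHETICAL (successes, episodes) per task; not taken from the paper.
results = {
    "navigation": (35, 40),
    "inspection": (12, 20),
    "grasping": (3, 20),
}

micro = micro_success_rate(results)
macro = macro_success_rate(results)
print(f"micro: {micro:.3f}  macro: {macro:.3f}")
```

When tasks have unequal episode counts, as in this toy table, the two averages diverge, so a headline figure such as 43.1% is only interpretable once the averaging scheme and the task distribution are stated.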

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving the clarity and substantiation of our empirical claims. We address each major comment point by point below.

Referee: [Abstract] Abstract: The headline performance numbers (0.0359 offline error, 43.1% online success, +5.5% over baselines) are stated without any description of baseline implementations, data splits, error bars, or controls for simulation-specific artifacts; this information is required to substantiate the central empirical claim.

Authors: We agree that the abstract requires additional context to support the reported numbers. In the revised manuscript, we will expand the relevant sections to briefly describe the baseline implementations (including model architectures and training procedures), specify the train/validation/test data splits, include error bars from multiple runs, and discuss controls for simulation artifacts such as randomization of initial conditions and environmental parameters. These additions will be placed in the abstract where space permits or in a dedicated experimental setup paragraph. revision: yes
Referee: [Abstract] Abstract and evaluation framework: All quantitative results, including the automated online success metric, are obtained exclusively inside the USIM simulator; no real-robot experiments, sim-to-real fine-tuning, or ablation of success-threshold choices are reported. This directly undermines the claim that the results “validate the feasibility of general-purpose intelligence in underwater robotics.”

Authors: We acknowledge that all reported results are obtained within the USIM simulator and that real-robot experiments or sim-to-real fine-tuning are not included. This is a genuine scope limitation of the current work. We will revise the abstract and conclusion to state that the results validate feasibility of general-purpose underwater intelligence in simulation, providing a foundation for future real-world deployment. We will also add a dedicated limitations paragraph discussing the sim-to-real gap and include an ablation on success-threshold sensitivity where possible. revision: partial
Referee: [Evaluation] Evaluation section: The automated pipeline for online task execution is described at a high level but lacks concrete specification of how task-success criteria are defined and whether they align with human judgment or physical deployment, leaving open the possibility that reported gains are inflated by simulator-internal loopholes.

Authors: We agree that the evaluation section would benefit from greater specificity. In the revised manuscript, we will expand the description of the automated pipeline to include explicit definitions of success criteria (e.g., position threshold of X meters, orientation threshold of Y degrees, and completion time limits), report results from manual human validation on a sampled subset of trajectories to assess alignment with human judgment, and discuss the criteria's relevance to physical underwater deployment. revision: yes
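The response above leaves the thresholds as placeholders (X meters, Y degrees, a time limit). As a minimal sketch of what such an automated check could look like, assuming success means ending within a position tolerance, a heading tolerance, and a time budget, the code below uses made-up threshold values. None of the names or numbers come from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class SuccessCriteria:
    max_position_error_m: float   # the "X meters" placeholder
    max_yaw_error_deg: float      # the "Y degrees" placeholder
    time_limit_s: float


def yaw_error_deg(yaw_a, yaw_b):
    """Smallest absolute difference between two headings, in degrees."""
    return abs((yaw_a - yaw_b + 180.0) % 360.0 - 180.0)


def is_success(final_xyz, final_yaw_deg, goal_xyz, goal_yaw_deg,
               elapsed_s, criteria):
    """True if the episode ended close enough to the goal and in time."""
    position_error = math.dist(final_xyz, goal_xyz)
    return (position_error <= criteria.max_position_error_m
            and yaw_error_deg(final_yaw_deg, goal_yaw_deg)
            <= criteria.max_yaw_error_deg
            and elapsed_s <= criteria.time_limit_s)


# HYPOTHETICAL thresholds for illustration only.
criteria = SuccessCriteria(max_position_error_m=0.5,
                           max_yaw_error_deg=15.0,
                           time_limit_s=120.0)

reached = is_success((1.0, 2.0, -3.0), 355.0, (1.2, 2.1, -3.0), 5.0,
                     elapsed_s=90.0, criteria=criteria)
timed_out = is_success((1.0, 2.0, -3.0), 355.0, (1.2, 2.1, -3.0), 5.0,
                       elapsed_s=150.0, criteria=criteria)
print(reached, timed_out)
```

The heading difference wraps around 360 degrees, so 355° and 5° count as 10° apart. A threshold-sensitivity ablation of the kind the referee requests would amount to sweeping the fields of `SuccessCriteria` and recording how the success rate changes.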

standing simulated objections · not resolved

Providing real-robot experiments, sim-to-real fine-tuning results, or physical deployment validation, as the current study is confined to simulation-based data and evaluation.

Circularity Check

0 steps flagged

No significant circularity; empirical dataset-plus-model contribution with measured results.

The paper constructs a simulation dataset (USIM) and trains/evaluates a VLA model (U0) on it, reporting directly measured offline error (0.0359) and online success rates (43.1%). No equations, derivations, or load-bearing self-citations exist that reduce any claimed result to its own inputs by construction. All quantitative claims are empirical measurements inside the simulator rather than predictions forced by fitted parameters or renamed ansatzes. This is a standard non-circular empirical ML robotics paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that simulation trajectories adequately proxy real underwater physics and that the auxiliary pose task improves spatial reasoning; no free parameters or invented entities are explicitly introduced beyond standard neural network training.

axioms (1)

domain assumption: Simulation data from BlueROV2 trajectories can train models that generalize to underwater task execution.
Invoked when claiming the dataset empowers VLA models and enables general-purpose intelligence.

pith-pipeline@v0.9.0 · 5847 in / 1338 out tokens · 57322 ms · 2026-05-25T08:17:43.728352+00:00 · methodology


[1] [1]

A bioinspired multimotion modality underwater micro- robot,

T. Liuet al., “A bioinspired multimotion modality underwater micro- robot,”Sci. Adv., vol. 11, no. 19, May 2025, Art. no. eadu2527

work page 2025

[2] [2]

Explo- ration of underwater life with an acoustically controlled soft robotic fish,

R. K. Katzschmann, J. DelPreto, R. MacCurdy, and D. Rus, “Explo- ration of underwater life with an acoustically controlled soft robotic fish,”Sci. Robot., vol. 3, no. 16, Mar. 2018, Art. no. eaar3449

work page 2018

[3] [3]

Hitter: A humanoid table tennis robot via hierarchical planning and learning,

Z. Suet al., “Hitter: A humanoid table tennis robot via hierarchical planning and learning,”arXiv preprint arXiv:2508.21043, Sep. 2025

work page arXiv 2025

[4] [4]

Quart–online: Latency–free multimodal large language model for quadruped robot learning,

X. Tonget al., “Quart–online: Latency–free multimodal large language model for quadruped robot learning,” inProc. 2025 IEEE Int. Conf. Robot. Autom. (ICRA), Atlanta, GA, USA, May 2025, pp. 9533–9539

work page 2025

[5] [5]

Self–improving autonomous underwater manipulation,

R. Liu, H. Ha, M. Hou, S. Song, and C. V ondrick, “Self–improving autonomous underwater manipulation,” inProc. 2025 IEEE Int. Conf. Robot. Autom. (ICRA), Atlanta, GA, USA, May 2025, pp. 16 915– 16 922

work page 2025

[6] [6]

A shared autonomy system for precise and efficient remote underwater manipulation,

A. Phung, G. Billings, A. F. Daniele, M. R. Walter, and R. Camilli, “A shared autonomy system for precise and efficient remote underwater manipulation,”IEEE Trans. Robot., vol. 40, pp. 4147–4159, 2024

work page 2024

[7] [7]

Angler: An autonomy framework for intervention tasks with lightweight underwater vehicle manipulator systems,

E. Palmer, C. Holm, and G. Hollinger, “Angler: An autonomy framework for intervention tasks with lightweight underwater vehicle manipulator systems,” inProc. 2024 IEEE Int. Conf. Robot. Autom. (ICRA), Yokohama, Japan, May 2024, pp. 6126–6132

work page 2024

[8] [8]

Stonefish: An advanced open–source simulation tool designed for marine robotics, with a ros interface,

P. Cieslak, “Stonefish: An advanced open–source simulation tool designed for marine robotics, with a ros interface,” inProc. OCEANS 2019 – Marseille, Marseille, France, Jun. 2019, pp. 1–6

work page 2019

[9] [9]

Stonefish: Supporting machine learning research in marine robotics,

M. Grimaldiet al., “Stonefish: Supporting machine learning research in marine robotics,” inProc. 2025 IEEE Int. Conf. Robot. Autom. (ICRA), Atlanta, GA, USA, May 2025, pp. 13 605–13 611

work page 2025

[10] [10]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

NVIDIAet al., “Gr00t N1: An open foundation model for generalist humanoid robots,”arXiv preprint arXiv:2503.14734, Mar. 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Deformation control and thrust analysis of a flexible fishtail with muscle-like actuation,

J. Gu, J. Wang, Z. Liu, M. Tan, J. Yu, and Z. Wu, “Deformation control and thrust analysis of a flexible fishtail with muscle-like actuation,” IEEE Trans. Robot., vol. 41, pp. 159–179, 2025

work page 2025

[12] [12]

Intelligent path planning of underwater robot based on reinforcement learning,

J. Yang, J. Ni, M. Xi, J. Wen, and Y . Li, “Intelligent path planning of underwater robot based on reinforcement learning,”IEEE Trans. Automat. Sci. Eng., vol. 20, no. 3, pp. 1983–1996, Jul. 2023

work page 1983

[13] [13]

Multi– agent generative adversarial interactive self–imitation learning for auv formation control and obstacle avoidance,

Z. Fang, T. Chen, T. Shen, D. Jiang, Z. Zhang, and G. Li, “Multi– agent generative adversarial interactive self–imitation learning for auv formation control and obstacle avoidance,”IEEE Robot. Autom. Lett., vol. 10, no. 5, pp. 4356–4363, May 2025

work page 2025

[14] [14]

Target tracking control of a biomimetic underwater vehicle through deep reinforcement learning,

Y . Wanget al., “Target tracking control of a biomimetic underwater vehicle through deep reinforcement learning,”IEEE Trans. Neural Netw. Learning Syst., vol. 33, no. 8, pp. 3741–3752, Aug. 2022

work page 2022

[15] I. Masmitja et al., “Dynamic robotic tracking of underwater targets using reinforcement learning,” Sci. Robot., vol. 8, no. 80, Jul. 2023, Art. no. eade7811

[16] X. Lin et al., “Uivnav: Underwater information-driven vision-based navigation via imitation learning,” in Proc. 2024 IEEE Int. Conf. Robot. Autom. (ICRA), Yokohama, Japan, May 2024, pp. 5250–5256

[17] J. Gao, Y. Li, Y. Chen, Y. He, and J. Guo, “An improved SAC-based deep reinforcement learning framework for collaborative pushing and grasping in underwater environments,” IEEE Trans. Instrum. Meas., vol. 73, pp. 1–14, 2024

[18] M. Buchholz, I. Carlucho, M. Grimaldi, and Y. R. Petillot, “Distributed AI agents for cognitive underwater robot autonomy,” arXiv preprint arXiv:2507.23735, Aug. 2025

[19] A. Gomez Chavez, A. Ranieri, D. Chiarella, and A. Birk, “Underwater vision-based gesture recognition: A robustness validation for safe human–robot interaction,” IEEE Robot. Automat. Mag., vol. 28, no. 3, pp. 67–78, Sep. 2021

[20] A. Khazatsky et al., “DROID: A large-scale in-the-wild robot manipulation dataset,” arXiv preprint arXiv:2403.12945, Apr. 2025

[21] A. O’Neill et al., “Open X-embodiment: Robotic learning datasets and RT-X models: Open X-embodiment collaboration,” in Proc. 2024 IEEE Int. Conf. Robot. Autom. (ICRA), Yokohama, Japan, 2024, pp. 6892–6903

[22] AgiBot-World-Contributors et al., “AgiBot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems,” arXiv preprint arXiv:2503.06669, Aug. 2025

[23] B. Zitkovich et al., “RT-2: Vision-language-action models transfer web knowledge to robotic control,” in Proc. 7th Conf. Robot Learning (CoRL), vol. 229, Nov. 2023, pp. 2165–2183

[24] M. J. Kim et al., “OpenVLA: An open-source vision-language-action model,” arXiv preprint arXiv:2406.09246, Sep. 2024

[25] K. Black et al., “$\pi_0$: A vision-language-action flow model for general robot control,” arXiv preprint arXiv:2410.24164, Nov. 2024

[26] P. Georg Olofsson Zwilgmeyer, M. Yip, A. Langeland Teigen, R. Mester, and A. Stahl, “The varos synthetic underwater data set: Towards realistic multi-sensor underwater data with ground truth,” in Proc. 2021 IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Montreal, BC, Canada, Oct. 2021, pp. 3715–3723

[27] C. Li et al., “An underwater image enhancement benchmark dataset and beyond,” IEEE Trans. on Image Process., vol. 29, pp. 4376–4389, 2020

[28] M. Ferrera, V. Creuze, J. Moras, and P. Trouvé-Peloux, “Aqualoc: An underwater dataset for visual-inertial-pressure localization,” Int. J. Robot. Res., vol. 38, no. 14, pp. 1549–1559, Dec. 2019

[29] A. Mallios, E. Vidal, R. Campos, and M. Carreras, “Underwater caves sonar data set,” Int. J. Robot. Res., vol. 36, no. 12, pp. 1247–1251, Oct. 2017

[30] Z. Wu, Z. Wu, X. Chen, Y. Lu, and J. Yu, “Self-supervised underwater image generation for underwater domain pre-training,” IEEE Trans. Instrum. Meas., vol. 73, pp. 1–14, 2024

[31] Z. Huang, M. Buchholz, M. Grimaldi, H. Yu, I. Carlucho, and Y. R. Petillot, “Urobench: Comparative analyses of underwater robotics simulators from reinforcement learning perspective,” in Proc. OCEANS 2024 – Singapore, Singapore, Singapore, Apr. 2024, pp. 1–8

[32] W. Liu, K. Bai, X. He, S. Song, C. Zheng, and X. Liu, “Fishgym: A high-performance physics-based simulation framework for underwater robot learning,” in Proc. 2022 IEEE Int. Conf. Robot. Autom. (ICRA), Philadelphia, PA, USA, May 2022, pp. 6268–6275