USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots
Pith reviewed 2026-05-25 08:17 UTC · model grok-4.3
The pith
A simulation dataset and VLA model enable general underwater robots to follow language instructions for navigation and manipulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose USIM, a simulation-based dataset comprising over 905K frames from 2275 trajectories, totaling about 25 hours of BlueROV2 interactions, and U0, a vision-language-action model whose convolution-attention-based perception module incorporates target pose estimation as an auxiliary task. The model achieves a mean action prediction error of 0.0359 and an overall online success rate of 43.1% across tasks ranging from obstacle-avoidance navigation to 3D mobile manipulation, an improvement of 5.5 percentage points over baselines.
What carries the argument
The U0 vision-language-action model, whose convolution-attention-based perception (CAP) module uses target pose estimation as an auxiliary task to bolster spatial awareness.
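A common way to attach an auxiliary task to a policy is to add a weighted auxiliary term to the main training loss. The sketch below illustrates that pattern for action prediction plus target pose estimation. It is not the paper's loss: the L1 and squared-error terms, the `pose_weight` value, the array shapes, and the name `joint_loss` are all assumptions made for illustration.

```python
import numpy as np


def joint_loss(pred_action, true_action, pred_pose, true_pose, pose_weight=0.1):
    """Return (total, action_loss, pose_loss) for a hypothetical joint objective.

    The choice of L1 for actions, mean squared error for the target pose,
    and the default pose_weight are illustrative assumptions.
    """
    action_loss = float(np.mean(np.abs(pred_action - true_action)))
    pose_loss = float(np.mean((pred_pose - true_pose) ** 2))
    total = action_loss + pose_weight * pose_loss
    return total, action_loss, pose_loss


# Toy batch: 2 samples, 3 action dimensions, 3 pose dimensions (made-up shapes).
pred_action = np.zeros((2, 3))
true_action = np.full((2, 3), 0.5)
pred_pose = np.zeros((2, 3))
true_pose = np.ones((2, 3))

total, action_loss, pose_loss = joint_loss(pred_action, true_action, pred_pose, true_pose)
print(total, action_loss, pose_loss)
```

With this form, setting `pose_weight` to zero recovers a plain action-prediction objective, which is one way an ablation of the auxiliary task could be framed.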
If this is right
- The USIM dataset significantly empowers existing VLA models to adapt to underwater scenarios.
- U0 reduces the offline mean action prediction error to 0.0359.
- U0 achieves an overall online success rate of 43.1%, an improvement of 5.5 percentage points over existing competitive baselines (reported as below 37.6%).
- Navigation tasks reach success rates as high as 87.5%.
- These results validate the feasibility of general-purpose intelligence in underwater robotics and provide a foundation for scalable dataset synthesis.
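The headline figures imply a few derived quantities that are easy to check. The short Python sketch below computes them from the numbers quoted above. Two assumptions are made: "over 905K frames" and "about 25 hours" are treated as exact, and 37.6% is treated as the strongest baseline, so the derived values are approximate.

```python
# Figures quoted in the summary (rounded values, treated as exact here).
frames = 905_000
trajectories = 2275
hours = 25

frames_per_trajectory = frames / trajectories
implied_fps = frames / (hours * 3600)

u0_success = 43.1        # percent
baseline_success = 37.6  # percent; assumed to be the strongest baseline

absolute_gain = u0_success - baseline_success           # percentage points
relative_gain = 100 * absolute_gain / baseline_success  # percent

print(f"{frames_per_trajectory:.0f} frames per trajectory")          # 398
print(f"{implied_fps:.1f} frames per second")                        # 10.1
print(f"{absolute_gain:.1f} points, {relative_gain:.1f}% relative")  # 5.5, 14.6
```

The quoted 5.5 figure matches the absolute difference in success rate, so it reads as percentage points rather than a relative gain, which would be closer to 14.6% under the same assumption.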
Where Pith is reading between the lines
- If sim-to-real transfer succeeds, the same synthesis pipeline could cut the cost of collecting real underwater trajectories for training.
- The CAP auxiliary-task design might transfer to other VLA settings where spatial reasoning is weak, such as aerial or cluttered indoor robots.
- An automated evaluation pipeline built for simulation could be reused to benchmark other aquatic or marine robotics methods without manual labeling.
- Extending the dataset generation to include more varied water conditions would test whether the reported gains hold under greater environmental diversity.
Load-bearing premise
The simulation-based USIM dataset and automated evaluation pipeline sufficiently capture real underwater dynamics and task success without introducing artifacts that inflate reported performance gains.
What would settle it
Deploying the trained U0 model on a physical BlueROV2 in real ocean conditions and measuring whether the online success rate remains at or above 43.1%.
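A real-world trial would involve a limited number of episodes, so the comparison with 43.1% needs an uncertainty estimate. The sketch below uses the Wilson score interval for a binomial proportion to ask whether a field result is statistically consistent with the simulated rate. The trial counts (18 successes in 50 episodes) are invented for illustration and are not data from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return centre - half, centre + half


SIM_SUCCESS_RATE = 0.431  # overall online success rate reported in simulation

# Hypothetical field trial: 18 successes out of 50 episodes.
low, high = wilson_interval(successes=18, trials=50)
consistent = low <= SIM_SUCCESS_RATE <= high

print(f"interval: [{low:.3f}, {high:.3f}], consistent with simulation: {consistent}")
```

With only 50 episodes the interval is wide, which suggests that a field study would need substantially more trials to distinguish a modest sim-to-real drop from sampling noise.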
Original abstract
Underwater environments pose unique challenges for robotic navigation and manipulation. While existing research has primarily focused on task-specific methods, studies on general-purpose intelligence for multi-task execution remain scarce. To address this gap, we propose a unified framework for general-purpose underwater robots that integrates perception and action driven by language instructions. First, we develop a data synthesis pipeline to construct USIM, a simulation-based dataset which comprises over 905K frames from 2275 trajectories, totaling approximately 25 hours of BlueROV2 interactions. Furthermore, we propose U0, a vision-language-action (VLA) model capable of executing various tasks from obstacle-avoidance navigation to three-dimensional mobile manipulation. The model features a convolution-attention-based perception (CAP) module, which incorporates target pose estimation as an auxiliary task to explicitly bolster the model's spatial awareness. For evaluation, we establish a systematic assessment framework and an automated pipeline encompassing both offline metrics and online task execution. Experimental results demonstrate that the USIM dataset significantly empowers existing VLA models to adapt to underwater scenarios. Notably, our U0 model achieves state-of-the-art performance: it reduces the offline mean action prediction error to 0.0359 and achieves an overall online success rate of 43.1%, marking a 5.5% improvement over existing competitive baselines (below 37.6%), with navigation tasks reaching as high as 87.5%. These results validate the feasibility of general-purpose intelligence in underwater robotics, providing a foundation for scalable dataset synthesis and aquatic embodied agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents USIM, a simulation-based dataset comprising over 905K frames from 2275 BlueROV2 trajectories (approximately 25 hours), and U0, a vision-language-action model featuring a convolution-attention-based perception (CAP) module with target pose estimation as an auxiliary task. It claims that USIM enables existing VLA models to adapt to underwater scenarios and that U0 achieves state-of-the-art results in simulation: an offline mean action prediction error of 0.0359 and an online success rate of 43.1% (5.5 percentage points above baselines, which are reported as below 37.6%), with navigation tasks reaching up to 87.5%. On this basis it claims to validate the feasibility of general-purpose underwater intelligence.
Significance. If the USIM simulator and automated evaluation pipeline prove faithful to real underwater dynamics, the work would provide a substantial public resource (large-scale multi-task trajectories) and a tailored VLA architecture with an explicit spatial-awareness auxiliary task, filling a gap in general-purpose rather than task-specific underwater robotics. The scale of the dataset and the systematic offline-plus-online evaluation framework are concrete strengths that could support follow-on research in aquatic embodied agents.
major comments (3)
- [Abstract] Abstract: The headline performance numbers (0.0359 offline error, 43.1% online success, +5.5% over baselines) are stated without any description of baseline implementations, data splits, error bars, or controls for simulation-specific artifacts; this information is required to substantiate the central empirical claim.
- [Abstract] Abstract and evaluation framework: All quantitative results, including the automated online success metric, are obtained exclusively inside the USIM simulator; no real-robot experiments, sim-to-real fine-tuning, or ablation of success-threshold choices are reported. This directly undermines the claim that the results “validate the feasibility of general-purpose intelligence in underwater robotics.”
- [Evaluation] Evaluation section: The automated pipeline for online task execution is described at a high level but lacks concrete specification of how task-success criteria are defined and whether they align with human judgment or physical deployment, leaving open the possibility that reported gains are inflated by simulator-internal loopholes.
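To make the last comment concrete, the following sketch shows what an explicit, machine-checkable success criterion could look like for a pose-reaching task. It is an illustration of the kind of specification being requested, not the paper's pipeline: the `SuccessCriteria` fields, the threshold values, and the function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriteria:
    """Hypothetical thresholds; the paper's actual values are not given here."""
    max_position_error_m: float
    max_yaw_error_deg: float
    time_limit_s: float


def angle_diff_deg(a, b):
    """Smallest absolute difference between two headings, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def episode_succeeded(final_xyz, goal_xyz, final_yaw_deg, goal_yaw_deg,
                      elapsed_s, criteria):
    """True only if position, heading, and time limits are all satisfied."""
    position_error = math.dist(final_xyz, goal_xyz)
    yaw_error = angle_diff_deg(final_yaw_deg, goal_yaw_deg)
    return (position_error <= criteria.max_position_error_m
            and yaw_error <= criteria.max_yaw_error_deg
            and elapsed_s <= criteria.time_limit_s)


criteria = SuccessCriteria(max_position_error_m=0.3,
                           max_yaw_error_deg=15.0,
                           time_limit_s=120.0)

on_time = episode_succeeded((1.0, 2.0, -3.0), (1.1, 2.1, -3.0),
                            355.0, 5.0, elapsed_s=90.0, criteria=criteria)
too_slow = episode_succeeded((1.0, 2.0, -3.0), (1.1, 2.1, -3.0),
                             355.0, 5.0, elapsed_s=150.0, criteria=criteria)
print(on_time, too_slow)
```

Publishing thresholds in this form would also make the requested sensitivity ablation straightforward, since the same logged episodes can be re-scored under different `SuccessCriteria` values.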
minor comments (2)
- [Dataset construction] The manuscript would benefit from an explicit statement of the total number of tasks, their distribution across the 2275 trajectories, and the precise definition of the “overall online success rate.”
- [Model architecture] Notation for the CAP module and auxiliary loss should be introduced with a clear equation or diagram reference to improve reproducibility.
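The request for a precise definition of the overall success rate matters because two common definitions can disagree when tasks have different episode counts. The sketch below contrasts an episode-weighted (micro) average with a task-weighted (macro) average. The task names and counts are invented for illustration and do not come from the paper.

```python
# Hypothetical per-task results as (successes, episodes). Not from the paper.
results = {
    "navigation": (32, 40),
    "grasping": (6, 40),
    "mobile_manipulation": (10, 80),
}


def micro_success(results):
    """Pool all episodes, then divide total successes by total episodes."""
    total_successes = sum(s for s, _ in results.values())
    total_episodes = sum(n for _, n in results.values())
    return total_successes / total_episodes


def macro_success(results):
    """Average the per-task success rates, weighting every task equally."""
    rates = [s / n for s, n in results.values()]
    return sum(rates) / len(rates)


micro = micro_success(results)
macro = macro_success(results)
print(f"micro: {micro:.3f}, macro: {macro:.3f}")
```

In this made-up example the two definitions differ by several points, which is comparable in size to the reported gain over baselines, so stating which one is used would remove an ambiguity.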
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important areas for improving the clarity and substantiation of our empirical claims. We address each major comment point by point below.
Point-by-point responses
- Referee: [Abstract] Abstract: The headline performance numbers (0.0359 offline error, 43.1% online success, +5.5% over baselines) are stated without any description of baseline implementations, data splits, error bars, or controls for simulation-specific artifacts; this information is required to substantiate the central empirical claim.
  Authors: We agree that the abstract requires additional context to support the reported numbers. In the revised manuscript, we will expand the relevant sections to briefly describe the baseline implementations (including model architectures and training procedures), specify the train/validation/test data splits, include error bars from multiple runs, and discuss controls for simulation artifacts such as randomization of initial conditions and environmental parameters. These additions will be placed in the abstract where space permits or in a dedicated experimental setup paragraph.
  Revision: yes
- Referee: [Abstract] Abstract and evaluation framework: All quantitative results, including the automated online success metric, are obtained exclusively inside the USIM simulator; no real-robot experiments, sim-to-real fine-tuning, or ablation of success-threshold choices are reported. This directly undermines the claim that the results “validate the feasibility of general-purpose intelligence in underwater robotics.”
  Authors: We acknowledge that all reported results are obtained within the USIM simulator and that real-robot experiments and sim-to-real fine-tuning are not included. This is a genuine scope limitation of the current work. We will revise the abstract and conclusion to state that the results validate the feasibility of general-purpose underwater intelligence in simulation, providing a foundation for future real-world deployment. We will also add a dedicated limitations paragraph discussing the sim-to-real gap and include an ablation on success-threshold sensitivity where possible.
  Revision: partial
- Referee: [Evaluation] Evaluation section: The automated pipeline for online task execution is described at a high level but lacks concrete specification of how task-success criteria are defined and whether they align with human judgment or physical deployment, leaving open the possibility that reported gains are inflated by simulator-internal loopholes.
  Authors: We agree that the evaluation section would benefit from greater specificity. In the revised manuscript, we will expand the description of the automated pipeline to include explicit definitions of success criteria (e.g., position threshold of X meters, orientation threshold of Y degrees, and completion time limits), report results from manual human validation on a sampled subset of trajectories to assess alignment with human judgment, and discuss the criteria's relevance to physical underwater deployment.
  Revision: yes
- Providing real-robot experiments, sim-to-real fine-tuning results, or physical deployment validation, as the current study is confined to simulation-based data and evaluation.
Circularity Check
No significant circularity; empirical dataset-plus-model contribution with measured results.
The paper constructs a simulation dataset (USIM) and trains/evaluates a VLA model (U0) on it, reporting directly measured offline error (0.0359) and online success rates (43.1%). No equations, derivations, or load-bearing self-citations exist that reduce any claimed result to its own inputs by construction. All quantitative claims are empirical measurements inside the simulator rather than predictions forced by fitted parameters or renamed ansatzes. This is a standard non-circular empirical ML robotics paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: simulation data from BlueROV2 trajectories can train models that generalize to underwater task execution.