arxiv: 2510.07869 · v4 · pith:LLTACTSWnew · submitted 2025-10-09 · 💻 cs.RO

USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots

Pith reviewed 2026-05-25 08:17 UTC · model grok-4.3

classification 💻 cs.RO
keywords underwater robotics · vision-language-action · simulation dataset · multi-task learning · robotic navigation · mobile manipulation

The pith

A simulation dataset and VLA model enable general underwater robots to follow language instructions for navigation and manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to move underwater robotics from task-specific methods toward general-purpose agents that execute varied tasks from language instructions. It constructs the USIM simulation dataset with over 905K frames across 2275 trajectories of BlueROV2 interactions and introduces the U0 model, which adds a convolution-attention-based perception module that treats target pose estimation as an auxiliary task. Evaluation uses both offline action prediction and online task success metrics, showing clear gains over prior baselines. A reader would care because underwater settings are data-scarce and hostile, so scalable simulated training could support multi-task embodied agents where real-world collection is expensive or dangerous.

Core claim

We propose USIM, a simulation-based dataset comprising over 905K frames from 2275 trajectories totaling about 25 hours of BlueROV2 interactions, and U0, a vision-language-action model with a convolution-attention-based perception module that incorporates target pose estimation as an auxiliary task. The model achieves a mean action prediction error of 0.0359 and an overall online success rate of 43.1% across tasks from obstacle-avoidance navigation to 3D mobile manipulation, 5.5 percentage points above the baselines.
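The dataset figures in the core claim can be sanity-checked with simple arithmetic. The sketch below is illustrative only: it treats the rounded values from the abstract ("over 905K" frames, "about 25 hours") as exact, and the function name `dataset_stats` is hypothetical.

```python
# Rounded figures quoted in the abstract, treated as exact for this check.
FRAMES = 905_000
TRAJECTORIES = 2275
HOURS = 25.0


def dataset_stats(frames, trajectories, hours):
    """Derive average rates from the headline dataset numbers."""
    seconds = hours * 3600.0
    return {
        "frames_per_second": frames / seconds,
        "frames_per_trajectory": frames / trajectories,
        "seconds_per_trajectory": seconds / trajectories,
    }


stats = dataset_stats(FRAMES, TRAJECTORIES, HOURS)
```

Under these assumptions the numbers work out to roughly 10 frames per second, about 400 frames per trajectory, and about 40 seconds per trajectory. The Figure 1 caption quotes different totals (561K frames, approximately 15.6 hours), which also correspond to roughly 10 frames per second.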

What carries the argument

The U0 vision-language-action model, whose convolution-attention-based perception (CAP) module uses target pose estimation as an auxiliary task to bolster spatial awareness.

If this is right

  • The USIM dataset significantly empowers existing VLA models to adapt to underwater scenarios.
  • U0 reduces the offline mean action prediction error to 0.0359.
  • U0 achieves an overall online success rate of 43.1%, a 5.5-percentage-point improvement over existing competitive baselines (reported as below 37.6%).
  • Navigation tasks reach success rates as high as 87.5%.
  • These results validate the feasibility of general-purpose intelligence in underwater robotics and provide a foundation for scalable dataset synthesis.
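The gap between 43.1% and 37.6% is an absolute difference in success rate, not a relative one. A minimal sketch of the distinction follows, assuming the strongest baseline sits at exactly 37.6% (the abstract only says "below 37.6%", so the true gains may be slightly larger); `improvement` is a hypothetical helper name.

```python
def improvement(new_rate, baseline_rate):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    abs_gain = new_rate - baseline_rate
    rel_gain = abs_gain / baseline_rate * 100.0
    return abs_gain, rel_gain


# Success rates in percent, as quoted in the abstract.
abs_gain, rel_gain = improvement(43.1, 37.6)
```

With these inputs the absolute gain is 5.5 percentage points, which corresponds to a relative gain of roughly 14.6% over the assumed baseline.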

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If sim-to-real transfer succeeds, the same synthesis pipeline could cut the cost of collecting real underwater trajectories for training.
  • The CAP auxiliary-task design might transfer to other VLA settings where spatial reasoning is weak, such as aerial or cluttered indoor robots.
  • An automated evaluation pipeline built for simulation could be reused to benchmark other aquatic or marine robotics methods without manual labeling.
  • Extending the dataset generation to include more varied water conditions would test whether the reported gains hold under greater environmental diversity.

Load-bearing premise

The simulation-based USIM dataset and automated evaluation pipeline sufficiently capture real underwater dynamics and task success without introducing artifacts that inflate reported performance gains.

What would settle it

Deploying the trained U0 model on a physical BlueROV2 in real ocean conditions and measuring whether the online success rate remains at or above 43.1%.

Figures

Figures reproduced from arXiv: 2510.07869 by Jian Wang, Junwen Gu, Laien Luo, Lianyi Yu, Luoyang Sun, Pengxuan Si, Shuang Qiu, Yukai Feng, Zhengxing Wu, Zhentao Zhang, Zhiheng Wu.

Figure 1
Figure 1: Overall Framework. Diverse underwater scenarios and a BlueROV2 robot equipped with a manipulator and gripper are first constructed using the Stonefish simulator. Data collection and control are implemented via ROS, resulting in the USIM dataset of 561K frames (approximately 15.6 hours) covering 20 tasks. Based on USIM, U0 is developed with a dual-system architecture, incorporating multimodal sensor fusion …
Figure 3
Figure 3: The effect of changes in the …
Figure 4
Figure 4: Example task trajectories of the USIM dataset, including pipeline inspection, obstacle avoidance navigation, shipwreck …
Figure 5
Figure 5: Task distribution of the USIM dataset.

… signals consist of thruster pulse-width modulation (PWM) signals and manipulator joint angles. To enable large-scale data acquisition, we developed an automated parallel data collection pipeline with task-specific execution logic. At the control level, a PID controller ensures accurate ROV pose tracking, whereas grasping tasks leverage MoveIt for manipulator planning …
Figure 6
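Figure 6: CAP module architecture.

… computation process is formulated as:

$$\mathrm{Token} = \mathrm{VLM}(\mathrm{Img}_{\mathrm{left}}, \mathrm{Img}_{\mathrm{right}}), \tag{3}$$
$$F = \mathrm{Conv}(\mathrm{Token}, \mathrm{MASK}), \tag{4}$$
$$\mathrm{Att} = \mathrm{Conv}(F), \tag{5}$$
$$T = \mathrm{MLP}(\mathrm{pool}(F \cdot \mathrm{Att})), \tag{6}$$

where $\mathrm{Img}_{\mathrm{left}}$ and $\mathrm{Img}_{\mathrm{right}}$ denote stereo images from the binocular vision sensor, $\mathrm{VLM}(\cdot)$ represents the Vision-Language model, and $\mathrm{Token}$ denotes the extracted features. $\mathrm{Conv}(\cdot)$ refers to convolutional operations, while $\mathrm{MASK}$ avoids …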
Figure 7
Figure 7: Comparison of online success rates across multiple tasks.
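The CAP computation quoted in the Figure 6 caption (Equations 3 to 6) can be sketched in NumPy to make the data flow concrete. This is not the authors' implementation. Everything beyond the four equations is an assumption: the token shapes, the 1-D convolution over the token axis, treating MASK as a 0/1 validity mask (the caption is truncated before it explains MASK), the sigmoid applied to the attention map, mean pooling, the two-layer MLP, and the 6-dimensional pose output. The VLM of Equation 3 is replaced by random token features, and the names `conv1d_same`, `init_cap_params`, and `cap_forward` are hypothetical.

```python
import numpy as np


def conv1d_same(x, kernel):
    """Zero-padded 1-D convolution (cross-correlation) along the token axis.

    x: (N, C_in); kernel: (K, C_in, C_out) with odd K. Returns (N, C_out).
    """
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], kernel.shape[2]))
    for i in range(k):
        out += xp[i:i + x.shape[0]] @ kernel[i]
    return out


def init_cap_params(rng, d_model=8, d_feat=4, hidden=16, pose_dim=6, k=3):
    """Random weights; all sizes are illustrative assumptions."""
    return {
        "w_feat": rng.normal(scale=0.1, size=(k, d_model, d_feat)),
        "w_att": rng.normal(scale=0.1, size=(k, d_feat, 1)),
        "w1": rng.normal(scale=0.1, size=(d_feat, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(scale=0.1, size=(hidden, pose_dim)),
        "b2": np.zeros(pose_dim),
    }


def cap_forward(tokens, mask, params):
    """tokens: (N, d_model) stand-in for VLM output; mask: (N,) of 0/1."""
    masked = tokens * mask[:, None]                    # assumed role of MASK
    feat = conv1d_same(masked, params["w_feat"])       # Eq. (4): F
    att_logits = conv1d_same(feat, params["w_att"])    # Eq. (5): Att
    att = 1.0 / (1.0 + np.exp(-att_logits))            # sigmoid: assumption
    pooled = (feat * att).mean(axis=0)                 # pool(F · Att): mean
    hidden = np.maximum(pooled @ params["w1"] + params["b1"], 0.0)
    return hidden @ params["w2"] + params["b2"]        # Eq. (6): T
```

Calling `cap_forward(tokens, np.ones(len(tokens)), init_cap_params(np.random.default_rng(0)))` with `tokens` of shape `(N, 8)` returns a 6-element vector. In an auxiliary-task setup, that vector would be compared against a ground-truth target pose from the simulator.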
Original abstract

Underwater environments pose unique challenges for robotic navigation and manipulation. While existing research has primarily focused on task-specific methods, studies on general-purpose intelligence for multi-task execution remain scarce. To address this gap, we propose a unified framework for general-purpose underwater robots that integrates perception and action driven by language instructions. First, we develop a data synthesis pipeline to construct USIM, a simulation-based dataset which comprises over 905K frames from 2275 trajectories, totaling approximately 25 hours of BlueROV2 interactions. Furthermore, we propose U0, a vision-language-action (VLA) model capable of executing various tasks from obstacle-avoidance navigation to three-dimensional mobile manipulation. The model features a convolution-attention-based perception (CAP) module, which incorporates target pose estimation as an auxiliary task to explicitly bolster the model's spatial awareness. For evaluation, we establish a systematic assessment framework and an automated pipeline encompassing both offline metrics and online task execution. Experimental results demonstrate that the USIM dataset significantly empowers existing VLA models to adapt to underwater scenarios. Notably, our U0 model achieves state-of-the-art performance: it reduces the offline mean action prediction error to 0.0359 and achieves an overall online success rate of 43.1%, marking a 5.5% improvement over existing competitive baselines (below 37.6%), with navigation tasks reaching as high as 87.5%. These results validate the feasibility of general-purpose intelligence in underwater robotics, providing a foundation for scalable dataset synthesis and aquatic embodied agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents USIM, a simulation-based dataset comprising over 905K frames from 2275 BlueROV2 trajectories (approximately 25 hours), and U0, a vision-language-action model featuring a convolution-attention-based perception (CAP) module with target pose estimation as an auxiliary task. It claims that USIM enables existing VLA models to adapt to underwater scenarios and that U0 achieves state-of-the-art results in simulation: offline mean action prediction error of 0.0359 and online success rate of 43.1% (5.5% above baselines below 37.6%), with navigation tasks up to 87.5%, thereby validating the feasibility of general-purpose underwater intelligence.

Significance. If the USIM simulator and automated evaluation pipeline prove faithful to real underwater dynamics, the work would provide a substantial public resource (large-scale multi-task trajectories) and a tailored VLA architecture with an explicit spatial-awareness auxiliary task, filling a gap in general-purpose rather than task-specific underwater robotics. The scale of the dataset and the systematic offline-plus-online evaluation framework are concrete strengths that could support follow-on research in aquatic embodied agents.

major comments (3)
  1. [Abstract] Abstract: The headline performance numbers (0.0359 offline error, 43.1% online success, +5.5% over baselines) are stated without any description of baseline implementations, data splits, error bars, or controls for simulation-specific artifacts; this information is required to substantiate the central empirical claim.
  2. [Abstract] Abstract and evaluation framework: All quantitative results, including the automated online success metric, are obtained exclusively inside the USIM simulator; no real-robot experiments, sim-to-real fine-tuning, or ablation of success-threshold choices are reported. This directly undermines the claim that the results “validate the feasibility of general-purpose intelligence in underwater robotics.”
  3. [Evaluation] Evaluation section: The automated pipeline for online task execution is described at a high level but lacks concrete specification of how task-success criteria are defined and whether they align with human judgment or physical deployment, leaving open the possibility that reported gains are inflated by simulator-internal loopholes.
minor comments (2)
  1. [Dataset construction] The manuscript would benefit from an explicit statement of the total number of tasks, their distribution across the 2275 trajectories, and the precise definition of the “overall online success rate.”
  2. [Model architecture] Notation for the CAP module and auxiliary loss should be introduced with a clear equation or diagram reference to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving the clarity and substantiation of our empirical claims. We address each major comment point by point below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline performance numbers (0.0359 offline error, 43.1% online success, +5.5% over baselines) are stated without any description of baseline implementations, data splits, error bars, or controls for simulation-specific artifacts; this information is required to substantiate the central empirical claim.

    Authors: We agree that the abstract requires additional context to support the reported numbers. In the revised manuscript, we will expand the relevant sections to briefly describe the baseline implementations (including model architectures and training procedures), specify the train/validation/test data splits, include error bars from multiple runs, and discuss controls for simulation artifacts such as randomization of initial conditions and environmental parameters. These additions will be placed in the abstract where space permits or in a dedicated experimental setup paragraph. revision: yes

  2. Referee: [Abstract] Abstract and evaluation framework: All quantitative results, including the automated online success metric, are obtained exclusively inside the USIM simulator; no real-robot experiments, sim-to-real fine-tuning, or ablation of success-threshold choices are reported. This directly undermines the claim that the results “validate the feasibility of general-purpose intelligence in underwater robotics.”

    Authors: We acknowledge that all reported results are obtained within the USIM simulator and that real-robot experiments or sim-to-real fine-tuning are not included. This is a genuine scope limitation of the current work. We will revise the abstract and conclusion to state that the results validate feasibility of general-purpose underwater intelligence in simulation, providing a foundation for future real-world deployment. We will also add a dedicated limitations paragraph discussing the sim-to-real gap and include an ablation on success-threshold sensitivity where possible. revision: partial

  3. Referee: [Evaluation] Evaluation section: The automated pipeline for online task execution is described at a high level but lacks concrete specification of how task-success criteria are defined and whether they align with human judgment or physical deployment, leaving open the possibility that reported gains are inflated by simulator-internal loopholes.

    Authors: We agree that the evaluation section would benefit from greater specificity. In the revised manuscript, we will expand the description of the automated pipeline to include explicit definitions of success criteria (e.g., position threshold of X meters, orientation threshold of Y degrees, and completion time limits), report results from manual human validation on a sampled subset of trajectories to assess alignment with human judgment, and discuss the criteria's relevance to physical underwater deployment. revision: yes
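The kind of explicit success criterion promised in the third response can be sketched as follows. The threshold values stand in for the unspecified "X meters" and "Y degrees" and are purely hypothetical, as are the names `SuccessThresholds`, `yaw_error_deg`, `episode_success`, and `success_rate`. Reducing orientation to a single yaw angle and computing the overall rate as a plain fraction of episodes are also simplifying assumptions, not the paper's definitions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessThresholds:
    """Hypothetical stand-ins for the unspecified X, Y, and time limit."""
    max_pos_err_m: float = 0.5
    max_yaw_err_deg: float = 15.0
    max_time_s: float = 120.0


def yaw_error_deg(yaw_deg, target_yaw_deg):
    """Smallest absolute angular difference, wrapped into [0, 180]."""
    return abs((yaw_deg - target_yaw_deg + 180.0) % 360.0 - 180.0)


def episode_success(pos, target_pos, yaw_deg, target_yaw_deg, elapsed_s,
                    th=SuccessThresholds()):
    """An episode succeeds only if all three criteria hold."""
    return (
        math.dist(pos, target_pos) <= th.max_pos_err_m
        and yaw_error_deg(yaw_deg, target_yaw_deg) <= th.max_yaw_err_deg
        and elapsed_s <= th.max_time_s
    )


def success_rate(outcomes):
    """Percentage of successful episodes; 0.0 for an empty list."""
    outcomes = list(outcomes)
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Passing a different `SuccessThresholds` instance and recomputing `success_rate` over the same episodes is one simple way to run the success-threshold sensitivity ablation mentioned in the second response.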

standing simulated objections not resolved
  • Providing real-robot experiments, sim-to-real fine-tuning results, or physical deployment validation, as the current study is confined to simulation-based data and evaluation.

Circularity Check

0 steps flagged

No significant circularity; empirical dataset-plus-model contribution with measured results.

full rationale

The paper constructs a simulation dataset (USIM) and trains/evaluates a VLA model (U0) on it, reporting directly measured offline error (0.0359) and online success rates (43.1%). No equations, derivations, or load-bearing self-citations exist that reduce any claimed result to its own inputs by construction. All quantitative claims are empirical measurements inside the simulator rather than predictions forced by fitted parameters or renamed ansatzes. This is a standard non-circular empirical ML robotics paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that simulation trajectories adequately proxy real underwater physics and that the auxiliary pose task improves spatial reasoning; no free parameters or invented entities are explicitly introduced beyond standard neural network training.

axioms (1)
  • domain assumption: Simulation data from BlueROV2 trajectories can train models that generalize to underwater task execution.
    Invoked when claiming the dataset empowers VLA models and enables general-purpose intelligence.


