pith. machine review for the scientific record.

arxiv: 2604.19404 · v1 · submitted 2026-04-21 · 💻 cs.RO · cs.AI

Recognition: unknown

M²GRPO: Mamba-based Multi-Agent Group Relative Policy Optimization for Biomimetic Underwater Robots Pursuit

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:16 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords multi-agent reinforcement learning · Mamba state space model · group relative policy optimization · cooperative pursuit · underwater robots · biomimetic systems · CTDE · partial observability

The pith

A Mamba-based multi-agent policy optimization method raises pursuit success and efficiency for biomimetic underwater robot teams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops M²GRPO to handle long-horizon decisions, partial information, and robot coordination in underwater pursuit tasks. It pairs a selective state-space Mamba policy that processes observation history and relational features with group-relative advantages obtained by normalizing rewards across agents in each episode. This combination runs under centralized training with decentralized execution and aims to deliver stable updates while cutting training demands. If the approach holds, multi-robot systems could coordinate more reliably in real underwater settings without the heavy compute costs of prior methods.
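
As a reading aid, here is a minimal discretized selective state-space step in the spirit of Mamba [23]. Everything here is illustrative — the abstract does not specify the paper's layer widths, projections, or discretization — so treat it as a sketch of the mechanism, not the authors' architecture.

    import numpy as np

    def selective_ssm_step(h, x, W_delta, W_B, W_C, A):
        """One simplified Mamba-style update (hypothetical shapes).

        h : (d, n) hidden state, x : (d,) input features at this step.
        A : (d, n) fixed state-decay matrix (negative entries).
        The step size and the B/C projections are computed FROM the
        input, which is what makes the state space 'selective'.
        """
        delta = np.log1p(np.exp(x @ W_delta))   # softplus step size, shape (d,)
        B = x @ W_B                             # input projection, shape (n,)
        C = x @ W_C                             # output projection, shape (n,)
        A_bar = np.exp(delta[:, None] * A)      # zero-order-hold discretization
        h = A_bar * h + (delta[:, None] * B[None, :]) * x[:, None]
        y = h @ C                               # per-channel output, shape (d,)
        return h, y

Rolling this step over the observation history gives the long-horizon summary the policy conditions on; per the abstract, this feeds, together with attention-based relational features, into the action head.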

Core claim

The central claim is that integrating a Mamba policy—which uses observation history to capture temporal dependencies and attention-based features to encode inter-agent interactions—with group-relative policy optimization under the CTDE paradigm produces higher pursuit success rates and capture efficiency than MAPPO or recurrent baselines, as shown by extensive simulations and real-world pool experiments across team sizes and evader strategies.

What carries the argument

The selective state-space Mamba policy with attention-based relational encoding, paired with group-relative advantage normalization that computes each agent's advantage by normalizing its episode reward against the group's within that episode.
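
Concretely, the group-relative step can be as small as this sketch (names are illustrative; the abstract does not pin down whether a standard-deviation term or per-timestep shaping is used):

    import numpy as np

    def group_relative_advantages(episode_returns, eps=1e-8):
        """Normalize per-agent returns within one episode's group.

        episode_returns : (num_agents,) total rewards from one episode.
        The group mean serves as the baseline -- no learned critic --
        and the group std keeps the advantage scale comparable across
        episodes: the GRPO recipe [27] lifted to a multi-agent group.
        """
        r = np.asarray(episode_returns, dtype=float)
        return (r - r.mean()) / (r.std() + eps)

    # Three pursuers in one hypothetical episode:
    # group_relative_advantages([12.0, 9.5, 4.0]) ≈ [ 1.05, 0.30, -1.35]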

If this is right

  • Enables stable policy updates with lower training resource needs for multi-agent coordination.
  • Produces bounded continuous actions through normalized Gaussian sampling suitable for robot actuators (one possible realization is sketched after this list).
  • Maintains performance gains across varying team scales and different evader behaviors in both simulation and physical tests.
  • Supports decentralized execution after centralized training for practical deployment on individual robots.
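
On the second bullet: "normalized Gaussian sampling" admits several realizations. A common one — assumed here, not confirmed by the abstract — is a tanh-squashed Gaussian rescaled to actuator limits:

    import numpy as np

    def sample_bounded_action(mu, log_std, low, high, rng=np.random):
        """Sample a continuous action and squash it into [low, high].

        mu, log_std : per-dimension policy-head outputs.  The tanh
        bounds the raw Gaussian sample to (-1, 1); the affine map then
        rescales to the actuator range.  One standard construction,
        not necessarily the paper's exact scheme.
        """
        u = mu + np.exp(log_std) * rng.standard_normal(mu.shape)
        return low + 0.5 * (np.tanh(u) + 1.0) * (high - low)

    # Hypothetical tail-fin command limited to ±30 degrees:
    # sample_bounded_action(np.zeros(1), np.log(0.3) * np.ones(1), -30.0, 30.0)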

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The combination of Mamba state-space models and group normalization may apply to other partially observable multi-robot tasks such as formation control or search missions.
  • Reward normalization across agents could serve as a general technique to stabilize credit assignment in long-horizon group reinforcement learning problems.
  • Substituting recurrent networks with Mamba policies might reduce inference latency for real-time decisions on resource-limited underwater platforms.

Load-bearing premise

Normalizing rewards across agents within each episode yields stable credit assignment and scalable policy updates without introducing bias that harms long-horizon coordination under partial observability.

What would settle it

Experiments showing that M²GRPO's success rate and capture efficiency are no better than MAPPO's as team size grows or as evaders adopt more complex paths would falsify the claim of consistent outperformance.

Figures

Figures reproduced from arXiv: 2604.19404 by Junwen Gu, Junzhi Yu, Yukai Feng, Zhengxing Wu, Zhiheng Wu.

Figure 1. Overall framework of the proposed M²GRPO algorithm, which consists of three components: (a) CTDE paradigm: centralized training with decentralized execution, where agents share environment information and update in parallel during training, but rely solely on local observations and history for independent decision-making at the execution stage; (b) Mamba policy: a selective state-space architecture that mo…

Figure 3. Illustration of the pursuit–evasion task with two pursuers Pi, Pj and one evader E. Each pursuer is assigned a perception range Rc. The distance between pursuer i and the evader is denoted di,e.

Figure 4. Capture success rate of pursuers under different evader strategies: (i) evader with a learned policy; (ii) evader with a random policy.

Figure 5. Average steps to successful capture under different evader strategies: (i) evader with a learned policy; (ii) evader with a random policy.

Figure 6. The success rate of the pursuit for different numbers of pursuers.

Figure 7. Snapshots of the cooperative pursuit experiment for bionic underwater robots; panels (a) and (b) trace the X and Y positions (m) of the evader and two pursuers over episode time (s).

Figure 8. Positional relationship between the evader and the pursuers.

Figure 9. Distance variation between the evader and the pursuers.
read the original abstract

Traditional policy learning methods in cooperative pursuit face fundamental challenges in biomimetic underwater robots, where long-horizon decision making, partial observability, and inter-robot coordination require both expressiveness and stability. To address these issues, a novel framework called Mamba-based multi-agent group relative policy optimization (M$^{2}$GRPO) is proposed, which integrates a selective state-space Mamba policy with group-relative policy optimization under the centralized-training and decentralized-execution (CTDE) paradigm. Specifically, the Mamba-based policy leverages observation history to capture long-horizon temporal dependencies and exploits attention-based relational features to encode inter-agent interactions, producing bounded continuous actions through normalized Gaussian sampling. To further improve credit assignment without sacrificing stability, the group-relative advantages are obtained by normalizing rewards across agents within each episode and optimized through a multi-agent extension of GRPO, significantly reducing the demand for training resources while enabling stable and scalable policy updates. Extensive simulations and real-world pool experiments across team scales and evader strategies demonstrate that M$^{2}$GRPO consistently outperforms MAPPO and recurrent baselines in both pursuit success rate and capture efficiency. Overall, the proposed framework provides a practical and scalable solution for cooperative underwater pursuit with biomimetic robot systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes M²GRPO, a Mamba-based multi-agent group relative policy optimization framework for cooperative pursuit tasks with biomimetic underwater robots. It integrates selective state-space Mamba policies with attention-based relational features under the CTDE paradigm, computes group-relative advantages by normalizing rewards across agents within each episode, and claims superior pursuit success rates and capture efficiency over MAPPO and recurrent baselines in simulations and real-world pool experiments across varying team scales and evader strategies.

Significance. If the empirical claims hold with proper validation, the integration of Mamba for long-horizon temporal modeling and group-relative normalization for credit assignment could offer a resource-efficient approach to stable multi-agent RL in partially observable robotic settings, addressing coordination challenges in underwater pursuit without excessive training demands.

major comments (2)
  1. [Abstract] The central empirical claim of consistent outperformance in pursuit success rate and capture efficiency is asserted without quantitative metrics, ablation studies, or statistical tests, preventing evaluation of whether the reported gains are load-bearing or attributable to the proposed components.
  2. [Methods (group-relative advantages)] Normalizing rewards across agents within each episode to obtain advantages implicitly assumes comparable per-agent reward distributions despite partial observability, heterogeneous contributions, and Mamba-handled long-horizon dependencies; this risks systematic bias in credit assignment for sparse-reward pursuit and requires explicit comparison to per-agent normalization or variance-preserving alternatives to substantiate the stability and scalability claims.
minor comments (2)
  1. [Abstract] The abstract would benefit from inclusion of specific performance deltas (e.g., success rate improvements) and details on the number of trials or statistical significance to ground the outperformance statements.
  2. [Methods] Notation for the normalized Gaussian sampling of continuous actions and the multi-agent GRPO update rule should be defined with explicit equations for reproducibility.
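
For minor comment 2, one standard way to write the requested pieces, following the GRPO formulation of [27] — the paper's own notation may differ:

    % Group-relative advantage for agent i, from per-agent episode returns R_1, ..., R_N:
    A_i = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{N})}{\operatorname{std}(\{R_j\}_{j=1}^{N}) + \epsilon}

    % Clipped surrogate objective with importance ratio on local histories o_{i,\le t}:
    \rho_{i,t} = \frac{\pi_\theta(a_{i,t} \mid o_{i,\le t})}{\pi_{\theta_{\text{old}}}(a_{i,t} \mid o_{i,\le t})}, \qquad
    J(\theta) = \mathbb{E}\!\left[ \min\!\left( \rho_{i,t} A_i,\ \operatorname{clip}(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon)\, A_i \right) \right]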

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our empirical claims and methodological choices. We address each major comment below with clarifications from the manuscript and commit to targeted revisions.

read point-by-point responses
  1. Referee: [Abstract] The central empirical claim of consistent outperformance in pursuit success rate and capture efficiency is asserted without quantitative metrics, ablation studies, or statistical tests, preventing evaluation of whether the reported gains are load-bearing or attributable to the proposed components.

    Authors: We acknowledge that the abstract states the performance improvements qualitatively. The full manuscript reports quantitative results, including specific success rates, capture efficiencies, ablation studies on the Mamba policy and group-relative components, and statistical significance across simulation and pool experiments in Sections 4 and 5. In the revision we will incorporate key quantitative metrics and references to the ablations and tests into the abstract to make the central claims immediately evaluable. revision: yes

  2. Referee: [Methods (group-relative advantages)] Normalizing rewards across agents within each episode to obtain advantages implicitly assumes comparable per-agent reward distributions despite partial observability, heterogeneous contributions, and Mamba-handled long-horizon dependencies; this risks systematic bias in credit assignment for sparse-reward pursuit and requires explicit comparison to per-agent normalization or variance-preserving alternatives to substantiate the stability and scalability claims.

    Authors: The group-relative normalization is motivated by the cooperative nature of the pursuit task, where agents share a joint objective; it is applied within each episode to reduce advantage variance while preserving relative contributions under CTDE. We recognize that partial observability and role heterogeneity could introduce bias and that direct comparisons would strengthen the stability claims. We will add an ablation study in the revised manuscript comparing group-relative normalization to per-agent normalization and variance-preserving alternatives, reporting the resulting effects on training stability and scalability. revision: yes
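
To make the promised ablation concrete, a sketch of the two schemes at issue (illustrative names; the authors' actual comparison may differ). Group-relative normalization shares one per-episode baseline across the team; the per-agent alternative standardizes each agent against its own history:

    import numpy as np

    def group_norm(returns, eps=1e-8):
        """One baseline for the whole team per episode (the paper's scheme).
        returns : (num_agents,) episode returns."""
        r = np.asarray(returns, dtype=float)
        return (r - r.mean()) / (r.std() + eps)

    def per_agent_norm(return_history, eps=1e-8):
        """Each agent standardized against its own past episodes
        (the alternative the referee asks about).
        return_history : (num_episodes, num_agents)."""
        h = np.asarray(return_history, dtype=float)
        return (h[-1] - h.mean(axis=0)) / (h.std(axis=0) + eps)

Under heterogeneous roles, group_norm pushes below-group-mean agents negative even when they improved on their own history; per_agent_norm avoids that at the cost of a shared baseline — exactly the bias-versus-stability trade the referee flags.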

Circularity Check

0 steps flagged

No significant circularity in M²GRPO framework

full rationale

The paper proposes an algorithmic framework (Mamba policy + group-relative advantages via per-episode reward normalization under CTDE) and validates it empirically via simulations and pool experiments. No first-principles derivation, uniqueness theorem, or fitted quantity is presented as a 'prediction' that reduces by construction to its own inputs. The normalization step is an explicit design choice for credit assignment, not a self-referential result. No self-citations or ansatz smuggling appear in the abstract or described chain. The method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be extracted beyond standard RL assumptions such as Markov decision processes.

pith-pipeline@v0.9.0 · 5535 in / 1111 out tokens · 29280 ms · 2026-05-10T02:16:46.239876+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1]

    Review of research and control technology of underwater bionic robots,

    Z. Cui, L. Li, Y. Wang, Z. Zhong, and J. Li, “Review of research and control technology of underwater bionic robots,” Intell. Mar. Technol. Syst., vol. 1, no. 1, p. 7, 2023

  2. [2]

    A versatile jellyfish-like robotic platform for effective underwater propulsion and manipulation,

    T. Wang, H.-J. Joo, S. Song, W. Hu, C. Keplinger, and M. Sitti, “A versatile jellyfish-like robotic platform for effective underwater propulsion and manipulation,” Sci. Adv., vol. 9, no. 15, p. eadg0292, 2023

  3. [3]

    Bioinspired soft robots for deep-sea exploration,

    G. Li, T.-W. Wong, B. Shih, C. Guo, L. Wang, J. Liu, T. Wang, X. Liu, J. Yan, B. Wu, et al., “Bioinspired soft robots for deep-sea exploration,” Nat. Commun., vol. 14, no. 1, p. 7097, 2023

  4. [4]

    Study on the hydrodynamic performance of a self-propelled robot fish swimming in pipelines environment,

    O. Xie, C. Zhang, C. Shen, Y. Li, and D. Zhou, “Study on the hydrodynamic performance of a self-propelled robot fish swimming in pipelines environment,” Ocean Eng., vol. 309, p. 118356, 2024

  5. [5]

    Agile robotic fish based on direct drive of continuum body,

    K. Iguchi, T. Shimooka, S. Uchikai, Y. Konno, H. Tanaka, Y. Ikemoto, and J. Shintake, “Agile robotic fish based on direct drive of continuum body,” npj Robot., vol. 2, no. 1, p. 7, 2024

  6. [6]

    Implicit coordination for 3D underwater collective behaviors in a fish-inspired robot swarm,

    F. Berlinger, M. Gauci, and R. Nagpal, “Implicit coordination for 3D underwater collective behaviors in a fish-inspired robot swarm,” Sci. Robot., vol. 6, no. 50, p. eabd8668, 2021

  7. [7]

    A survey of autonomous underwater vehicle formation: Performance, formation control, and communication capability,

    Y. Yang, Y. Xiao, and T. Li, “A survey of autonomous underwater vehicle formation: Performance, formation control, and communication capability,” IEEE Commun. Surv. Tutor., vol. 23, no. 2, pp. 815–841, 2021

  8. [8]

    Cooperative artificial intelligence for underwater robotic swarm,

    W. Cai, Z. Liu, M. Zhang, and C. Wang, “Cooperative artificial intelligence for underwater robotic swarm,” Robot. Auton. Syst., vol. 164, p. 104410, 2023

  9. [9]

    Approximate methods for visibility-based pursuit-evasion,

    E. Antonio, I. Becerra, and R. Murrieta-Cid, “Approximate methods for visibility-based pursuit-evasion,” IEEE Trans. Robot., early access, 2024

  10. [10]

    Zero-sum differential game guidance law for missile interception engagement via neuro-dynamic programming,

    A. Xi, Y. Cai, Y. Deng, and H. Jiang, “Zero-sum differential game guidance law for missile interception engagement via neuro-dynamic programming,” Proc. Inst. Mech. Eng., Part G: J. Aerosp. Eng., vol. 237, no. 14, pp. 3352–3366, 2023

  11. [11]

    A novel graph-based motion planner of multi-mobile robot systems with formation and obstacle constraints,

    W. Liu, J. Hu, H. Zhang, M. Y. Wang, and Z. Xiong, “A novel graph-based motion planner of multi-mobile robot systems with formation and obstacle constraints,” IEEE Trans. Robot., vol. 40, pp. 714–728, 2023

  12. [12]

    A visibility-based pursuit-evasion game between two nonholonomic robots in environments with obstacles,

    E. Lozano, I. Becerra, U. Ruiz, L. Bravo, and R. Murrieta-Cid, “A visibility-based pursuit-evasion game between two nonholonomic robots in environments with obstacles,” Auton. Robots, vol. 46, no. 2, pp. 349–371, 2022

  13. [13]

    Multiplayer pursuit-evasion differential games with malicious pursuers,

    Y. Xu, H. Yang, B. Jiang, and M. M. Polycarpou, “Multiplayer pursuit-evasion differential games with malicious pursuers,” IEEE Trans. Autom. Control, vol. 67, no. 9, pp. 4939–4946, 2022

  14. [14]

    Particle swarm optimization algorithm for the optimization of rescue task allocation with uncertain time constraints,

    N. Geng, Z. Chen, Q. A. Nguyen, and D. Gong, “Particle swarm optimization algorithm for the optimization of rescue task allocation with uncertain time constraints,” Complex Intell. Syst., vol. 7, no. 2, pp. 873–890, 2021

  15. [15]

    Comparison of two optimal guidance methods for the long-distance orbital pursuit-evasion game,

    X. Zeng, L. Yang, Y. Zhu, and F. Yang, “Comparison of two optimal guidance methods for the long-distance orbital pursuit-evasion game,” IEEE Trans. Aerosp. Electron. Syst., vol. 57, no. 1, pp. 521–539, 2020

  16. [16]

    Multi-robot cooperative pursuit via potential field-enhanced reinforcement learning,

    Z. Zhang, X. Wang, Q. Zhang, and T. Hu, “Multi-robot cooperative pursuit via potential field-enhanced reinforcement learning,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2022, pp. 8808–8814

  17. [17]

    Multi-target pursuit by a decentralized heterogeneous UAV swarm using deep multi-agent reinforcement learning,

    M. Kouzeghar, Y. Song, M. Meghjani, and R. Bouffanais, “Multi-target pursuit by a decentralized heterogeneous UAV swarm using deep multi-agent reinforcement learning,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2023, pp. 3289–3295

  18. [18]

    Decentralized multi-agent pursuit using deep reinforcement learning,

    C. De Souza, R. Newbury, A. Cosgun, P. Castillo, B. Vidolov, and D. Kulić, “Decentralized multi-agent pursuit using deep reinforcement learning,” IEEE Robot. Autom. Lett., vol. 6, no. 3, pp. 4552–4559, 2021

  19. [19]

    An improved approach towards multi-agent pursuit–evasion game decision-making using deep reinforcement learning,

    K. Wan, D. Wu, Y. Zhai, B. Li, X. Gao, and Z. Hu, “An improved approach towards multi-agent pursuit–evasion game decision-making using deep reinforcement learning,” Entropy, vol. 23, no. 11, p. 1433, 2021

  20. [20]

    Large scale pursuit-evasion under collision avoidance using deep reinforcement learning,

    H. Yang, P. Ge, J. Cao, Y. Yang, and Y. Liu, “Large scale pursuit-evasion under collision avoidance using deep reinforcement learning,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2023, pp. 2232–2239

  21. [21]

    Distributed pursuit–evasion game decision-making based on multi-agent deep reinforcement learning,

    Y. Lin, H. Gao, and Y. Xia, “Distributed pursuit–evasion game decision-making based on multi-agent deep reinforcement learning,” Electronics, vol. 14, no. 11, p. 2141, 2025

  22. [22]

    Recurrent prediction model for partially observable MDPs,

    S. Xie, Z. Zhang, H. Yu, and X. Luo, “Recurrent prediction model for partially observable MDPs,” Inf. Sci., vol. 620, pp. 125–141, 2023

  23. [23]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023

  24. [24]

    MARL-MambaContour: Unleashing multi-agent deep reinforcement learning for active contour optimization in medical image segmentation,

    R. Zhang, Y. Sun, Z. Zhang, J. Li, X. Liu, A. H. Fan, H. Guo, and P. Yan, “MARL-MambaContour: Unleashing multi-agent deep reinforcement learning for active contour optimization in medical image segmentation,” arXiv preprint arXiv:2506.18679, 2025

  25. [25]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    T. Dao and A. Gu, “Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality,” arXiv preprint arXiv:2405.21060, 2024

  26. [26]

    Chemical language modeling with structured state space sequence models,

    R. Özçelik, S. de Ruiter, E. Criscuolo, and F. Grisoni, “Chemical language modeling with structured state space sequence models,” Nat. Commun., vol. 15, no. 1, p. 6176, 2024

  27. [27]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, et al., “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv:2402.03300, 2024

  28. [28]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al., “DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025

  29. [29]

    Reinforcement learning with verifiable rewards: GRPO’s effective loss, dynamics, and success amplification,

    Y. Mroueh, “Reinforcement learning with verifiable rewards: GRPO’s effective loss, dynamics, and success amplification,” arXiv preprint arXiv:2503.06639, 2025

  30. [30]

    Real-world learning control for autonomous exploration of a biomimetic robotic shark,

    S. Yan, Z. Wu, J. Wang, Y. Huang, M. Tan, and J. Yu, “Real-world learning control for autonomous exploration of a biomimetic robotic shark,” IEEE Trans. Ind. Electron., vol. 70, no. 4, pp. 3966–3974, 2022

  31. [31]

    Decentralized multirobotic fish pursuit control with attraction-enhanced reinforcement learning,

    Y. Feng, Z. Wu, J. Wang, J. Gu, F. Yu, J. Yu, and M. Tan, “Decentralized multirobotic fish pursuit control with attraction-enhanced reinforcement learning,” IEEE Trans. Ind. Electron., vol. 72, no. 8, pp. 8290–8300, 2025

  32. [32]

    Cooperative pursuit policy for bionic underwater robot based on MARL-MHSA architecture: Data-driven modeling and distributed strategy optimization,

    Y.-K. Feng, Z.-X. Wu, and M. Tan, “Cooperative pursuit policy for bionic underwater robot based on MARL-MHSA architecture: Data-driven modeling and distributed strategy optimization,” Acta Autom. Sin., vol. 51, no. 9, pp. 1001–1014, 2025