M²GRPO: Mamba-based Multi-Agent Group Relative Policy Optimization for Biomimetic Underwater Robots Pursuit
Pith reviewed 2026-05-10 02:16 UTC · model grok-4.3
The pith
A Mamba-based multi-agent policy optimization method raises pursuit success and efficiency for biomimetic underwater robot teams.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that integrating a Mamba policy—which uses observation history to capture temporal dependencies and attention-based features to encode inter-agent interactions—with group-relative policy optimization under the CTDE paradigm produces higher pursuit success rates and capture efficiency than MAPPO or recurrent baselines, as shown by extensive simulations and real-world pool experiments across team sizes and evader strategies.
What carries the argument
The selective state-space Mamba policy with attention-based relational encoding, paired with group-relative advantage normalization that computes advantages by averaging rewards across agents within an episode.
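The review does not reproduce the exact formula; a minimal sketch of one plausible reading, standardizing each agent's episode return against the group's mean within that episode, as in GRPO-style group baselines (the function name and epsilon guard are illustrative, not from the paper):

```python
import numpy as np

def group_relative_advantages(episode_rewards):
    # Standardize each agent's episode return against the group's
    # mean and std for that episode (GRPO-style group baseline).
    r = np.asarray(episode_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards the zero-variance case

adv = group_relative_advantages([3.0, 1.0, 2.0])
```

By construction the advantages sum to zero across the group, so an agent is credited only for doing better or worse than its teammates in the same episode.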
If this is right
- Enables stable policy updates with lower training resource needs for multi-agent coordination.
- Produces bounded continuous actions through normalized Gaussian sampling suitable for robot actuators.
- Maintains performance gains across varying team scales and different evader behaviors in both simulation and physical tests.
- Supports decentralized execution after centralized training for practical deployment on individual robots.
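A minimal sketch of how bounded continuous actions could be produced from Gaussian samples — here a tanh squash rescaled into actuator limits; the paper's exact "normalized Gaussian sampling" may differ in detail, and the limits below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def bounded_action(mu, log_std, low, high):
    # Sample from a Gaussian, squash with tanh into (-1, 1),
    # then rescale into the actuator range [low, high].
    z = rng.normal(mu, np.exp(log_std))
    return low + 0.5 * (np.tanh(z) + 1.0) * (high - low)

a = bounded_action(mu=0.2, log_std=-1.0, low=-0.5, high=0.5)  # e.g. a fin angle in rad
```

Because tanh is bounded, the output respects the limits for any Gaussian sample, so no clipping step is needed.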
Where Pith is reading between the lines
- The combination of Mamba state-space models and group normalization may apply to other partially observable multi-robot tasks such as formation control or search missions.
- Reward normalization across agents could serve as a general technique to stabilize credit assignment in long-horizon group reinforcement learning problems.
- Substituting recurrent networks with Mamba policies might reduce inference latency for real-time decisions on resource-limited underwater platforms.
Load-bearing premise
Normalizing rewards across agents within each episode yields stable credit assignment and scalable policy updates without introducing bias that harms long-horizon coordination under partial observability.
What would settle it
Experiments showing that M²GRPO achieves success rates and capture efficiency no higher than MAPPO's when team size grows or evaders adopt more complex paths would falsify the claim of consistent outperformance.
Original abstract
Traditional policy learning methods in cooperative pursuit face fundamental challenges in biomimetic underwater robots, where long-horizon decision making, partial observability, and inter-robot coordination require both expressiveness and stability. To address these issues, a novel framework called Mamba-based multi-agent group relative policy optimization (M$^{2}$GRPO) is proposed, which integrates a selective state-space Mamba policy with group-relative policy optimization under the centralized-training and decentralized-execution (CTDE) paradigm. Specifically, the Mamba-based policy leverages observation history to capture long-horizon temporal dependencies and exploits attention-based relational features to encode inter-agent interactions, producing bounded continuous actions through normalized Gaussian sampling. To further improve credit assignment without sacrificing stability, the group-relative advantages are obtained by normalizing rewards across agents within each episode and optimized through a multi-agent extension of GRPO, significantly reducing the demand for training resources while enabling stable and scalable policy updates. Extensive simulations and real-world pool experiments across team scales and evader strategies demonstrate that M$^{2}$GRPO consistently outperforms MAPPO and recurrent baselines in both pursuit success rate and capture efficiency. Overall, the proposed framework provides a practical and scalable solution for cooperative underwater pursuit with biomimetic robot systems.
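The selective state-space recurrence at the heart of Mamba can be sketched compactly. The scalar input, softplus discretization, and shapes below are illustrative simplifications, not the paper's architecture: the "selective" mechanism is that the step size and input matrix depend on the current input.

```python
import numpy as np

def selective_ssm(x, A, w_delta, w_B, C):
    # x: (T,) scalar observation sequence; A: (N,) negative diagonal decay.
    # delta and B are recomputed from the current input at every step,
    # which is the selective mechanism.
    h = np.zeros(A.shape[0])
    ys = []
    for xt in x:
        delta = np.log1p(np.exp(w_delta * xt))      # softplus keeps the step positive
        B = w_B * xt                                # input-dependent input vector
        h = np.exp(delta * A) * h + delta * B * xt  # discretized linear recurrence
        ys.append(float(C @ h))
    return np.array(ys)

y = selective_ssm(np.array([1.0, 0.5, -0.3]),
                  A=-np.ones(4), w_delta=1.0,
                  w_B=0.5 * np.ones(4), C=np.ones(4))
```

Because the recurrence is linear in the hidden state, inference cost per step is constant in sequence length, which is the property that makes such policies attractive for long observation histories.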
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes M²GRPO, a Mamba-based multi-agent group relative policy optimization framework for cooperative pursuit tasks with biomimetic underwater robots. It integrates selective state-space Mamba policies with attention-based relational features under the CTDE paradigm, computes group-relative advantages by normalizing rewards across agents within each episode, and claims superior pursuit success rates and capture efficiency over MAPPO and recurrent baselines in simulations and real-world pool experiments across varying team scales and evader strategies.
Significance. If the empirical claims hold with proper validation, the integration of Mamba for long-horizon temporal modeling and group-relative normalization for credit assignment could offer a resource-efficient approach to stable multi-agent RL in partially observable robotic settings, addressing coordination challenges in underwater pursuit without excessive training demands.
Major comments (2)
- [Abstract] Abstract: the central empirical claim of consistent outperformance in pursuit success rate and capture efficiency is asserted without any quantitative metrics, ablation studies, or statistical tests, preventing evaluation of whether the reported gains are load-bearing or attributable to the proposed components.
- [Methods (group-relative advantages)] Description of group-relative advantages: normalizing rewards across agents within each episode to obtain advantages implicitly assumes comparable per-agent reward distributions despite partial observability, heterogeneous contributions, and Mamba-handled long-horizon dependencies; this risks systematic bias in credit assignment for sparse-reward pursuit, and requires explicit comparison to per-agent normalization or variance-preserving alternatives to substantiate stability and scalability claims.
Minor comments (2)
- [Abstract] The abstract would benefit from inclusion of specific performance deltas (e.g., success rate improvements) and details on the number of trials or statistical significance to ground the outperformance statements.
- [Methods] Notation for the normalized Gaussian sampling of continuous actions and the multi-agent GRPO update rule should be defined with explicit equations for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our empirical claims and methodological choices. We address each major comment below with clarifications from the manuscript and commit to targeted revisions.
Point-by-point responses
-
Referee: [Abstract] Abstract: the central empirical claim of consistent outperformance in pursuit success rate and capture efficiency is asserted without any quantitative metrics, ablation studies, or statistical tests, preventing evaluation of whether the reported gains are load-bearing or attributable to the proposed components.
Authors: We acknowledge that the abstract states the performance improvements qualitatively. The full manuscript reports quantitative results, including specific success rates, capture efficiencies, ablation studies on the Mamba policy and group-relative components, and statistical significance across simulation and pool experiments in Sections 4 and 5. In the revision we will incorporate key quantitative metrics and references to the ablations and tests into the abstract to make the central claims immediately evaluable. revision: yes
-
Referee: [Methods (group-relative advantages)] Description of group-relative advantages: normalizing rewards across agents within each episode to obtain advantages implicitly assumes comparable per-agent reward distributions despite partial observability, heterogeneous contributions, and Mamba-handled long-horizon dependencies; this risks systematic bias in credit assignment for sparse-reward pursuit, and requires explicit comparison to per-agent normalization or variance-preserving alternatives to substantiate stability and scalability claims.
Authors: The group-relative normalization is motivated by the cooperative nature of the pursuit task, where agents share a joint objective; it is applied within each episode to reduce advantage variance while preserving relative contributions under CTDE. We recognize that partial observability and role heterogeneity could introduce bias and that direct comparisons would strengthen the stability claims. We will add an ablation study in the revised manuscript comparing group-relative normalization to per-agent normalization and variance-preserving alternatives, reporting the resulting effects on training stability and scalability. revision: yes
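The two baselines in the proposed ablation differ only in the axis of normalization; a compact sketch with hypothetical per-agent returns (shapes and data are illustrative, not from the paper):

```python
import numpy as np

def group_norm(R):
    # Normalize across agents within each episode (rows: episodes).
    return (R - R.mean(axis=1, keepdims=True)) / (R.std(axis=1, keepdims=True) + 1e-8)

def per_agent_norm(R):
    # Alternative baseline: normalize each agent's returns across episodes.
    return (R - R.mean(axis=0, keepdims=True)) / (R.std(axis=0, keepdims=True) + 1e-8)

R = np.array([[3.0, 1.0, 2.0],
              [0.0, 2.0, 4.0]])  # (episodes, agents), hypothetical returns
G, P = group_norm(R), per_agent_norm(R)
```

Group normalization centers each episode's advantages at zero across teammates, while per-agent normalization centers each agent's advantages at zero across episodes; the referee's concern is precisely which centering biases credit assignment less under heterogeneous roles.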
Circularity Check
No significant circularity in M²GRPO framework
Full rationale
The paper proposes an algorithmic framework (Mamba policy + group-relative advantages via per-episode reward normalization under CTDE) and validates it empirically via simulations and pool experiments. No first-principles derivation, uniqueness theorem, or fitted quantity is presented as a 'prediction' that reduces by construction to its own inputs. The normalization step is an explicit design choice for credit assignment, not a self-referential result. No self-citations or ansatz smuggling appear in the abstract or described chain. The method is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Z. Cui, L. Li, Y. Wang, Z. Zhong, and J. Li, “Review of research and control technology of underwater bionic robots,” Intell. Mar. Technol. Syst., vol. 1, no. 1, p. 7, 2023.
- [2] T. Wang, H.-J. Joo, S. Song, W. Hu, C. Keplinger, and M. Sitti, “A versatile jellyfish-like robotic platform for effective underwater propulsion and manipulation,” Sci. Adv., vol. 9, no. 15, p. eadg0292, 2023.
- [3] G. Li, T.-W. Wong, B. Shih, C. Guo, L. Wang, J. Liu, T. Wang, X. Liu, J. Yan, B. Wu, et al., “Bioinspired soft robots for deep-sea exploration,” Nat. Commun., vol. 14, no. 1, p. 7097, 2023.
- [4] O. Xie, C. Zhang, C. Shen, Y. Li, and D. Zhou, “Study on the hydrodynamic performance of a self-propelled robot fish swimming in pipelines environment,” Ocean Eng., vol. 309, p. 118356, 2024.
- [5] K. Iguchi, T. Shimooka, S. Uchikai, Y. Konno, H. Tanaka, Y. Ikemoto, and J. Shintake, “Agile robotic fish based on direct drive of continuum body,” npj Robot., vol. 2, no. 1, p. 7, 2024.
- [6] F. Berlinger, M. Gauci, and R. Nagpal, “Implicit coordination for 3D underwater collective behaviors in a fish-inspired robot swarm,” Sci. Robot., vol. 6, no. 50, p. eabd8668, 2021.
- [7] Y. Yang, Y. Xiao, and T. Li, “A survey of autonomous underwater vehicle formation: Performance, formation control, and communication capability,” IEEE Commun. Surv. Tutor., vol. 23, no. 2, pp. 815–841, 2021.
- [8] W. Cai, Z. Liu, M. Zhang, and C. Wang, “Cooperative artificial intelligence for underwater robotic swarm,” Robot. Auton. Syst., vol. 164, p. 104410, 2023.
- [9] E. Antonio, I. Becerra, and R. Murrieta-Cid, “Approximate methods for visibility-based pursuit-evasion,” IEEE Trans. Robot., early access, 2024.
- [10] A. Xi, Y. Cai, Y. Deng, and H. Jiang, “Zero-sum differential game guidance law for missile interception engagement via neuro-dynamic programming,” Proc. Inst. Mech. Eng., Part G: J. Aerosp. Eng., vol. 237, no. 14, pp. 3352–3366, 2023.
- [11] W. Liu, J. Hu, H. Zhang, M. Y. Wang, and Z. Xiong, “A novel graph-based motion planner of multi-mobile robot systems with formation and obstacle constraints,” IEEE Trans. Robot., vol. 40, pp. 714–728, 2023.
- [12] E. Lozano, I. Becerra, U. Ruiz, L. Bravo, and R. Murrieta-Cid, “A visibility-based pursuit-evasion game between two nonholonomic robots in environments with obstacles,” Auton. Robots, vol. 46, no. 2, pp. 349–371, 2022.
- [13] Y. Xu, H. Yang, B. Jiang, and M. M. Polycarpou, “Multiplayer pursuit-evasion differential games with malicious pursuers,” IEEE Trans. Autom. Control, vol. 67, no. 9, pp. 4939–4946, 2022.
- [14] N. Geng, Z. Chen, Q. A. Nguyen, and D. Gong, “Particle swarm optimization algorithm for the optimization of rescue task allocation with uncertain time constraints,” Complex Intell. Syst., vol. 7, no. 2, pp. 873–890, 2021.
- [15] X. Zeng, L. Yang, Y. Zhu, and F. Yang, “Comparison of two optimal guidance methods for the long-distance orbital pursuit-evasion game,” IEEE Trans. Aerosp. Electron. Syst., vol. 57, no. 1, pp. 521–539, 2020.
- [16] Z. Zhang, X. Wang, Q. Zhang, and T. Hu, “Multi-robot cooperative pursuit via potential field-enhanced reinforcement learning,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2022, pp. 8808–8814.
- [17] M. Kouzeghar, Y. Song, M. Meghjani, and R. Bouffanais, “Multi-target pursuit by a decentralized heterogeneous UAV swarm using deep multi-agent reinforcement learning,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2023, pp. 3289–3295.
- [18] C. De Souza, R. Newbury, A. Cosgun, P. Castillo, B. Vidolov, and D. Kulić, “Decentralized multi-agent pursuit using deep reinforcement learning,” IEEE Robot. Autom. Lett., vol. 6, no. 3, pp. 4552–4559, 2021.
- [19] K. Wan, D. Wu, Y. Zhai, B. Li, X. Gao, and Z. Hu, “An improved approach towards multi-agent pursuit–evasion game decision-making using deep reinforcement learning,” Entropy, vol. 23, no. 11, p. 1433, 2021.
- [20] H. Yang, P. Ge, J. Cao, Y. Yang, and Y. Liu, “Large scale pursuit-evasion under collision avoidance using deep reinforcement learning,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2023, pp. 2232–2239.
- [21] Y. Lin, H. Gao, and Y. Xia, “Distributed pursuit–evasion game decision-making based on multi-agent deep reinforcement learning,” Electronics, vol. 14, no. 11, p. 2141, 2025.
- [22] S. Xie, Z. Zhang, H. Yu, and X. Luo, “Recurrent prediction model for partially observable MDPs,” Inf. Sci., vol. 620, pp. 125–141, 2023.
- [23] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.
- [24] R. Zhang, Y. Sun, Z. Zhang, J. Li, X. Liu, A. H. Fan, H. Guo, and P. Yan, “MARL-MambaContour: Unleashing multi-agent deep reinforcement learning for active contour optimization in medical image segmentation,” arXiv preprint arXiv:2506.18679, 2025.
- [25] T. Dao and A. Gu, “Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality,” arXiv preprint arXiv:2405.21060, 2024.
- [26] R. Özçelik, S. de Ruiter, E. Criscuolo, and F. Grisoni, “Chemical language modeling with structured state space sequence models,” Nat. Commun., vol. 15, no. 1, p. 6176, 2024.
- [27] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, et al., “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv:2402.03300, 2024.
- [28] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al., “DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025.
- [29] Y. Mroueh, “Reinforcement learning with verifiable rewards: GRPO’s effective loss, dynamics, and success amplification,” arXiv preprint arXiv:2503.06639, 2025.
- [30] S. Yan, Z. Wu, J. Wang, Y. Huang, M. Tan, and J. Yu, “Real-world learning control for autonomous exploration of a biomimetic robotic shark,” IEEE Trans. Ind. Electron., vol. 70, no. 4, pp. 3966–3974, 2022.
- [31] Y. Feng, Z. Wu, J. Wang, J. Gu, F. Yu, J. Yu, and M. Tan, “Decentralized multirobotic fish pursuit control with attraction-enhanced reinforcement learning,” IEEE Trans. Ind. Electron., vol. 72, no. 8, pp. 8290–8300, 2025.
- [32] Y.-K. Feng, Z.-X. Wu, and M. Tan, “Cooperative pursuit policy for bionic underwater robot based on MARL-MHSA architecture: Data-driven modeling and distributed strategy optimization,” Acta Autom. Sin., vol. 51, no. 9, pp. 1001–1014, 2025.