arxiv: 2506.00982 · v3 · pith:FO3WDX5Ynew · submitted 2025-06-01 · 💻 cs.RO · cs.MA

Robust and Safe Multi-Agent Reinforcement Learning with Communication for Autonomous Vehicles: From Simulation to Hardware

Pith reviewed 2026-05-19 11:15 UTC · model grok-4.3

classification 💻 cs.RO cs.MA
keywords multi-agent reinforcement learning · autonomous vehicles · sim-to-real transfer · vehicle-to-vehicle communication · control barrier functions · robust learning · safety shields · hardware experiments

The pith

A MARL framework trains driving policies in simulation and transfers them directly to physical vehicles while adding safety shields.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multi-agent reinforcement learning policies for autonomous vehicles can be formulated with state and action representations that explicitly account for physical system complexities, trained robustly in simulation, and then transferred zero-shot to hardware. It incorporates vehicle-to-vehicle communication for shared information and uses Control Barrier Functions as modular safety shields to enforce guarantees during both training and deployment. A sympathetic reader would care because this addresses the persistent sim-to-real gap and safety concerns that have limited learning-based methods in real multi-robot systems, potentially allowing safer coordinated driving without extensive real-world retraining.

Core claim

RSR-RSMARL is a Robust and Safe MARL framework that supports Real-Sim-Real policy adaptation for multi-agent systems with communication among agents. It leverages state representations that include shared information among agents and action representations that consider real system complexities. The policy is trained with a robust MARL algorithm to enable zero-shot transfer to hardware despite the sim-to-real gap. A safety shield module using Control Barrier Functions provides safety guarantees for each individual agent. Experiments on 1/10th-scale autonomous vehicles with V2V communication show that the framework enhances driving safety and coordination across multiple configurations.

What carries the argument

The RSR-RSMARL framework, which combines robust MARL training, state and action representations that include shared V2V information and real-system details, Real-Sim-Real adaptation, and modular Control Barrier Function safety shields to support zero-shot hardware transfer.
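
As a rough illustration of the safety-shield idea, a CBF shield is commonly posed as a minimum-deviation quadratic program: stay as close as possible to the policy's action u_ref while satisfying one barrier constraint, the same form as the passage quoted in the Lean theorems section of this page. With a single affine constraint that program has a closed-form solution, sketched below in Python. The function name `cbf_filter`, the 1-D single-integrator example, and all numeric values are assumptions for illustration and are not taken from the paper.

```python
def cbf_filter(u_ref, lf_h, lg_h, h, gamma, dh_dt=0.0):
    """Closed-form solution of
        min_u 0.5 * ||u - u_ref||^2
        s.t.  dh_dt + lf_h + lg_h . u >= -gamma * h
    for a single barrier constraint (hypothetical helper, not the paper's code).
    """
    # Rewrite the constraint as a . u >= b.
    a = list(lg_h)
    b = -gamma * h - lf_h - dh_dt
    slack = sum(ai * ui for ai, ui in zip(a, u_ref)) - b
    if slack >= 0.0:
        # The reference action is already safe: pass it through unchanged.
        return list(u_ref)
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("the control input cannot influence the barrier constraint")
    # Project u_ref onto the boundary of the half-space a . u >= b.
    scale = -slack / norm_sq
    return [ui + scale * ai for ui, ai in zip(u_ref, a)]


# Assumed toy setup: 1-D single integrator x_dot = u approaching an obstacle.
# h = gap - safe_distance, so Lf h = 0 and Lg h = [-1].
print(cbf_filter(u_ref=[2.0], lf_h=0.0, lg_h=[-1.0], h=0.5, gamma=1.0))  # [0.5]
```

In this toy case the constraint reduces to u ≤ γh, so the requested speed of 2.0 is clipped to 0.5, while any reference action at or below 0.5 is returned unchanged. Multiple simultaneous constraints would need a general QP solver rather than this projection.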

If this is right

  • Multi-agent vehicle teams can maintain individual safety guarantees while using shared communication to improve overall coordination.
  • Zero-shot transfer from simulation becomes feasible for MARL policies when representations are designed around physical complexities rather than idealized models.
  • Safety shields based on Control Barrier Functions can be added modularly without retraining the core policy for hardware use.
  • The same framework supports testing across varied team sizes and scenarios once the representations and training are fixed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach might scale to full-size vehicles if the state representations are adjusted for higher speeds and longer communication ranges.
  • Similar combinations of robust training and barrier-function shields could apply to other multi-agent domains such as drone coordination or warehouse robots.
  • If communication is intermittent, the framework's reliance on shared states would need explicit robustness extensions that the current experiments do not test.
  • The method could be combined with online adaptation modules to handle larger distribution shifts not seen in the 1/10-scale tests.

Load-bearing premise

State and action representations that capture real system complexities, together with robust training, are enough to overcome sim-to-real discrepancies and model uncertainties so that the policies work directly on physical hardware.

What would settle it

Deploy the simulator-trained policies on the 1/10th-scale vehicles without any fine-tuning and observe whether safety or coordination breaks down in the presence of communication delays, model uncertainties, or dynamic obstacles.

Figures

Figures reproduced from arXiv: 2506.00982 by Ehsan Sabouni, Fei Miao, H M Sabbir Ahmad, Keshawn Smith, Mainak Mondal, Song Han, Wenchao Li, Zhili Zhang.

Figure 1: RSR-RSMARL Framework Breakdown. The figure showcases the pipeline of the Real …
Figure 2: The figure illustrates the hardware policy execution stage pipeline of an agent, with all …
Figure 3: Discounted Efficiency Returns during Training
Figure 4: CARLA Simulation: Successful Merging and Lane Change
Figure 5: CARLA Simulation: Failure Case with Rear-End Collision
Figure 6: Real-World Environment Setting
    A.7.1 Communication Framework: Hardware and Software. Each vehicle in the fleet was equipped with an onboard Jetson Orin Nano (8GB) running ROS1 Noetic on Ubuntu 20.04. For real-world deployment, we rely on a communication infrastructure built upon the Robot Operating System (ROS 1). ROS provides a publish-subscribe messaging architecture that enables real-time data exchange …
Figure 7: Real-World Test: Lane Following and Obstacle Avoidance
Figure 8: Real-World Test: Drift Caused by Abrupt Control Perturbation
Figure 9: CBF Intervention Frequency: With vs. Without V2V Communication
Original abstract

Deep multi-agent reinforcement learning (MARL) has been demonstrated effectively in simulations for multi-robot problems. For autonomous vehicles, the development of vehicle-to-vehicle (V2V) communication technologies provide opportunities to further enhance system safety. However, zero-shot transfer of simulator-trained MARL policies to dynamic hardware systems remains challenging, and how to leverage communication and shared information for MARL has limited demonstrations on hardware. This problem is challenged by discrepancies between simulated and physical states, system state and model uncertainties, practical shared information design, and the need for safety guarantees in both simulation and hardware. This paper designs RSR-RSMARL, a novel Robust and Safe MARL framework that supports Real-Sim-Real (RSR) policy adaptation for multi-agent systems with communication among agents, with both simulation and hardware demonstrations. RSR-RSMARL leverages state (includes shared state information among agents) and action representations considering real system complexities for MARL formulation. The MARL policy is trained with robust MARL algorithm to enable zero-shot transfer to hardware considering the sim-to-real gap. A safety shield module using Control Barrier Functions (CBFs) provides safety guarantee for each individual agent. Experimental results on 1/10th-scale autonomous vehicles with V2V communication demonstrate the ability of RSR-RSMARL framework to enhance driving safety and coordination across multiple configurations. These findings emphasize the importance of jointly designing robust policy representations and modular safety architectures to enable scalable, generalizable RSR transfer in multi-agent autonomy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes RSR-RSMARL, a novel Robust and Safe Multi-Agent Reinforcement Learning framework with V2V communication for autonomous vehicles. It enables Real-Sim-Real (RSR) policy adaptation by designing state (including shared information) and action representations that account for real-system complexities, training via a robust MARL algorithm for zero-shot hardware transfer, and adding a Control Barrier Function (CBF) safety shield per agent. The central claim is that this yields enhanced driving safety and coordination, supported by both simulation results and hardware experiments on 1/10th-scale vehicles across multiple configurations.

Significance. If the hardware results hold with quantitative support, the work would be significant for multi-agent autonomy: it directly tackles sim-to-real transfer, communication design, and safety in a single modular architecture. The combination of representation choices, robust training, and CBF shielding offers a concrete path toward deployable MARL policies on physical vehicles, which remains rare in the literature.

major comments (2)
  1. [Abstract and Section 4] The manuscript asserts that 'Experimental results on 1/10th-scale autonomous vehicles with V2V communication demonstrate the ability of RSR-RSMARL framework to enhance driving safety and coordination,' yet supplies no quantitative metrics (success rates, collision counts, trajectory error, or statistical significance), no baselines, and no error analysis or training hyperparameters. This absence directly undermines the central empirical claim of effective zero-shot transfer.
  2. [Section 4 and RSR adaptation description] No explicit quantification of the sim-to-real gap is provided (e.g., Wasserstein distance between state distributions, actuator latency mismatch, or sensor noise statistics). Without these measurements it is impossible to determine whether any observed hardware performance arises from the chosen state/action representations or from unstated environment simplifications or CBF intervention. This is load-bearing for the zero-shot guarantee.
minor comments (1)
  1. [Abstract] The expansion of the acronym RSR-RSMARL is not stated on first use; adding '(Robust and Safe Real-Sim-Real Multi-Agent Reinforcement Learning)' would improve readability.
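
The kind of summary the first major comment asks for can be sketched in a few lines. The Python below computes a success rate, a collision count, and a Wilson score interval for the success rate from per-trial outcomes. The `Trial` record, its field names, and the sample data are hypothetical; they stand in for hardware trial logs that are not part of this page.

```python
import math
from dataclasses import dataclass


@dataclass
class Trial:
    success: bool    # hypothetical field: scenario completed without failure
    collisions: int  # hypothetical field: contacts recorded during the trial


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("need at least one trial")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def summarize(trials):
    """Aggregate per-trial outcomes into the metrics a results table would report."""
    n = len(trials)
    successes = sum(1 for t in trials if t.success)
    return {
        "trials": n,
        "success_rate": successes / n,
        "collisions": sum(t.collisions for t in trials),
        "ci95": wilson_interval(successes, n),
    }


# Made-up example data: 8 clean runs and 2 runs with one collision each.
demo = [Trial(True, 0)] * 8 + [Trial(False, 1)] * 2
print(summarize(demo))
```

With only ten trials the interval is wide, which is one reason a referee would want trial counts reported alongside any success rate.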

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We value the constructive criticism regarding the presentation of our hardware experiments and the quantification of the sim-to-real gap. We believe these points can be addressed through targeted revisions and additional analysis, which we outline below.

Point-by-point responses
  1. Referee: [Abstract and Section 4] The manuscript asserts that 'Experimental results on 1/10th-scale autonomous vehicles with V2V communication demonstrate the ability of RSR-RSMARL framework to enhance driving safety and coordination,' yet supplies no quantitative metrics (success rates, collision counts, trajectory error, or statistical significance), no baselines, and no error analysis or training hyperparameters. This absence directly undermines the central empirical claim of effective zero-shot transfer.

    Authors: We acknowledge the validity of this observation. The current manuscript emphasizes qualitative demonstrations and figures in Section 4 to illustrate the hardware performance. In the revision, we will incorporate quantitative metrics such as success rates, number of collisions, trajectory errors with standard deviations, and p-values for statistical significance. Baselines including non-communicative MARL and MARL without CBF will be added, along with a table summarizing hyperparameters and error analysis. This will provide the necessary quantitative support for the zero-shot transfer claims.

    revision: yes

  2. Referee: [Section 4 and RSR adaptation description] No explicit quantification of the sim-to-real gap is provided (e.g., Wasserstein distance between state distributions, actuator latency mismatch, or sensor noise statistics). Without these measurements it is impossible to determine whether any observed hardware performance arises from the chosen state/action representations or from unstated environment simplifications or CBF intervention. This is load-bearing for the zero-shot guarantee.

    Authors: We agree that providing explicit measures of the sim-to-real gap would enhance the rigor of our claims. We will revise Section 4 to include an analysis of the sim-to-real discrepancies, such as statistical comparisons of state distributions (including Wasserstein distance where applicable), measured actuator latencies, and sensor noise levels from the hardware setup. We will also clarify how the designed state and action representations mitigate these gaps and evaluate the contribution of the CBF safety shield through ablation studies. These additions will better justify the zero-shot transfer performance.

    revision: yes
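
One of the gap measures named in this exchange, the Wasserstein distance between simulated and real state distributions, is simple to compute for a single scalar state variable. The sketch below is a minimal Python version for two equal-sized 1-D samples, where the distance reduces to the mean absolute difference between sorted values. The function name and the sample speeds are assumptions for illustration, not measurements from the paper.

```python
def wasserstein_1d(sim_samples, real_samples):
    """Empirical 1-Wasserstein distance between two equal-sized 1-D samples.

    For equal sample sizes the optimal coupling matches sorted values,
    so the distance is the mean absolute difference of the order statistics.
    """
    if not sim_samples or len(sim_samples) != len(real_samples):
        raise ValueError("samples must be non-empty and of equal size")
    sim_sorted = sorted(sim_samples)
    real_sorted = sorted(real_samples)
    total = sum(abs(s - r) for s, r in zip(sim_sorted, real_sorted))
    return total / len(sim_sorted)


# Made-up longitudinal speeds (m/s) logged in simulation and on hardware.
sim_speed = [0.0, 1.0, 2.0]
real_speed = [0.5, 1.5, 2.5]
print(wasserstein_1d(sim_speed, real_speed))  # 0.5
```

Unequal sample sizes or multi-dimensional states need a more general estimator; this version only covers the simplest per-variable comparison.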

Circularity Check

0 steps flagged

No significant circularity; framework and transfer claims rest on external hardware validation

full rationale

The paper introduces RSR-RSMARL as a design combining state/action representations that incorporate real-system complexities, robust MARL training, and a modular CBF safety shield. Central claims are validated by direct hardware experiments on 1/10-scale vehicles with V2V communication across multiple configurations. No equations, fitted parameters, or results are shown to reduce by construction to quantities defined within the same experiment. No load-bearing self-citation chains or uniqueness theorems imported from prior author work appear in the derivation. The sim-to-real transfer is presented as an empirical outcome of the chosen representations and robust training rather than a tautological re-statement of inputs. This qualifies as self-contained against external benchmarks (hardware runs), warranting score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters or invented entities; relies on standard domain assumptions about CBF safety and sim-to-real transfer feasibility.

axioms (1)
  • domain assumption: Control barrier functions provide per-agent safety guarantees in both simulation and hardware
    Invoked for the safety shield module in the abstract.

pith-pipeline@v0.9.0 · 5831 in / 1155 out tokens · 49742 ms · 2026-05-19T11:15:46.444624+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    RSR-RSMARL leverages state (includes shared state information among agents) and action representations considering real system complexities for MARL formulation. The MARL policy is trained with robust MARL algorithm to enable zero-shot transfer to hardware considering the sim-to-real gap. A safety shield module using Control Barrier Functions (CBFs) provides safety guarantee for each individual agent.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    We adopt the kinematic bicycle model ... The CBF is the additional safety constraint ... $\min_{u}\ \tfrac{1}{2}\lVert u - u_{\mathrm{ref}}\rVert^{2} \quad \text{s.t.} \quad \frac{\partial h}{\partial t} + L_f h + L_g h\,u \ge -\gamma h$

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 1 internal anchor

  1. [1]

    U. I. J. P. Office. Saving lives with connectivity: A plan to accelerate v2x deployment non-binding contents, 2024

  2. [2]

    Z. Zhang, S. Han, J. Wang, and F. Miao. Spatial-temporal-aware safe multi-agent reinforcement learning of connected autonomous vehicles in challenging scenarios. pages 5574–5580, 2023

  3. [3]

    N. Hyldmar, Y. He, and A. Prorok. A fleet of miniature cars for experiments in cooperative driving. Proceedings - IEEE International Conference on Robotics and Automation, 2019-May:3238–3244, 5 2019. ISSN 10504729. doi:10.1109/ICRA.2019.8794445

  4. [4]

    A. Miller, K. Rim, P. Chopra, P. Kelkar, and M. Likhachev. Cooperative perception and localization for cooperative driving. Proceedings - IEEE International Conference on Robotics and Automation, pages 1256–1262, 5 2020. ISSN 10504729. doi:10.1109/ICRA40945.2020.9197463

  5. [5]

    Z. Zhang, H. M. S. Ahmad, E. Sabouni, Y. Sun, F. Huang, W. Li, and F. Miao. Safety guaranteed robust multi-agent reinforcement learning with hierarchical control for connected and automated vehicles. 9 2023. URL https://arxiv.org/abs/2309.11057v2

  6. [6]

    S. Han, H. Wang, S. Su, Y. Shi, and F. Miao. Stable and efficient shapley value-based reward reallocation for multi-agent reinforcement learning of autonomous vehicles. Proceedings - IEEE International Conference on Robotics and Automation, pages 8765–8771, 3 2022. ISSN 10504729. doi:10.1109/ICRA46639.2022.9811626. URL https://arxiv.org/abs/2203.06333v2

  7. [7]

    J. Rios-Torres and A. A. Malikopoulos. A survey on the coordination of connected and automated vehicles at intersections and merging at highway on-ramps. IEEE Transactions on Intelligent Transportation Systems, 18:1066–1077, 5 2017. ISSN 15249050. doi:10.1109/TITS.2016.2600504

  8. [8]

    S. Han, S. Zhou, J. Wang, L. Pepin, C. Ding, J. Fu, and F. Miao. A multi-agent reinforcement learning approach for safe and efficient behavior planning of connected autonomous vehicles. IEEE Transactions on Intelligent Transportation Systems, 25(5):3654–3670, 2024. doi:10.1109/TITS.2023.3336670

  9. [9]

    A. Mokhtarian, P. Scheffe, M. Kloock, S. Schäfer, Heeseung Bang, Viet-Anh Le, Sangeet Ulhas, J. Betz, S. Wilson, S. Berman, A. Prorok, and B. Alrifaee. A survey on small-scale testbeds for connected and automated vehicles and robot swarms. 2024. doi:10.13140/RG.2.2.16176.74248/1. URL https://arxiv.org/abs/2408.03539

  10. [10]

    Y. Shao, M. A. M. Zulkefli, Z. Sun, and P. Huang. Evaluating connected and autonomous vehicles using a hardware-in-the-loop testbed and a living lab. Transportation Research Part C: Emerging Technologies, 102:121–135, 5 2019. ISSN 0968-090X. doi:10.1016/J.TRC.2019.03.010

  11. [11]

    C. Tang, B. Abbatematteo, J. Hu, R. Chandra, R. Martín-Martín, and P. Stone. Deep reinforcement learning for robotics: A survey of real-world successes. 8 2024. doi:10.1146/((please). URL https://arxiv.org/abs/2408.03539v2

  12. [12]

    Y. Feng, C. Hong, Y. Niu, S. Liu, Y. Yang, W. Yu, T. Zhang, J. Tan, and D. Zhao. Learning multi-agent loco-manipulation for long-horizon quadrupedal pushing, accepted, ICRA2025. URL https://arxiv.org/abs/2411.07104

  13. [13]

    P. Werner, T. Seyde, P. Drews, T. M. Balch, I. Gilitschenski, W. Schwarting, G. Rosman, S. Karaman, and D. Rus. Dynamic multi-team racing: Competitive driving on 1/10-th scale vehicles via learning in simulation. In 7th Annual Conference on Robot Learning, 2023. URL https://openreview.net/forum?id=fvXFBCHVGn

  14. [14]

    S. Han, S. Su, S. He, S. Han, H. Yang, and F. Miao. What is the solution for state adversarial multi-agent reinforcement learning? arXiv preprint arXiv:2212.02705, 2022

  15. [15]

    Y. Liang, Y. Sun, R. Zheng, and F. Huang. Efficient adversarial training without attacking: Worst-case-aware robust reinforcement learning. Advances in Neural Information Processing Systems, 35:22547–22561, 2022

  16. [16]

    M. Torne, A. Simeonov, Z. Li, A. Chan, T. Chen, A. Gupta, and P. Agrawal. Reconciling reality through simulation: A real-to-sim-to-real approach for robust manipulation. 3 2024. URL https://arxiv.org/abs/2403.03949v1

  17. [17]

    M. T. Villasevil, A. Jain, V. Macha, J. Yuan, L. L. Ankile, A. Simeonov, P. Agrawal, and A. Gupta. Scaling robot-learning by crowdsourcing simulation environments

  18. [18]

    W. Zhao, J. P. Queralta, and T. Westerlund. Sim-to-real transfer in deep reinforcement learning for robotics: A survey. 2020 IEEE Symposium Series on Computational Intelligence, SSCI 2020, pages 737–744, 12 2020. doi:10.1109/SSCI47803.2020.9308468

  19. [19]

    Y. Jiang, C. Wang, R. Zhang, J. Wu, and L. Fei-Fei. Transic: Sim-to-real policy transfer by learning from online correction. In Conference on Robot Learning, 2024

  20. [20]

    S. S. Sandha, L. Garcia, B. Balaji, F. Anwar, and M. Srivastava. Sim2real transfer for deep reinforcement learning with stochastic state transition delays. In J. Kober, F. Ramos, and C. Tomlin, editors, Proceedings of the 2020 Conference on Robot Learning, volume 155 of Proceedings of Machine Learning Research, pages 1066–1083. PMLR, 16–18 Nov 2021. URL ...

  21. [21]

    L. Brunke, M. Greeff, A. W. Hall, Z. Yuan, S. Zhou, J. Panerati, and A. P. Schoellig. Safe learning in robotics: From learning-based control to safe reinforcement learning. Annual Review of Control, Robotics, and Autonomous Systems, 5:411–444, 2022

  22. [22]

    I. ElSayed-Aly, S. Bharadwaj, C. Amato, R. Ehlers, U. Topcu, and L. Feng. Safe multi-agent reinforcement learning via shielding. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’21, page 483–491, Richland, SC, 2021. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450383073

  23. [23]

    Z. Cai, H. Cao, W. Lu, L. Zhang, and H. Xiong. Safe multi-agent reinforcement learning through decentralized multiple control barrier functions, 2021

  24. [24]

    J. Wang, S. Yang, Z. An, S. Han, Z. Zhang, R. Mangharam, M. Ma, and F. Miao. Multi-agent reinforcement learning guided by signal temporal logic specifications. arXiv preprint arXiv:2306.06808, 2023

  25. [25]

    S. He, S. Han, S. Su, S. Han, S. Zou, and F. Miao. Robust multi-agent reinforcement learning with state uncertainty. Transactions on Machine Learning Research, 2023

  26. [26]

    A. Mokhtarian, P. Scheffe, M. Kloock, S. Schäfer, Heeseung Bang, Viet-Anh Le, Sangeet Ulhas, J. Betz, S. Wilson, S. Berman, A. Prorok, and B. Alrifaee. A survey on small-scale testbeds for connected and automated vehicles and robot swarms. 2024. doi:10.13140/RG.2.2.16176.74248/1. URL https://rgdoi.net/10.13140/RG.2.2.16176.74248/1

  27. [27]

    Z. Qin, H. Wang, and X. Li. Ultra fast structure-aware deep lane detection. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16, pages 276–291. Springer, 2020

  28. [28]

    Y. Li, D. Ma, Z. An, Z. Wang, Y. Zhong, S. Chen, and C. Feng. V2x-sim: Multi-agent collaborative perception dataset and benchmark for autonomous driving. IEEE Robotics and Automation Letters, 7:10914–10921, 2 2022. ISSN 23773766. doi:10.1109/LRA.2022.3192802. URL https://arxiv.org/abs/2202.08449v2

  29. [29]

    H. M. S. Ahmad, E. Sabouni, A. Dickson, W. Xiao, C. G. Cassandras, and W. Li. Secure control of connected and automated vehicles using trust-aware robust event-triggered control barrier functions, 2024. URL https://arxiv.org/abs/2401.02306

  30. [30]

    J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli. Autonomous driving using model predictive control and a kinematic bicycle vehicle model. In Intelligent Vehicles Symposium, Seoul, Korea, 2015

  31. [31]

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

A Appendix

A.1 Modeling and Algorithmic Details

A.1.1 Vehicle Dynamic Model

We adopt a kinematic bicycle model to describe the motion of each F1/10th vehicle. The state of each vehicle is represented as x = [X, ...
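
The appendix fragment above is cut off before the state vector is fully defined, so the paper's exact formulation is not recoverable from this page. For orientation only, the sketch below shows one common textbook form of the kinematic bicycle model (rear-axle reference point, forward-Euler integration). The state layout `(x, y, heading, speed)`, the control inputs, the function name, and the 0.3 m wheelbase are all assumptions and may differ from what the authors use.

```python
import math


def bicycle_step(state, accel, steer, wheelbase, dt):
    """One forward-Euler step of a rear-axle kinematic bicycle model.

    state     : (x, y, heading, speed), an assumed layout, not the paper's
    accel     : longitudinal acceleration command (m/s^2)
    steer     : front-wheel steering angle (rad)
    wheelbase : distance between axles (m)
    dt        : integration step (s)
    """
    x, y, heading, speed = state
    x_next = x + speed * math.cos(heading) * dt
    y_next = y + speed * math.sin(heading) * dt
    heading_next = heading + (speed / wheelbase) * math.tan(steer) * dt
    speed_next = speed + accel * dt
    return (x_next, y_next, heading_next, speed_next)


# Assumed example: a small vehicle driving straight at 1 m/s for one 0.1 s step.
state = (0.0, 0.0, 0.0, 1.0)
print(bicycle_step(state, accel=0.0, steer=0.0, wheelbase=0.3, dt=0.1))
```

With zero steering the heading stays constant and the vehicle advances by speed times dt along its current direction; a positive steering angle increases the heading at a rate proportional to speed and inversely proportional to the wheelbase.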