Zero-Shot MARL Benchmark in the Cyber-Physical Mobility Lab

Bassam Alrifaee; Fynn Belderink; Jianye Xu; Julius Beerwerth; Simon Schäfer

arxiv: 2601.16578 · v2 · pith:TEHC4IS4new · submitted 2026-01-23 · 💻 cs.RO · cs.SY · eess.SY


Julius Beerwerth, Jianye Xu, Simon Schäfer, Fynn Belderink, Bassam Alrifaee

Pith reviewed 2026-05-16 12:20 UTC · model grok-4.3

classification 💻 cs.RO · cs.SY · eess.SY

keywords sim-to-real transfer · multi-agent reinforcement learning · connected and automated vehicles · benchmark · zero-shot evaluation · motion planning · digital twin


The pith

A benchmark platform tests MARL policies for vehicles across simulation, digital twin, and physical hardware in zero-shot fashion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a reproducible benchmark built on the Cyber-Physical Mobility Lab to measure how multi-agent reinforcement learning policies for connected automated vehicles transfer from simulation to real hardware without retraining. It integrates three domains (pure simulation, a high-fidelity digital twin, and the physical testbed) and runs the same policy in each while recording its performance. Deploying one SigmaRL-trained policy reveals two distinct drops: one caused by differences between the control software stacks and one caused by added environmental realism. A reader would care because MARL controllers for vehicles must cross this gap reliably before they can be used in traffic, and the open-source platform makes such measurements repeatable and comparable.

Core claim

The paper claims that the CPM Lab setup supplies a structured three-domain platform for zero-shot evaluation of MARL motion-planning policies, and that running a SigmaRL-trained policy through simulation, digital twin, and hardware separates the resulting performance degradation into two sources: architectural mismatches between control stacks, and the additional gap caused by increasing environmental realism.

What carries the argument

The integrated benchmark platform that combines simulation, high-fidelity digital twin, and physical testbed to enable zero-shot MARL policy evaluation for connected automated vehicles.
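The zero-shot protocol described above can be sketched as a harness that hands one unchanged policy object to each domain in turn. This is a minimal illustration under assumptions: the `Domain` wrapper, `evaluate_zero_shot`, the scalar episode score, and the toy scoring lambdas are all hypothetical and do not reflect the actual CPM Lab or SigmaRL interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical policy type: maps an observation vector to an action vector.
Policy = Callable[[Sequence[float]], List[float]]


@dataclass(frozen=True)
class Domain:
    """Hypothetical stand-in for one evaluation domain (simulation, digital twin, or lab)."""

    name: str
    # Runs one episode with the given policy and seed; returns a score (higher is better).
    run_episode: Callable[[Policy, int], float]


def evaluate_zero_shot(
    policy: Policy, domains: Sequence[Domain], seeds: Sequence[int]
) -> Dict[str, List[float]]:
    """Run the same policy object in every domain; nothing here retrains or fine-tunes it."""
    results: Dict[str, List[float]] = {}
    for domain in domains:
        results[domain.name] = [domain.run_episode(policy, seed) for seed in seeds]
    return results


if __name__ == "__main__":
    # Toy policy and domains with made-up scoring rules, for illustration only.
    def toy_policy(obs: Sequence[float]) -> List[float]:
        return [0.5 * x for x in obs]

    domains = [
        Domain("simulation", lambda pi, seed: sum(pi([1.0, 1.0]))),
        Domain("digital_twin", lambda pi, seed: 0.9 * sum(pi([1.0, 1.0]))),
        Domain("physical_lab", lambda pi, seed: 0.8 * sum(pi([1.0, 1.0]))),
    ]
    print(evaluate_zero_shot(toy_policy, domains, seeds=[0, 1]))
```

Keeping the policy outside the domain objects makes the "no retraining" condition structural: each domain receives the same policy object and only calls it.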

If this is right

MARL motion planners for connected vehicles can be evaluated for transfer without retraining on each new domain.
Performance losses can be attributed separately to control-stack differences versus added realism.
An open-source multi-domain testbed supports systematic, reproducible study of sim-to-real issues in vehicle MARL.
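Attributing losses separately amounts to comparing mean performance across consecutive domains while the policy is held fixed. The sketch below assumes a scalar per-run score (higher is better) and the ordering simulation, digital twin, physical lab. This summary does not state which step corresponds to control-stack differences and which to added realism, so the hypothetical `stepwise_drops` only reports the two step-wise drops. The toy scores are made up and are not results from the paper.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def stepwise_drops(
    scores: Dict[str, List[float]], order: Sequence[str]
) -> List[Tuple[str, str, float, float]]:
    """Return (from_domain, to_domain, absolute_drop, relative_drop) per consecutive pair.

    A positive drop means the later, more realistic domain scored lower on average.
    """
    drops = []
    for src, dst in zip(order, order[1:]):
        m_src, m_dst = mean(scores[src]), mean(scores[dst])
        absolute = m_src - m_dst
        relative = absolute / m_src if m_src != 0 else float("nan")
        drops.append((src, dst, absolute, relative))
    return drops


if __name__ == "__main__":
    # Made-up per-run scores, not results from the paper.
    toy = {
        "simulation": [1.0, 1.0],
        "digital_twin": [0.8, 0.8],
        "physical_lab": [0.6, 0.6],
    }
    for row in stepwise_drops(toy, ["simulation", "digital_twin", "physical_lab"]):
        print(row)
```

Because the absolute drops telescope, they sum to the end-to-end simulation-to-lab drop, which is what makes a two-step split of the total loss well defined.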

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same platform could be used to compare transfer performance across different MARL algorithms beyond the one tested.
Aligning simulation control interfaces more closely with hardware might reduce one of the two observed degradation sources.
Policies trained with explicit awareness of these gaps could show smaller drops when moved to hardware.

Load-bearing premise

The chosen CPM Lab configuration and the particular SigmaRL policy sufficiently represent the general sim-to-real transfer problems that arise for MARL methods in connected automated vehicles.

What would settle it

A deployment of the same policy that shows no measurable performance loss across the three domains, or whose losses are attributable to factors other than control-stack architecture and environmental realism, would falsify the claimed two-source attribution.

Figures

Figures reproduced from arXiv: 2601.16578 by Bassam Alrifaee, Fynn Belderink, Jianye Xu, Julius Beerwerth, Simon Schäfer.

**Figure 1.** Illustration of how SigmaRL’s policy output is applied within the control horizon and then transitioned to a rules-based approach for the remaining prediction horizon, yielding a complete trajectory for each agent 𝑖 in the scene.

4 Evaluation. We present a baseline evaluation to assess the zero-shot sim-to-real transferability of a policy trained in simulation using SigmaRL. Code reproducing the simulation…
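The trajectory assembly in Figure 1 can be sketched as follows, under loud assumptions: the policy is taken to emit `(speed, yaw_rate)` pairs for each step of the control horizon, vehicles follow a simple unicycle model, and the rules-based remainder is replaced by a hypothetical constant-speed, constant-heading rule. None of these choices come from the paper; `State`, `step`, and `assemble_trajectory` are illustrative names.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class State:
    x: float        # position [m]
    y: float        # position [m]
    heading: float  # [rad]
    speed: float    # [m/s]


def step(s: State, speed: float, yaw_rate: float, dt: float) -> State:
    """One step of a simple unicycle model (an assumption, not the paper's vehicle model)."""
    heading = s.heading + yaw_rate * dt  # update heading first, then move along it
    return State(
        s.x + speed * math.cos(heading) * dt,
        s.y + speed * math.sin(heading) * dt,
        heading,
        speed,
    )


def assemble_trajectory(
    start: State,
    policy_actions: Sequence[Tuple[float, float]],
    prediction_horizon: int,
    dt: float,
) -> List[State]:
    """Policy actions fill the control horizon; a placeholder rule fills the rest.

    Returns prediction_horizon + 1 states (including the start state).
    """
    if len(policy_actions) > prediction_horizon:
        raise ValueError("control horizon cannot exceed the prediction horizon")
    trajectory = [start]
    # Control horizon: apply each (speed, yaw_rate) action produced by the learned policy.
    for speed, yaw_rate in policy_actions:
        trajectory.append(step(trajectory[-1], speed, yaw_rate, dt))
    # Remaining horizon: hypothetical rule that keeps the last speed and heading.
    while len(trajectory) - 1 < prediction_horizon:
        last = trajectory[-1]
        trajectory.append(step(last, last.speed, 0.0, dt))
    return trajectory
```

The split keeps the learned part and the rule-based part independent: swapping in a different fallback rule only changes the `while` loop.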

**Figure 2.** Example initial setup and representative trajectories for one configuration. One representative run per environment is shown, using the same initial positions across all environments; the run was selected from the digital twin as the one with centerline deviation closest to that environment’s mean. Trajectories are shown over the 18 s evaluation horizon; in the physical lab, a collision during this interva…
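The representative-run selection in Figure 2 can be sketched as picking, among one environment's runs, the run whose centerline deviation is nearest the mean over those runs. The deviation metric here (mean point-to-polyline distance) and the function names are assumptions for illustration; the caption does not give the paper's exact metric definition.

```python
import math
from statistics import mean
from typing import Sequence, Tuple

Point = Tuple[float, float]


def point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Euclidean distance from point p to the segment from a to b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def centerline_deviation(traj: Sequence[Point], centerline: Sequence[Point]) -> float:
    """Assumed metric: mean distance from each trajectory point to the centerline polyline."""
    segments = list(zip(centerline, centerline[1:]))
    return mean(min(point_segment_distance(p, a, b) for a, b in segments) for p in traj)


def representative_run(runs: Sequence[Sequence[Point]], centerline: Sequence[Point]) -> int:
    """Index of the run whose deviation is closest to the mean deviation over all runs."""
    devs = [centerline_deviation(run, centerline) for run in runs]
    target = mean(devs)
    return min(range(len(runs)), key=lambda i: abs(devs[i] - target))
```

Selecting the run closest to the mean, rather than the best run, avoids showing an unrepresentatively clean trajectory.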

Original abstract

We present a reproducible benchmark for evaluating sim-to-real transfer of Multi-Agent Reinforcement Learning (MARL) policies for Connected and Automated Vehicles (CAVs). The platform, based on the Cyber-Physical Mobility Lab (CPM Lab) [1], integrates simulation, a high-fidelity digital twin, and a physical testbed, enabling structured zero-shot evaluation of MARL motion-planning policies. We demonstrate its use by deploying a SigmaRL-trained policy [2] across all three domains, revealing two complementary sources of performance degradation: architectural differences between simulation and hardware control stacks, and the sim-to-real gap induced by increasing environmental realism. The open-source setup enables systematic analysis of sim-to-real challenges in MARL under realistic, reproducible conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents a reproducible benchmark platform based on the Cyber-Physical Mobility Lab (CPM Lab) for zero-shot evaluation of sim-to-real transfer in multi-agent reinforcement learning (MARL) policies for connected and automated vehicles (CAVs). The platform integrates simulation, a high-fidelity digital twin, and a physical hardware testbed. Its use is demonstrated by deploying a single SigmaRL-trained policy across all three domains, which reveals two sources of performance degradation: architectural mismatches between simulation and hardware control stacks, and the sim-to-real gap from increasing environmental realism. The setup is open-source to support systematic analysis.

Significance. If the attribution of degradation sources holds under broader testing, the benchmark would provide a valuable, structured tool for reproducible study of sim-to-real challenges in MARL for CAVs, addressing a practical gap in the field. The integration of all three domains, from simulation through hardware, and the open-source release are clear strengths that could enable community follow-up work.

major comments (1)

[Demonstration / experimental evaluation] The central claim that the benchmark isolates two complementary sources of performance degradation (control-stack architecture differences and environmental realism) rests on deployment of only a single SigmaRL-trained policy. This leaves open the possibility that the observed drops arise from policy-specific features (e.g., observation space, reward design, or centralized training) rather than the stated general factors. Evaluation with at least one additional MARL algorithm possessing different coordination mechanisms is required to support the generality of the identified sources.

minor comments (1)

[Abstract] The abstract states that the demonstration 'reveals' the two degradation sources but provides no quantitative metrics, tables of performance values, or error analysis; the full manuscript should include these to allow readers to assess the magnitude and statistical reliability of the reported drops.
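One standard way to report the statistical reliability the referee asks for is a percentile-bootstrap confidence interval on the mean drop between two domains. This is a generic sketch, not the paper's analysis: `bootstrap_drop_ci` is a hypothetical helper, runs are assumed independent, and each domain's runs are resampled separately.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_drop_ci(
    before: Sequence[float],
    after: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Point estimate and percentile-bootstrap CI for mean(before) - mean(after).

    The two domains are resampled independently (runs assumed independent).
    """
    rng = random.Random(seed)
    diffs = sorted(
        mean(rng.choices(before, k=len(before))) - mean(rng.choices(after, k=len(after)))
        for _ in range(n_resamples)
    )
    # Nearest-rank percentiles of the sorted bootstrap differences.
    lo_idx = int(round((alpha / 2) * (n_resamples - 1)))
    hi_idx = int(round((1 - alpha / 2) * (n_resamples - 1)))
    return mean(before) - mean(after), diffs[lo_idx], diffs[hi_idx]
```

If the interval excludes zero, the drop is unlikely to be explained by run-to-run variation alone; this could also serve as one operationalization of the "no measurable performance loss" condition in "What would settle it".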

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and describe the planned revisions.

Point-by-point responses

Referee: [Demonstration / experimental evaluation] The central claim that the benchmark isolates two complementary sources of performance degradation (control-stack architecture differences and environmental realism) rests on deployment of only a single SigmaRL-trained policy. This leaves open the possibility that the observed drops arise from policy-specific features (e.g., observation space, reward design, or centralized training) rather than the stated general factors. Evaluation with at least one additional MARL algorithm possessing different coordination mechanisms is required to support the generality of the identified sources.

Authors: We agree that reliance on a single SigmaRL policy limits the strength of the generality claim for the two identified degradation sources. While the benchmark platform itself is the primary contribution and the observed drops are attributable to control-stack mismatches and increasing realism (as the policy is held fixed across domains), we acknowledge that policy-specific factors cannot be fully ruled out without broader testing. In the revised manuscript we will add results from at least one additional MARL algorithm with different coordination mechanisms (e.g., a fully decentralized variant) to confirm that the same degradation patterns appear across policy classes. revision: yes
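Checking that "the same degradation patterns appear across policy classes" requires an operational definition. One hypothetical choice, sketched below, is to compare the sign of the mean-score change at each consecutive domain step (drop, within tolerance, or gain) across policies. The policy names, domain names, tolerance, and scores are all illustrative assumptions.

```python
from typing import Dict, Sequence, Tuple

Pattern = Tuple[int, ...]


def drop_pattern(per_domain: Dict[str, float], order: Sequence[str], tol: float) -> Pattern:
    """Sign of the mean-score drop per consecutive step: 1 drop, 0 within tol, -1 gain."""
    signs = []
    for src, dst in zip(order, order[1:]):
        drop = per_domain[src] - per_domain[dst]
        signs.append(0 if abs(drop) <= tol else (1 if drop > 0 else -1))
    return tuple(signs)


def patterns_agree(
    mean_scores: Dict[str, Dict[str, float]], order: Sequence[str], tol: float = 0.0
) -> Tuple[bool, Dict[str, Pattern]]:
    """Check whether all policies show the same step-wise degradation pattern."""
    patterns = {name: drop_pattern(d, order, tol) for name, d in mean_scores.items()}
    return len(set(patterns.values())) <= 1, patterns
```

The tolerance matters: with a larger `tol`, a small drop at one step is treated as "no change", which can turn agreement into disagreement.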

Circularity Check

0 steps flagged

The benchmark is a self-contained empirical evaluation with no derivation chain

full rationale

The paper introduces a platform integrating simulation, digital twin, and physical testbed for zero-shot MARL evaluation in CAVs, then reports empirical results from deploying one external SigmaRL policy. No equations, fitted parameters, or predictions appear that reduce to inputs by construction. Citations to prior CPM Lab and SigmaRL work are external references rather than load-bearing self-citations that justify a uniqueness theorem or ansatz. The central claims rest on observed performance differences across domains, which are falsifiable measurements rather than self-defined or renamed known results. This is a standard benchmark paper with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on parameters or assumptions beyond the described platform.

pith-pipeline@v0.9.0 · 5440 in / 1009 out tokens · 38026 ms · 2026-05-16T12:20:33.824821+00:00 · methodology
