arxiv: 2511.21304 · v2 · submitted 2025-11-26 · 📡 eess.SY · cs.SY

Sparse shepherding control of large-scale multi-agent systems via Reinforcement Learning

Luigi Catello, Italo Napolitano, Davide Salzano, Mario di Bernardo

Pith reviewed 2026-05-17 05:05 UTC · model grok-4.3

classification 📡 eess.SY cs.SY

keywords reinforcement learning · multi-agent systems · sparse control · PDE-ODE coupling · density control · shepherding · large-scale systems · robust control


The pith

Reinforcement learning lets a few controlled agents steer the density of a large uncontrolled population.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a reinforcement learning method for indirect sparse control of large multi-agent groups. Controlled agents are modeled by ODEs while the many uncontrolled agents are represented by a PDE that tracks their collective density. The approach adds adaptive compensation for interaction strengths to make learning feasible despite limited actuation. Numerical tests confirm that the trained policy reaches target distributions and holds up under disturbances and noise, offering a cheaper alternative to repeated online optimization.

Core claim

The central claim is that a model-free reinforcement learning policy, combined with adaptive interaction strength compensation, can learn sparse control inputs for a small number of ODE agents that drive the macroscopic density of the uncontrolled population, described by a PDE, to desired target distributions, as shown by numerical validation that includes robustness to disturbances and measurement noise.

What carries the argument

The ODE-PDE hybrid model together with a model-free reinforcement learning policy that incorporates adaptive compensation for interaction strengths between controlled and uncontrolled agents.
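A minimal sketch of how such a hybrid model can be simulated, under stated assumptions. The grid spacing, time steps, diffusion coefficient, and number of herders are taken from the parameter values listed with Figure 2, and the single-integrator herder model is the one quoted further down this page. The repulsive kernel in `velocity_field`, the upwind finite-volume scheme, the periodic domain [-π, π), and the placeholder policy `still` are illustrative choices that are not taken from the paper.

```python
import numpy as np

# Grid and time steps follow the values listed with Figure 2; the kernel
# shape and the herder inputs below are hypothetical, for illustration only.
N = 250
dx = 2 * np.pi / N
x = -np.pi + dx * np.arange(N)      # assumed periodic domain [-pi, pi)
D = 0.05                            # diffusion coefficient
dt_pde = 0.0005                     # PDE time step
dt_ode = 0.01                       # herder (ODE) time step


def wrap(d):
    """Wrap a position or displacement to the interval [-pi, pi)."""
    return (d + np.pi) % (2 * np.pi) - np.pi


def velocity_field(x, herders, strength=1.0, length=1.0):
    """Assumed repulsive kernel: each herder pushes nearby density away."""
    d = wrap(x[:, None] - herders[None, :])
    return strength * np.sum(np.sign(d) * np.exp(-np.abs(d) / length), axis=1)


def pde_step(rho, v, dt):
    """One explicit finite-volume step of rho_t + (v rho)_x = D rho_xx."""
    v_face = 0.5 * (v + np.roll(v, -1))                   # velocity at face i+1/2
    rho_up = np.where(v_face > 0, rho, np.roll(rho, -1))  # upwind density
    flux = v_face * rho_up - D * (np.roll(rho, -1) - rho) / dx
    return rho - dt * (flux - np.roll(flux, 1)) / dx


def simulate(rho0, herders0, policy, t_end):
    """Advance herders (ODE) and density (PDE) with nested time steps."""
    rho, herders = rho0.copy(), herders0.copy()
    substeps = round(dt_ode / dt_pde)           # 20 PDE steps per herder step
    for _ in range(round(t_end / dt_ode)):
        u = policy(rho, herders)                # one control input per herder
        herders = wrap(herders + dt_ode * u)    # single-integrator herders
        v = velocity_field(x, herders)
        for _ in range(substeps):
            rho = pde_step(rho, v, dt_pde)
    return rho, herders


rho0 = np.full(N, 1.0 / (2 * np.pi))            # uniform initial density
herders0 = np.array([-1.0, 1.0])                # N_H = 2 herders
still = lambda rho, herders: np.zeros_like(herders)   # placeholder policy
rho, herders = simulate(rho0, herders0, still, t_end=1.0)
print("total mass:", rho.sum() * dx)
```

The `policy` argument marks the slot where a trained policy would go; `still` only keeps the herders in place so that the script runs. The flux form is used because it conserves total mass up to rounding, which makes the printed mass a quick sanity check.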

If this is right

Target density distributions are reached in numerical simulations of the hybrid system.
Performance remains stable in the presence of external disturbances and sensor noise.
The learned policy replaces the need for repeated real-time optimization at each control step.
Sparse actuation by only a few agents suffices to shape the collective behavior of many others.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same learned-policy idea could apply to other continuum systems such as traffic flow or biological swarms where direct control of every individual is impossible.
Hardware experiments with real robots would test whether the reported robustness survives discretization errors and unmodeled dynamics.
Linking the method to mean-field control theory might yield convergence guarantees that the current numerical evidence does not yet provide.

Load-bearing premise

The large population of uncontrolled agents can be accurately described by a continuum PDE density model whose interactions with the controlled agents stay well-behaved under the learned policy.

What would settle it

Simulations that replace the PDE with a very large but finite number of discrete agents and check whether the same learned policy still drives the empirical density to the target distribution within small error bounds.
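The kind of check described above can be sketched as follows. This is not the paper's experiment: it uses one static herder with an assumed repulsive drift instead of a learned policy, an Euler-Maruyama scheme for the finite-agent random walkers, and a simple upwind solver for the continuum density. Only the grid size and diffusion coefficient come from the values listed with Figure 2; the function names, the particle time step, and the agent counts are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_BINS, D = 250, 0.05
dx = 2 * np.pi / N_BINS
x = -np.pi + dx * (np.arange(N_BINS) + 0.5)   # bin centers on [-pi, pi)


def wrap(d):
    """Wrap a position or displacement to the interval [-pi, pi)."""
    return (d + np.pi) % (2 * np.pi) - np.pi


def drift(pos, herder=0.0):
    """Assumed repulsive drift away from one static herder (illustrative)."""
    d = wrap(pos - herder)
    return np.sign(d) * np.exp(-np.abs(d))


def particles(n_agents, t_end, dt=0.005):
    """Euler-Maruyama for dX = V(X) dt + sqrt(2 D) dW on the circle."""
    pos = rng.uniform(-np.pi, np.pi, n_agents)
    for _ in range(round(t_end / dt)):
        noise = rng.standard_normal(n_agents)
        pos = wrap(pos + drift(pos) * dt + np.sqrt(2 * D * dt) * noise)
    hist, _ = np.histogram(pos, bins=N_BINS, range=(-np.pi, np.pi))
    return hist / (n_agents * dx)             # empirical density per bin


def continuum(t_end, dt=0.0005):
    """Upwind finite-volume solution of rho_t + (V rho)_x = D rho_xx."""
    rho = np.full(N_BINS, 1 / (2 * np.pi))
    v = drift(x)
    v_face = 0.5 * (v + np.roll(v, -1))       # velocity at face i+1/2
    for _ in range(round(t_end / dt)):
        up = np.where(v_face > 0, rho, np.roll(rho, -1))
        flux = v_face * up - D * (np.roll(rho, -1) - rho) / dx
        rho = rho - dt * (flux - np.roll(flux, 1)) / dx
    return rho


rho_pde = continuum(t_end=1.0)
errors = {}
for n in (1_000, 100_000):
    diff = particles(n, 1.0) - rho_pde
    errors[n] = np.sqrt(np.sum(diff ** 2) * dx)   # discrete L2 discrepancy
    print(f"N = {n:>7d}  L2 discrepancy = {errors[n]:.3f}")
```

If the continuum description is adequate, the discrepancy should shrink as the number of agents grows, up to histogram and time-stepping error. Running the same comparison with the trained policy in the loop is the test the paragraph above calls for.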

Figures

Figures reproduced from arXiv: 2511.21304 by Davide Salzano, Italo Napolitano, Luigi Catello, Mario di Bernardo.

**Figure 1.** Proposed control architecture. The macro-micro controller computes the control actions u(t) knowing the desired target distribution ρ^T(x) and sensing the current herders' position H(t). The herders influence the target distribution ρ^T(x, t) through the velocity field V(x, t).

**Figure 2.** Episodic reward during the training process of the PPO agent. Values are smoothed with a moving average of width 20 steps.

Parameters listed alongside the figure (symbol · parameter · value):
Δx · Spatial step size · 2π/250
Δt · Time step size for ODE simulation · 0.01
Δt_PDE · Time step size for PDE simulation · 0.0005
T_h · Time horizon · 150
N_H · Number of herders · 2
D · Diffusion coefficient · 0.05
L · Kernel interaction length · π
κ · Concentration of Von Mises distribution · 16…
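The smoothing mentioned in the Figure 2 caption can be sketched as below. The caption only gives the window width (20); whether the paper uses a trailing or centered window, and how it treats the first few values, is not stated, so this sketch simply averages every full window. The reward series is synthetic.

```python
import numpy as np


def moving_average(values, width=20):
    """Average over every full window of `width` consecutive values."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(width) / width
    # mode="valid" keeps only windows that lie fully inside the data,
    # so the output has len(values) - width + 1 entries.
    return np.convolve(values, kernel, mode="valid")


rewards = np.arange(100, dtype=float)   # synthetic stand-in for episodic rewards
smoothed = moving_average(rewards)
print(len(smoothed), smoothed[0])
```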

**Figure 3.** Performance of the learned policy using the reward function in (10) and the compensation law in (12). (a) L2 norm of the target distribution error e^T (red) and Euclidean vector norm of the control effort ∥u∥2 (blue) over time; (b) Top panel: Desired target density distribution ρ^T in space. Bottom panel: Evolution of the targets' density and of the herders' positions in space (x axis) and time (y axis) for …

**Figure 4.** Evolution of the …

**Figure 5.** Robustness analysis of the proposed control strategy against constant disturbances (panels a, b) and measurement noise (panels c, d). (a) Steady-state error e^T_ss varying the constant disturbance on the herders' dynamics. (b) Top panel: Desired target density distribution ρ^T in space. Bottom panel: Evolution of the targets' density and of the herders' positions in space (x axis) and time (y axis) for a repr…

Original abstract

We propose a Reinforcement Learning framework for sparse indirect control of large-scale multi-agent systems, where few controlled agents shape the collective behavior of many uncontrolled agents. The approach addresses this multi-scale challenge by coupling ODEs (modeling controlled agents) with a PDE (describing the uncontrolled population density), capturing how microscopic control achieves macroscopic objectives. Our method combines model-free Reinforcement Learning with adaptive interaction strength compensation to overcome sparse actuation limitations. Numerical validation demonstrates effective density control, with the system achieving target distributions while maintaining robustness to disturbances and measurement noise, confirming that learning-based sparse control can replace computationally expensive online optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper pairs model-free RL with an ODE-PDE hybrid to steer large crowds via a few agents, and the numerics look workable inside that model, but the mean-field approximation gets no error analysis.


The core contribution is a reinforcement learning setup that lets a small number of controlled agents shape the density of a much larger uncontrolled population. They model the controlled agents with ODEs and the rest with a Fokker-Planck-style PDE, then add an adaptive term that compensates for the sparse actuation by adjusting interaction strengths during learning. The numerical examples show the closed-loop system reaching target densities and holding up under added noise and disturbances, which suggests the learned policy can stand in for heavier online optimization methods in simulation.

Referee Report

2 major / 1 minor

Summary. The paper proposes a reinforcement learning framework for sparse indirect control of large-scale multi-agent systems. Controlled agents are modeled via ODEs while the uncontrolled population is represented by a Fokker-Planck-type PDE density; model-free RL with adaptive interaction compensation is used to learn policies that drive the system toward target densities. Numerical validation on the hybrid model is reported to demonstrate effective density tracking together with robustness to disturbances and measurement noise, positioning the method as a computationally lighter alternative to online optimization.

Significance. If the approximation quality and transfer to the discrete particle system can be established, the hybrid ODE-PDE plus RL approach would offer a scalable route to shepherding large swarms without requiring real-time solution of high-dimensional optimization problems. The combination of model-free learning with explicit multi-scale modeling is a constructive step for indirect control of continuum limits.

major comments (2)

[Section 3] The coupling of the controlled agents' ODEs to the Fokker-Planck PDE for the uncontrolled density is introduced without a priori error bounds, convergence rates, or mean-field limit analysis. Because the learned policy may generate non-smooth or localized forcing, it is unclear whether the continuum approximation remains accurate for finite (even if large) agent counts; all reported robustness results are obtained exclusively inside the hybrid model.
[Numerical validation] The claims that the learned sparse controller achieves target distributions and remains robust are supported only by simulations of the continuum hybrid system. No discrete-to-continuum discrepancy checks, ablation studies on the number of agents, or direct comparisons against the underlying finite-agent dynamics are provided, leaving open whether the reported performance carries over to the original multi-agent system.

minor comments (1)

[Abstract] The abstract states the numerical results qualitatively but supplies no concrete metrics (e.g., L1 or Wasserstein density errors, success rates, or baseline comparisons), which would help readers gauge the practical improvement over optimization-based methods.
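As an illustration of the kind of concrete metric the comment asks for, the sketch below computes discrete L1 and L2 errors between two densities on a periodic grid. The grid size follows the spatial step listed with Figure 2. The von Mises target is an assumption suggested by the concentration parameter κ in that list, and `kappa=4.0` is an arbitrary illustrative value, not the paper's. None of the function names come from the paper.

```python
import numpy as np

N = 250
dx = 2 * np.pi / N
x = -np.pi + dx * np.arange(N)      # assumed periodic grid on [-pi, pi)


def von_mises(x, mu, kappa):
    """Von Mises density on the circle; np.i0 is the modified Bessel I0."""
    return np.exp(kappa * np.cos(x - mu)) / (2 * np.pi * np.i0(kappa))


def l1_error(rho, target):
    """Discrete L1 distance between two densities on the grid."""
    return np.sum(np.abs(rho - target)) * dx


def l2_error(rho, target):
    """Discrete L2 distance between two densities on the grid."""
    return np.sqrt(np.sum((rho - target) ** 2) * dx)


target = von_mises(x, mu=0.0, kappa=4.0)    # kappa is illustrative only
uniform = np.full(N, 1 / (2 * np.pi))       # stand-in for a simulated density
print("L1:", l1_error(uniform, target))
print("L2:", l2_error(uniform, target))
```

For two probability densities the L1 error lies between 0 and 2, which makes it easy to read as a fraction of misplaced mass; the L2 error matches the norm used in the Figure 3 caption.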

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The comments highlight important aspects regarding the theoretical justification of the hybrid modeling approach and the extent of numerical validation. We provide point-by-point responses below and indicate the revisions we intend to incorporate in the revised version of the paper.


Referee: [Section 3] The coupling of the controlled agents' ODEs to the Fokker-Planck PDE for the uncontrolled density is introduced without a priori error bounds, convergence rates, or mean-field limit analysis. Because the learned policy may generate non-smooth or localized forcing, it is unclear whether the continuum approximation remains accurate for finite (even if large) agent counts; all reported robustness results are obtained exclusively inside the hybrid model.

Authors: We agree that the manuscript does not include a rigorous mean-field limit analysis or a priori error bounds for the hybrid ODE-PDE coupling, particularly under the potentially non-smooth controls generated by the RL policy. The hybrid model is introduced as a practical approximation for large-scale systems, leveraging the fact that the uncontrolled agents are numerous and can be represented by their density evolution via the Fokker-Planck equation, while the few controlled agents are tracked individually via ODEs. This modeling choice is common in multi-scale multi-agent control literature. However, we acknowledge the validity of the concern for finite agent numbers. In the revision, we will add a dedicated paragraph in Section 3 discussing the modeling assumptions, referencing related mean-field results, and explicitly stating the limitations regarding error bounds and the potential impact of non-smooth forcing. We will also note that all robustness claims are with respect to the hybrid model and clarify this scope. revision: partial
Referee: [Numerical validation] The claims that the learned sparse controller achieves target distributions and remains robust are supported only by simulations of the continuum hybrid system. No discrete-to-continuum discrepancy checks, ablation studies on the number of agents, or direct comparisons against the underlying finite-agent dynamics are provided, leaving open whether the reported performance carries over to the original multi-agent system.

Authors: We concur that demonstrating the consistency between the hybrid continuum model and the underlying discrete multi-agent system is essential to support the applicability of our method. The current numerical results focus on the hybrid model because it directly corresponds to the large-scale regime targeted by the approach. To address this comment, we will perform and include additional numerical experiments in the revised manuscript. Specifically, we will simulate the finite-agent system with increasing numbers of uncontrolled agents and compare the resulting density evolution to the hybrid model predictions. This will include discrepancy metrics and robustness tests under disturbances for both models. We will also add an ablation study varying the total number of agents to illustrate convergence to the continuum limit. These additions will provide evidence that the performance observed in the hybrid model translates to the discrete setting. revision: yes

standing simulated objections not resolved

A full, rigorous mean-field convergence analysis with rates for the hybrid system under RL-generated controls remains unaddressed, since it would require substantial theoretical development beyond the scope of the current work.

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper proposes an RL-based sparse control framework by coupling controlled-agent ODEs to a Fokker-Planck-type PDE for uncontrolled density as an explicit modeling choice to handle multi-scale shepherding. The RL policy is model-free and trained inside the stated hybrid model; no derivation step reduces a claimed prediction or uniqueness result to a fitted parameter, self-citation chain, or ansatz that is itself defined by the target outcome. Numerical validation occurs within the hybrid model without any self-definitional loop (e.g., no ratio or density target fitted from the same data then re-predicted). The central claim therefore remains independent of its inputs and is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the modeling assumption that a PDE continuum description remains valid for the uncontrolled population and that RL can discover compensating policies for the sparse actuation.

axioms (1)

Domain assumption: Uncontrolled agents admit a continuum PDE density approximation whose interaction with controlled agents is adequately captured by the chosen coupling terms.
Invoked when the paper states that ODEs for controlled agents are coupled with a PDE for the uncontrolled population density.
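
For orientation, one generic way to write such a coupling in one spatial dimension is given below. This is a sketch consistent with the passages quoted on this page (single-integrator herders, diffusing targets, a herder-induced velocity field V), not a transcription of the paper's equations: the interaction kernel f and the additive form of V are assumptions, and the paper's equation (4) may contain further terms.

```latex
\begin{aligned}
\dot{H}_i(t) &= u_i(t), \qquad i = 1, \dots, N_H, \\
\partial_t \rho^T(x,t) + \partial_x\!\left[\rho^T(x,t)\, V(x,t)\right] &= D\, \partial_{xx} \rho^T(x,t), \\
V(x,t) &= \sum_{i=1}^{N_H} f\bigl(x - H_i(t)\bigr).
\end{aligned}
```

Here H_i are the herder positions, u_i their control inputs, ρ^T the density of the uncontrolled targets, and D the diffusion coefficient; f is a hypothetical repulsion kernel. The axiom above amounts to assuming that a closed equation of this type tracks the finite population well enough under the learned inputs u_i.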



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: We model each herder as a single integrator... targets... random walkers... Fokker-Planck equation (4)... coupled ODE–PDE system

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: reward function... L2 error between desired and steady-state density profiles

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Leader-Follower Density Control of Multi-Agent Systems with Interacting Followers: Feasibility and Convergence Analysis
eess.SY 2026-04 unverdicted novelty 7.0

Derives necessary and sufficient feasibility conditions for target density in leader-follower systems with follower interactions, plus a locally stabilizing feedback law with explicit basin of attraction.
