pith. sign in

arxiv: 2605.25025 · v1 · pith:YUYSBJCWnew · submitted 2026-05-24 · 💻 cs.RO · cs.SY· eess.SY

Micro-Swarm Locomotion Optimization in Dynamic Flow using Multi-Objective Multi-Agent Reinforcement Learning

Pith reviewed 2026-06-30 00:27 UTC · model grok-4.3

classification 💻 cs.RO cs.SYeess.SY
keywords micro-swarmmulti-agent reinforcement learningcomputational fluid dynamicspulsatile flowPCGradenergy efficiencybiomedical navigationdecentralized PPO
0
0 comments X

The pith

A hybrid CFD and multi-objective multi-agent RL framework lets sixteen micro-robots navigate pulsatile arterial flow while balancing upstream progress, energy efficiency, and motion smoothness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper couples a high-fidelity incompressible Navier-Stokes solver directly to decentralized proximal policy optimization so that agents learn control policies that reconcile three competing objectives inside a single training loop. PCGrad gradient surgery is shown to be required; without it, energy and smoothness rewards collapse while progress oscillates. The resulting policy reaches a progress reward of 6.5–7.0, sustained energy efficiency of 0.63–0.65, and smoothness of 0.97–0.99, outperforming brute-force baselines that produce negative energy efficiency. Three distinct behavioral phases emerge during training: collective hydrodynamic throttling, cycle-synchronized ratcheting during flow reversal, and individualized final approach. These results indicate that time-dependent fluid interactions can be handled inside the reinforcement-learning loop itself rather than through hand-designed rules.

Core claim

Time-dependent fluid-agent interactions can be captured directly within multi-objective reinforcement learning loops; the converged decentralized PPO policy with PCGrad achieves a progress reward of 6.5–7.0, energy efficiency of 0.63–0.65, and smoothness of 0.97–0.99 while navigating a pulsatile arterial waveform, and produces three emergent phases (two-layer throttling formation, cycle-synchronized ratchet, individualized terminal approach) that brute-force baselines cannot match.

What carries the argument

Hybrid CFD-MOMARL loop that couples an incompressible Navier-Stokes solver to decentralized PPO agents whose gradients are reconciled by PCGrad to resolve conflicts among progress, energy, and smoothness rewards.

If this is right

  • Collective two-layer hydrodynamic throttling suppresses peak channel velocities during forward flow.
  • Cycle-synchronized ratchet exploits flow reversals for upstream repositioning without extra energy cost.
  • Individualized final approach emerges once agents near the success boundary.
  • Brute-force baselines produce negative energy efficiency while the learned policy remains positive.
  • Absence of PCGrad causes energy and smoothness rewards to collapse within 10,000 steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same loop could be tested on other time-periodic flows such as respiratory or microfluidic channels to check whether the three-phase emergence is general.
  • If the simulated policies transfer to hardware, the approach could reduce reliance on manual trajectory design for in-vivo micro-robot navigation.
  • The necessity of PCGrad suggests that multi-objective fluid problems systematically produce gradient conflicts that single-objective RL cannot resolve.

Load-bearing premise

The incompressible Navier-Stokes solver accurately represents the physiologically realistic time-dependent fluid environments and the learned policies remain physically consistent without additional real-world validation.

What would settle it

Deploy the trained policy on physical magnetically actuated micro-robots inside a pulsatile-flow channel that matches the simulated waveform and measure whether the observed energy efficiency and smoothness match the simulated values of 0.63–0.65 and 0.97–0.99.

Figures

Figures reproduced from arXiv: 2605.25025 by Josef Berman, Oren Gal.

Figure 2
Figure 2. Figure 2: Schematic of the physical initial setup. the domain is a [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Triphasic temporal velocity profile, representing the rapid systole, [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Architecture of the centralized multi-objective critic. The network [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Per-episode training curves for the three reward objectives of the [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Swarm behavior before policy learning. Snapshots of the x-velocity field (color map, blue: negative; red: positive), swarm member positions (colored [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Swarm behavior after policy learning. Snapshots of the x-velocity field and swarm member positions at twelve successive time points during a [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Ablation study comparing per-episode training curves for the three [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗
read the original abstract

Coordinating micro-robotic swarms in physiologically realistic, time-dependent fluid environments remains an unsolved challenge for biomedical and environmental applications. We present a hybrid Computational Fluid Dynamics - Multi-Objective Multi-Agent Reinforcement Learning framework that directly couples a high-fidelity incompressible Navier-Stokes solver with decentralized proximal policy optimization to learn physically consistent swarm control strategies in oscillatory flow. Sixteen magnetically actuated micro-robots navigate a pulsatile arterial waveform, simultaneously optimizing upstream progression, energy conservation, and motion smoothness, reconciled using PCGrad surgery. Without PCGrad, energy efficiency and smoothness rewards collapse to near zero within 10,000 training steps while progress exhibits persistent large-amplitude oscillations, confirming that gradient conflict resolution is a structural requirement rather than an optional refinement in this domain. The converged policy achieves a progress reward of 6.5-7.0, a sustained energy efficiency of 0.63-0.65, and near-maximum smoothness (0.97-0.99), representing improvements over brute-force baselines on the primary objective while both baselines yield negative energy efficiency throughout. Training reveals three emergent behavioral phases: a collective two-layer hydrodynamic throttling formation that suppresses peak channel velocities during forward flow, a cycle-synchronized ratchet mechanism that exploits flow reversals for upstream repositioning, and an individualized final approach as agents near the success boundary. These results establish that time-dependent fluid-agent interactions can be captured directly within multi-objective reinforcement learning loops, offering a physically grounded paradigm for micro-swarm control in biomedical navigation, environmental monitoring, and industrial microfluidics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a hybrid CFD-MARL framework coupling a high-fidelity incompressible Navier-Stokes solver with decentralized PPO and PCGrad gradient surgery to optimize upstream progression, energy efficiency, and smoothness for 16 magnetically actuated micro-robots in a pulsatile arterial waveform. It reports that the converged policy attains progress rewards of 6.5-7.0, energy efficiency of 0.63-0.65, and smoothness of 0.97-0.99, outperforming brute-force baselines on the primary objective (while baselines yield negative energy efficiency), demonstrates that PCGrad is required to prevent reward collapse, and identifies three emergent phases: two-layer hydrodynamic throttling, cycle-synchronized ratchet repositioning, and individualized final approach.

Significance. If the simulation results are robust, the work demonstrates that multi-objective MARL can directly incorporate time-dependent fluid-agent interactions to produce coordinated swarm behaviors, with the explicit demonstration that PCGrad is structurally necessary (rather than optional) constituting a useful methodological contribution for conflicting-objective domains. The provision of concrete reward ranges and training-step dynamics strengthens reproducibility within simulation studies.

major comments (2)
  1. [Abstract] Abstract: The central claim that the learned policies are 'physically consistent' and that the framework offers a 'physically grounded paradigm' for biomedical navigation rests on the untested premise that the chosen incompressible NS solver with rigid-domain pulsatile waveform faithfully reproduces dominant forces; no sensitivity analysis to compliant-wall FSI, non-Newtonian rheology, or laboratory PIV data is supplied, rendering the reported reward values and emergent formations (two-layer throttling, ratchet) potentially solver-specific artifacts.
  2. [Abstract] Abstract (training results paragraph): The necessity of PCGrad is shown only via collapse within this single fluid model after 10,000 steps; without ablations across varied waveforms, Reynolds numbers, or alternative solvers, it remains unclear whether gradient conflict resolution is a general structural requirement or an artifact of the specific environment and reward scaling.
minor comments (1)
  1. [Abstract] Abstract: The brute-force baseline implementation details (e.g., whether it optimizes the same multi-objective reward or only progress) are not stated, complicating direct comparison of the reported improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and propose targeted revisions to the abstract and discussion to better qualify the scope of our claims while preserving the core contribution of the hybrid CFD-MOMARL framework.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the learned policies are 'physically consistent' and that the framework offers a 'physically grounded paradigm' rests on the untested premise that the chosen incompressible NS solver with rigid-domain pulsatile waveform faithfully reproduces dominant forces; no sensitivity analysis to compliant-wall FSI, non-Newtonian rheology, or laboratory PIV data is supplied, rendering the reported reward values and emergent formations (two-layer throttling, ratchet) potentially solver-specific artifacts.

    Authors: We agree that the phrasing 'physically consistent' and 'physically grounded paradigm' can be read as implying broader physical fidelity than the chosen model supports. In the manuscript these terms denote direct coupling to the incompressible Navier-Stokes equations (as opposed to reduced-order or surrogate dynamics) within a standard rigid-wall Newtonian setup. We will revise the abstract to replace these phrases with 'consistent with the incompressible NS model' and 'simulation-based paradigm,' and we will add a limitations subsection explicitly noting the rigid-wall and Newtonian assumptions together with the absence of FSI, non-Newtonian, or PIV validation. This revision clarifies the intended scope without altering the reported results. revision: yes

  2. Referee: [Abstract] Abstract (training results paragraph): The necessity of PCGrad is shown only via collapse within this single fluid model after 10,000 steps; without ablations across varied waveforms, Reynolds numbers, or alternative solvers, it remains unclear whether gradient conflict resolution is a general structural requirement or an artifact of the specific environment and reward scaling.

    Authors: The PCGrad ablation demonstrates that, in the presence of the three conflicting objectives and the time-varying fluid forces of this environment, gradient surgery is required to avoid reward collapse. The manuscript does not claim universality across all MARL settings; the statement 'structural requirement rather than an optional refinement in this domain' is intended to be environment-specific. We will revise the abstract and the corresponding methods paragraph to read 'required to prevent collapse in this multi-objective fluid-interaction setting' and will add a short discussion paragraph acknowledging that broader ablations across waveforms and solvers would be valuable future work. revision: partial

Circularity Check

0 steps flagged

No circularity detected; results are empirical outputs from RL training

full rationale

The paper reports performance metrics and emergent behaviors obtained by running decentralized PPO with PCGrad inside an external incompressible Navier-Stokes solver, then comparing against separate brute-force baselines. No mathematical derivation chain exists that reduces a claimed prediction or first-principles result to its own inputs by construction. Claims about PCGrad being required are supported by direct training observations (collapse without it), not by self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The framework is self-contained as simulation-based optimization with external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard CFD and RL assumptions; no new free parameters or invented entities explicitly introduced in the abstract. Assessment limited by abstract-only access.

axioms (2)
  • standard math Incompressible Navier-Stokes equations govern the fluid flow
    Used as the high-fidelity solver in the hybrid framework.
  • domain assumption Decentralized proximal policy optimization can learn swarm control strategies
    Core to the MOMARL component.

pith-pipeline@v0.9.1-grok · 5843 in / 1195 out tokens · 80498 ms · 2026-06-30T00:27:58.119345+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

79 extracted references · 16 canonical work pages · 2 internal anchors

  1. [1]

    Life at low reynolds number,

    E. M. Purcell, “Life at low reynolds number,”American Journal of Physics, vol. 45, no. 1, pp. 3–11, 01 1977. [Online]. Available: https://doi.org/10.1119/1.10903

  2. [2]

    Self-induced oscillations in an open water-channel with slotted walls,

    P. L. Betts, “Self-induced oscillations in an open water-channel with slotted walls,”Journal of Fluid Mechanics, vol. 55, no. 3, p. 401–417, 1972

  3. [3]

    Numerical simulation of blood flow in venous networks,

    M. Cermatori, “Numerical simulation of blood flow in venous networks,” Ph.D. dissertation, Imperial College London, 04 2014

  4. [4]

    Versatile microrobotics using simple modular subunits,

    U. K. Cheang, F. Meshkati, H. Kim, K. Lee, H. C. Fu, and M. J. Kim, “Versatile microrobotics using simple modular subunits,” Scientific Reports, vol. 6, no. 1, p. 30472, 2016. [Online]. Available: https://doi.org/10.1038/srep30472

  5. [5]

    The Self-Propulsion of Microscopic Organisms through Liquids,

    G. J. Hancock, “The Self-Propulsion of Microscopic Organisms through Liquids,”Proceedings of the Royal Society of London Series A, vol. 217, no. 1128, pp. 96–121, Mar. 1953

  6. [6]

    Bio-inspired magnetic swimming microrobots for biomedical applications,

    K. E. Peyer, L. Zhang, and B. J. Nelson, “Bio-inspired magnetic swimming microrobots for biomedical applications,”Nanoscale, vol. 5, pp. 1259–1272, 2013. [Online]. Available: http://dx.doi.org/10.1039/ C2NR32554C

  7. [7]

    Recent process in microrobots: From propulsion to swarming for biomedical applications,

    R. Wu, Y . Zhu, X. Cai, S. Wu, L. Xu, and T. Yu, “Recent process in microrobots: From propulsion to swarming for biomedical applications,”Micromachines, vol. 13, no. 9, 2022. [Online]. Available: https://www.mdpi.com/2072-666X/13/9/1473

  8. [8]

    Propulsion mecha- nisms and applications of multiphysics- driven micro- and nanomotors,

    X. Chang, T. Li, D. Zhou, G. Zhang, and L. LI, “Propulsion mecha- nisms and applications of multiphysics- driven micro- and nanomotors,” Chinese Science Bulletin, vol. 62, pp. 122–135, 01 2017

  9. [9]

    Self-learning how to swim at low reynolds number,

    A. C. H. Tsang, P. W. Tong, S. Nallan, and O. S. Pak, “Self-learning how to swim at low reynolds number,”Phys. Rev. Fluids, vol. 5, p. 074101, Jul 2020. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevFluids.5.074101

  10. [10]

    Finite element model of a helical swimming robot in comsol,

    M. A. L ´opez, C. Nuevo-Gallardo, J. E. Traver, I. Tejado, and B. M. Vinagre, “Finite element model of a helical swimming robot in comsol,” inCOMSOL Conference 2024, 2024

  11. [11]

    A survey on swarm microrobotics,

    L. Yang, J. Yu, S. Yang, B. Wang, B. J. Nelson, and L. Zhang, “A survey on swarm microrobotics,”IEEE Transactions on Robotics, vol. 38, no. 3, pp. 1531–1551, 2022

  12. [12]

    Magnetic microrobotic swarms in fluid suspensions,

    H. Chen and J. Yu, “Magnetic microrobotic swarms in fluid suspensions,”Current Robotics Reports, vol. 3, no. 3, pp. 127–137,

  13. [13]

    Available: https://doi.org/10.1007/s43154-022-00085-6

    [Online]. Available: https://doi.org/10.1007/s43154-022-00085-6

  14. [14]

    Navigation of magnetic microrobotic swarms with maintained structural integrity in fluidic flow,

    J. Law, X. Du, W. Tang, J. Yu, and Y . Sun, “Navigation of magnetic microrobotic swarms with maintained structural integrity in fluidic flow,” IEEE/ASME Transactions on Mechatronics, vol. 29, no. 1, pp. 74–84, 2024

  15. [15]

    Opportunities and challenges with autonomous micro aerial vehicles,

    V . Kumar and N. Michael, “Opportunities and challenges with autonomous micro aerial vehicles,”Int. J. Rob. Res., vol. 31, no. 11, p. 1279–1291, Sep. 2012. [Online]. Available: https: //doi.org/10.1177/0278364912455954

  16. [16]

    Oscillatory gravity waves in flowing water,

    T. Sarpkaya, “Oscillatory gravity waves in flowing water,”Transactions of the American Society of Civil Engineers, vol. 122, no. 1, pp. 564–586, 1957. [Online]. Available: https://ascelibrary.org/doi/abs/10. 1061/TACEAT.0007497

  17. [17]

    Information theoretic mpc for model-based reinforcement learning,

    G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou, “Information theoretic mpc for model-based reinforcement learning,” in2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 1714–1721

  18. [18]

    Independent control of multiple magnetic microrobots in three dimensions,

    E. Diller, J. Giltinan, and M. Sitti, “Independent control of multiple magnetic microrobots in three dimensions,”The International Journal of Robotics Research, vol. 32, no. 5, pp. 614–631, 2013

  19. [19]

    Aircraft trajectory planning with collision avoidance using mixed integer linear programming,

    A. Richards and J. P. How, “Aircraft trajectory planning with collision avoidance using mixed integer linear programming,” inProceedings of the 2002 American control conference (IEEE Cat. No. CH37301), vol. 3. IEEE, 2002, pp. 1936–1941

  20. [20]

    Magnetic therapeutic delivery using navigable agents,

    S. Martel, “Magnetic therapeutic delivery using navigable agents,” Therapeutic delivery, vol. 5, no. 2, pp. 189–204, 2014

  21. [21]

    Mesbahi and M

    M. Mesbahi and M. Egerstedt,Graph theoretic methods in multiagent networks. Princeton University Press, 2010

  22. [22]

    Foundations of swarm robotic chemical plume tracing from a fluid dynamics perspective,

    D. F. Spears, D. R. Thayer, and D. V . Zarzhitsky, “Foundations of swarm robotic chemical plume tracing from a fluid dynamics perspective,”International Journal of Intelligent Computing and Cybernetics, vol. 2, no. 4, pp. 745–785, Jan. 2009. [Online]. Available: https://doi.org/10.1108/17563780911005863

  23. [23]

    Disruption of vertical motility by shear triggers formation of thin phytoplankton layers,

    W. M. Durham, J. O. Kessler, and R. Stocker, “Disruption of vertical motility by shear triggers formation of thin phytoplankton layers,” science, vol. 323, no. 5917, pp. 1067–1070, 2009

  24. [24]

    Swarm robotics,

    M. Dorigo, M. Birattari, and M. Brambilla, “Swarm robotics,”Scholar- pedia, vol. 9, no. 1, p. 1463, 2014

  25. [25]

    R. S. Sutton, A. G. Bartoet al.,Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1

  26. [26]

    Pilco: A model-based and data-efficient approach to policy search,

    M. Deisenroth and C. E. Rasmussen, “Pilco: A model-based and data-efficient approach to policy search,” inProceedings of the 28th International Conference on machine learning (ICML-11), 2011, pp. 465–472

  27. [27]

    Dream to Control: Learning Behaviors by Latent Imagination

    D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: Learn- ing behaviors by latent imagination,”arXiv preprint arXiv:1912.01603, 2019

  28. [28]

    Model regularization for stable sample rollouts

    E. Talvitie, “Model regularization for stable sample rollouts.” inUAI, 2014, pp. 780–789

  29. [29]

    Multi-agent reinforce- ment learning: An overview,

    L. Bus ¸oniu, R. Babu ˇska, and B. De Schutter, “Multi-agent reinforce- ment learning: An overview,”Innovations in multi-agent systems and applications-1, pp. 183–221, 2010

  30. [30]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

  31. [31]

    Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,

    T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,” inInternational conference on machine learning. Pmlr, 2018, pp. 1861–1870

  32. [32]

    Controlled gliding and perching through deep-reinforcement-learning,

    G. Novati, L. Mahadevan, and P. Koumoutsakos, “Controlled gliding and perching through deep-reinforcement-learning,”Physical Review Fluids, vol. 4, no. 9, p. 093902, 2019

  33. [33]

    Scaling macroscopic aquatic locomotion,

    M. Gazzola, M. Argentina, and L. Mahadevan, “Scaling macroscopic aquatic locomotion,”Nature Physics, vol. 10, no. 10, pp. 758–761, 2014. 13

  34. [34]

    A comprehensive survey of multiagent reinforcement learning,

    L. Busoniu, R. Babuska, and B. De Schutter, “A comprehensive survey of multiagent reinforcement learning,”IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 2, pp. 156–172, 2008

  35. [35]

    Multi-agent reinforcement learning: A selective overview of theories and algorithms,

    K. Zhang, Z. Yang, and T. Bas ¸ar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,”Handbook of rein- forcement learning and control, pp. 321–384, 2021

  36. [36]

    Multi-agent actor-critic for mixed cooperative-competitive environ- ments,

    R. Lowe, Y . I. Wu, A. Tamar, J. Harb, O. Pieter Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environ- ments,”Advances in neural information processing systems, vol. 30, 2017

  37. [37]

    Counterfactual multi-agent policy gradients,

    J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” inProceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018

  38. [38]

    Multi-agent reinforcement learning: Independent vs. cooper- ative agents,

    M. Tan, “Multi-agent reinforcement learning: Independent vs. cooper- ative agents,” inProceedings of the tenth international conference on machine learning, 1993, pp. 330–337

  39. [39]

    Optimal and approxi- mate q-value functions for decentralized pomdps,

    F. A. Oliehoek, M. T. Spaan, and N. Vlassis, “Optimal and approxi- mate q-value functions for decentralized pomdps,”Journal of Artificial Intelligence Research, vol. 32, pp. 289–353, 2008

  40. [40]

    Monotonic value function factorisation for deep multi- agent reinforcement learning,

    T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and S. Whiteson, “Monotonic value function factorisation for deep multi- agent reinforcement learning,”Journal of Machine Learning Research, vol. 21, no. 178, pp. 1–51, 2020

  41. [41]

    The surprising effectiveness of ppo in cooperative multi-agent games,

    C. Yu, A. Velu, E. Vinitsky, J. Gao, Y . Wang, A. Bayen, and Y . Wu, “The surprising effectiveness of ppo in cooperative multi-agent games,” Advances in neural information processing systems, vol. 35, pp. 24 611– 24 624, 2022

  42. [42]

    Learning multiagent communication with backpropagation,

    S. Sukhbaatar, R. Ferguset al., “Learning multiagent communication with backpropagation,”Advances in neural information processing sys- tems, vol. 29, 2016

  43. [43]

    Graph convolutional reinforce- ment learning,

    J. Jiang, C. Dun, T. Huang, and Z. Lu, “Graph convolutional reinforce- ment learning,”arXiv preprint arXiv:1810.09202, 2018

  44. [44]

    A survey of multi-objective sequential decision-making,

    D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley, “A survey of multi-objective sequential decision-making,”Journal of Artificial Intelligence Research, vol. 48, pp. 67–113, 2013

  45. [45]

    A practical guide to multi-objective reinforcement learning and planning,

    C. F. Hayes, R. R ˘adulescu, E. Bargiacchi, J. K ¨allstr¨om, M. Macfarlane, M. Reymond, T. Verstraeten, L. M. Zintgraf, R. Dazeley, F. Heintz, E. Howley, A. A. Irissappane, P. Mannion, A. Now ´e, G. Ramos, M. Restelli, P. Vamplew, and D. M. Roijers, “A practical guide to multi-objective reinforcement learning and planning,”Autonomous Agents and Multi-Agen...

  46. [46]

    Empirical evaluation methods for multiobjective reinforcement learning algorithms,

    P. Vamplew, R. Dazeley, A. Berry, R. Issabekov, and E. Dekker, “Empirical evaluation methods for multiobjective reinforcement learning algorithms,”Machine learning, vol. 84, no. 1, pp. 51–80, 2011

  47. [47]

    Scalarized multi- objective reinforcement learning: Novel design techniques,

    K. Van Moffaert, M. M. Drugan, and A. Now ´e, “Scalarized multi- objective reinforcement learning: Novel design techniques,” in2013 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL). IEEE, 2013, pp. 191–199

  48. [48]

    Prediction- guided multi-objective reinforcement learning for continuous robot con- trol,

    J. Xu, Y . Tian, P. Ma, D. Rus, S. Sueda, and W. Matusik, “Prediction- guided multi-objective reinforcement learning for continuous robot con- trol,” inInternational conference on machine learning. PMLR, 2020, pp. 10 607–10 616

  49. [49]

    Multi-task learning as multi-objective opti- mization,

    O. Sener and V . Koltun, “Multi-task learning as multi-objective opti- mization,”Advances in neural information processing systems, vol. 31, 2018

  50. [50]

    Gra- dient surgery for multi-task learning,

    T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gra- dient surgery for multi-task learning,”Advances in neural information processing systems, vol. 33, pp. 5824–5836, 2020

  51. [51]

    Scalable multi-objective robot reinforcement learning through gradient conflict resolution,

    H. Munn, B. Tidd, P. B ¨ohm, M. Gallagher, and D. Howard, “Scalable multi-objective robot reinforcement learning through gradient conflict resolution,”arXiv preprint arXiv:2509.14816, 2025

  52. [52]

    Multi-objective multi-agent decision making: a utility-based analysis and survey,

    R. R ˘adulescu, P. Mannion, D. M. Roijers, and A. Now´e, “Multi-objective multi-agent decision making: a utility-based analysis and survey,” Autonomous Agents and Multi-Agent Systems, vol. 34, no. 1, p. 10,

  53. [53]

    Available: https://doi.org/10.1007/s10458-019-09433-x

    [Online]. Available: https://doi.org/10.1007/s10458-019-09433-x

  54. [54]

    Sample-efficient multi-objective learning via generalized policy improvement prioritization,

    L. N. Alegre, A. L. Bazzan, D. M. Roijers, A. Now ´e, and B. C. da Silva, “Sample-efficient multi-objective learning via generalized policy improvement prioritization,”arXiv preprint arXiv:2301.07784, 2023

  55. [55]

    Meta-gradient reinforcement learning with an objective discovered online,

    Z. Xu, H. P. van Hasselt, M. Hessel, J. Oh, S. Singh, and D. Silver, “Meta-gradient reinforcement learning with an objective discovered online,”Advances in Neural Information Processing Systems, vol. 33, pp. 15 254–15 264, 2020

  56. [56]

    Pac: Assisted value factorization with counterfactual predictions in multi-agent reinforcement learning,

    H. Zhou, T. Lan, and V . Aggarwal, “Pac: Assisted value factorization with counterfactual predictions in multi-agent reinforcement learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 15 757–15 769, 2022

  57. [57]

    Multi-objective deep reinforcement learning for mobile edge computing,

    N. Yang, J. Wen, M. Zhang, and M. Tang, “Multi-objective deep reinforcement learning for mobile edge computing,” in2023 21st in- ternational symposium on modeling and optimization in mobile, ad hoc, and wireless networks (WiOpt). IEEE, 2023, pp. 1–8

  58. [58]

    Machine learning for fluid mechanics,

    S. L. Brunton, B. R. Noack, and P. Koumoutsakos, “Machine learning for fluid mechanics,”Annual review of fluid mechanics, vol. 52, no. 1, pp. 477–508, 2020

  59. [59]

    Accelerating deep reinforcement learn- ing strategies of flow control through a multi-environment approach,

    J. Rabault and A. Kuhnle, “Accelerating deep reinforcement learn- ing strategies of flow control through a multi-environment approach,” Physics of Fluids, vol. 31, no. 9, 2019

  60. [60]

    Direct shape optimization through deep reinforcement learning,

    J. Viquerat, J. Rabault, A. Kuhnle, H. Ghraieb, A. Larcher, and E. Hachem, “Direct shape optimization through deep reinforcement learning,”Journal of Computational Physics, vol. 428, p. 110080, 2021

  61. [61]

    Learning to control pdes with differentiable physics,

    P. Holl, V . Koltun, and N. Thuerey, “Learning to control pdes with differentiable physics,”arXiv preprint arXiv:2001.07457, 2020

  62. [62]

    Machine learning–accelerated computational fluid dynamics,

    D. Kochkov, J. A. Smith, A. Alieva, Q. Wang, M. P. Brenner, and S. Hoyer, “Machine learning–accelerated computational fluid dynamics,” Proceedings of the National Academy of Sciences, vol. 118, no. 21, p. e2101784118, 2021

  63. [63]

    Artificial neural networks trained through deep reinforcement learning discover control strategies for active flow control,

    J. Rabault, M. Kuchta, A. Jensen, U. R ´eglade, and N. Cerardi, “Artificial neural networks trained through deep reinforcement learning discover control strategies for active flow control,”Journal of fluid mechanics, vol. 865, pp. 281–302, 2019

  64. [64]

    A review on deep reinforcement learning for fluid me- chanics,

    P. Garnier, J. Viquerat, J. Rabault, A. Larcher, A. Kuhnle, and E. Hachem, “A review on deep reinforcement learning for fluid me- chanics,”Computers & Fluids, vol. 225, p. 104973, 2021

  65. [65]

    Robust flow control and optimal sensor placement using deep reinforcement learning,

    R. Paris, S. Beneddine, and J. Dandois, “Robust flow control and optimal sensor placement using deep reinforcement learning,”Journal of Fluid Mechanics, vol. 913, p. A25, 2021

  66. [66]

    Recent advances in applying deep reinforcement learning for flow control: Perspectives and future directions,

    C. Vignon, J. Rabault, and R. Vinuesa, “Recent advances in applying deep reinforcement learning for flow control: Perspectives and future directions,”Physics of fluids, vol. 35, no. 3, p. 031301, 2023

  67. [67]

    Applying deep reinforcement learning to active flow control in weakly turbulent conditions,

    F. Ren, J. Rabault, and H. Tang, “Applying deep reinforcement learning to active flow control in weakly turbulent conditions,”Physics of Fluids, vol. 33, no. 3, 2021

  68. [68]

    Reinforcement learning of control strategies for reducing skin friction drag in a fully developed turbulent channel flow,

    T. Sonoda, Z. Liu, T. Itoh, and Y . Hasegawa, “Reinforcement learning of control strategies for reducing skin friction drag in a fully developed turbulent channel flow,”Journal of Fluid Mechanics, vol. 960, p. A30, 2023

  69. [69]

    Flow navi- gation by smart microswimmers via reinforcement learning,

    S. Colabrese, K. Gustavsson, A. Celani, and L. Biferale, “Flow navi- gation by smart microswimmers via reinforcement learning,”Physical review letters, vol. 118, no. 15, p. 158004, 2017

  70. [70]

    Zermelo’s problem: optimal point-to-point navigation in 2d turbulent flows using reinforcement learning,

    L. Biferale, F. Bonaccorso, M. Buzzicotti, P. Clark Di Leoni, and K. Gustavsson, “Zermelo’s problem: optimal point-to-point navigation in 2d turbulent flows using reinforcement learning,”Chaos: An Inter- disciplinary Journal of Nonlinear Science, vol. 29, no. 10, 2019

  71. [71]

    Machine learning strategies for path-planning microswimmers in turbulent flows,

    J. K. Alageshan, A. K. Verma, J. Bec, and R. Pandit, “Machine learning strategies for path-planning microswimmers in turbulent flows,”Physical Review E, vol. 101, no. 4, p. 043110, 2020

  72. [72]

    Learning efficient navigation in vortical flow fields,

    P. Gunnarson, I. Mandralis, G. Novati, P. Koumoutsakos, and J. O. Dabiri, “Learning efficient navigation in vortical flow fields,”Nature communications, vol. 12, no. 1, p. 7143, 2021

  73. [73]

    Frequency selection by feedback control in a turbulent shear flow,

    V . Parezanovi´c, L. Cordier, A. Spohn, T. Duriez, B. R. Noack, J.-P. Bon- net, M. Segond, M. Abel, and S. L. Brunton, “Frequency selection by feedback control in a turbulent shear flow,”Journal of Fluid Mechanics, vol. 797, pp. 247–283, 2016

  74. [74]

    Blood flow in arteries,

    D. N. Ku, “Blood flow in arteries,”Annual review of fluid mechanics, vol. 29, no. 1, pp. 399–434, 1997

  75. [75]

    Phiflow: Differentiable simulations for pytorch, tensorflow and jax,

    P. Holl and N. Thuerey, “Phiflow: Differentiable simulations for pytorch, tensorflow and jax,” inInternational Conference on Machine Learning. PMLR, 2024

  76. [76]

    Development of high-efficiency superparamagnetic drug delivery system with mpi imaging capability,

    S. Bai, X.-d. Zhang, Y .-q. Zou, Y .-x. Lin, Z.-y. Liu, K.-w. Li, P. Huang, T. Yoshida, Y .-l. Liu, M.-s. Li, W. Zhang, X.-j. Wang, M. Zhang, and C. Du, “Development of high-efficiency superparamagnetic drug delivery system with mpi imaging capability,” Frontiers in Bioengineering and Biotechnology, vol. V olume 12 - 2024, 2024. [Online]. Available: https...

  77. [77]

    Stable-baselines3: Reliable reinforcement learning implementations,

    A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-baselines3: Reliable reinforcement learning implementations,”Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021. [Online]. Available: http://jmlr.org/papers/v22/ 20-1364.html 14

  78. [78]

    Gradient surgery for multi-task learning, 2020

    T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” 2020. [Online]. Available: https://arxiv.org/abs/2001.06782

  79. [79]

    J. Berman. (2025) Fluxswarm: Micro-swarm locomotion using mo-marl, github repository. GitHub repository. [Online]. Available: https://github.com/josefberman/FluxSwarm