Micro-Swarm Locomotion Optimization in Dynamic Flow using Multi-Objective Multi-Agent Reinforcement Learning

Josef Berman; Oren Gal

arxiv: 2605.25025 · v1 · pith:YUYSBJCWnew · submitted 2026-05-24 · 💻 cs.RO · cs.SY· eess.SY

Micro-Swarm Locomotion Optimization in Dynamic Flow using Multi-Objective Multi-Agent Reinforcement Learning

Josef Berman , Oren Gal This is my paper

Pith reviewed 2026-06-30 00:27 UTC · model grok-4.3

classification 💻 cs.RO cs.SYeess.SY

keywords micro-swarmmulti-agent reinforcement learningcomputational fluid dynamicspulsatile flowPCGradenergy efficiencybiomedical navigationdecentralized PPO

0 comments

The pith

A hybrid CFD and multi-objective multi-agent RL framework lets sixteen micro-robots navigate pulsatile arterial flow while balancing upstream progress, energy efficiency, and motion smoothness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper couples a high-fidelity incompressible Navier-Stokes solver directly to decentralized proximal policy optimization so that agents learn control policies that reconcile three competing objectives inside a single training loop. PCGrad gradient surgery is shown to be required; without it, energy and smoothness rewards collapse while progress oscillates. The resulting policy reaches a progress reward of 6.5–7.0, sustained energy efficiency of 0.63–0.65, and smoothness of 0.97–0.99, outperforming brute-force baselines that produce negative energy efficiency. Three distinct behavioral phases emerge during training: collective hydrodynamic throttling, cycle-synchronized ratcheting during flow reversal, and individualized final approach. These results indicate that time-dependent fluid interactions can be handled inside the reinforcement-learning loop itself rather than through hand-designed rules.

Core claim

Time-dependent fluid-agent interactions can be captured directly within multi-objective reinforcement learning loops; the converged decentralized PPO policy with PCGrad achieves a progress reward of 6.5–7.0, energy efficiency of 0.63–0.65, and smoothness of 0.97–0.99 while navigating a pulsatile arterial waveform, and produces three emergent phases (two-layer throttling formation, cycle-synchronized ratchet, individualized terminal approach) that brute-force baselines cannot match.

What carries the argument

Hybrid CFD-MOMARL loop that couples an incompressible Navier-Stokes solver to decentralized PPO agents whose gradients are reconciled by PCGrad to resolve conflicts among progress, energy, and smoothness rewards.

If this is right

Collective two-layer hydrodynamic throttling suppresses peak channel velocities during forward flow.
Cycle-synchronized ratchet exploits flow reversals for upstream repositioning without extra energy cost.
Individualized final approach emerges once agents near the success boundary.
Brute-force baselines produce negative energy efficiency while the learned policy remains positive.
Absence of PCGrad causes energy and smoothness rewards to collapse within 10,000 steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same loop could be tested on other time-periodic flows such as respiratory or microfluidic channels to check whether the three-phase emergence is general.
If the simulated policies transfer to hardware, the approach could reduce reliance on manual trajectory design for in-vivo micro-robot navigation.
The necessity of PCGrad suggests that multi-objective fluid problems systematically produce gradient conflicts that single-objective RL cannot resolve.

Load-bearing premise

The incompressible Navier-Stokes solver accurately represents the physiologically realistic time-dependent fluid environments and the learned policies remain physically consistent without additional real-world validation.

What would settle it

Deploy the trained policy on physical magnetically actuated micro-robots inside a pulsatile-flow channel that matches the simulated waveform and measure whether the observed energy efficiency and smoothness match the simulated values of 0.63–0.65 and 0.97–0.99.

Figures

Figures reproduced from arXiv: 2605.25025 by Josef Berman, Oren Gal.

**Figure 3.** Figure 3: Triphasic temporal velocity profile, representing the rapid systole, [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 5.** Figure 5: Architecture of the centralized multi-objective critic. The network [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Per-episode training curves for the three reward objectives of the [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Swarm behavior before policy learning. Snapshots of the x-velocity field (color map, blue: negative; red: positive), swarm member positions (colored [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Swarm behavior after policy learning. Snapshots of the x-velocity field and swarm member positions at twelve successive time points during a [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗

**Figure 9.** Figure 9: Ablation study comparing per-episode training curves for the three [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗

read the original abstract

Coordinating micro-robotic swarms in physiologically realistic, time-dependent fluid environments remains an unsolved challenge for biomedical and environmental applications. We present a hybrid Computational Fluid Dynamics - Multi-Objective Multi-Agent Reinforcement Learning framework that directly couples a high-fidelity incompressible Navier-Stokes solver with decentralized proximal policy optimization to learn physically consistent swarm control strategies in oscillatory flow. Sixteen magnetically actuated micro-robots navigate a pulsatile arterial waveform, simultaneously optimizing upstream progression, energy conservation, and motion smoothness, reconciled using PCGrad surgery. Without PCGrad, energy efficiency and smoothness rewards collapse to near zero within 10,000 training steps while progress exhibits persistent large-amplitude oscillations, confirming that gradient conflict resolution is a structural requirement rather than an optional refinement in this domain. The converged policy achieves a progress reward of 6.5-7.0, a sustained energy efficiency of 0.63-0.65, and near-maximum smoothness (0.97-0.99), representing improvements over brute-force baselines on the primary objective while both baselines yield negative energy efficiency throughout. Training reveals three emergent behavioral phases: a collective two-layer hydrodynamic throttling formation that suppresses peak channel velocities during forward flow, a cycle-synchronized ratchet mechanism that exploits flow reversals for upstream repositioning, and an individualized final approach as agents near the success boundary. These results establish that time-dependent fluid-agent interactions can be captured directly within multi-objective reinforcement learning loops, offering a physically grounded paradigm for micro-swarm control in biomedical navigation, environmental monitoring, and industrial microfluidics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper couples CFD with MOMARL plus PCGrad for micro-swarm control in simulated pulsatile flow and shows PCGrad prevents objective collapse, but all results stay inside one rigid-domain solver with no validation.

read the letter

The core contribution is a simulation framework that trains decentralized PPO agents on 16 micro-robots inside an incompressible Navier-Stokes solver driven by an arterial waveform. They optimize three objectives at once and use PCGrad to resolve gradient conflicts. Without it, energy efficiency and smoothness drop to near zero while progress oscillates; with it, the policy reaches progress rewards of 6.5-7.0, energy efficiency around 0.63-0.65, and smoothness of 0.97-0.99, beating the brute-force baselines on the main goal.

What is actually new is the direct loop between the high-fidelity CFD and the multi-agent learner for this oscillatory setting, plus the reported emergent phases: a two-layer throttling formation, a cycle-synchronized ratchet, and individualized final approaches. Showing the training collapse without PCGrad is a useful concrete result.

The soft spot is the complete absence of any check on the fluid model itself. The domain is rigid, the fluid is Newtonian and incompressible, and there is no sensitivity test to wall compliance, non-Newtonian effects, or laboratory PIV data. The claim that the learned strategies are physically consistent therefore rests on an untested premise about the solver. The reported numbers and formations could be artifacts of that specific setup.

This is for researchers already working on RL for micro-robotics or fluid-mediated swarms who want an example of multi-objective handling in simulation. It is not yet a method ready for biomedical claims. I would send it to peer review because the PCGrad necessity is shown clearly and the application is new for the domain, even though the validation gap will require substantial revision.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a hybrid CFD-MARL framework coupling a high-fidelity incompressible Navier-Stokes solver with decentralized PPO and PCGrad gradient surgery to optimize upstream progression, energy efficiency, and smoothness for 16 magnetically actuated micro-robots in a pulsatile arterial waveform. It reports that the converged policy attains progress rewards of 6.5-7.0, energy efficiency of 0.63-0.65, and smoothness of 0.97-0.99, outperforming brute-force baselines on the primary objective (while baselines yield negative energy efficiency), demonstrates that PCGrad is required to prevent reward collapse, and identifies three emergent phases: two-layer hydrodynamic throttling, cycle-synchronized ratchet repositioning, and individualized final approach.

Significance. If the simulation results are robust, the work demonstrates that multi-objective MARL can directly incorporate time-dependent fluid-agent interactions to produce coordinated swarm behaviors, with the explicit demonstration that PCGrad is structurally necessary (rather than optional) constituting a useful methodological contribution for conflicting-objective domains. The provision of concrete reward ranges and training-step dynamics strengthens reproducibility within simulation studies.

major comments (2)

[Abstract] Abstract: The central claim that the learned policies are 'physically consistent' and that the framework offers a 'physically grounded paradigm' for biomedical navigation rests on the untested premise that the chosen incompressible NS solver with rigid-domain pulsatile waveform faithfully reproduces dominant forces; no sensitivity analysis to compliant-wall FSI, non-Newtonian rheology, or laboratory PIV data is supplied, rendering the reported reward values and emergent formations (two-layer throttling, ratchet) potentially solver-specific artifacts.
[Abstract] Abstract (training results paragraph): The necessity of PCGrad is shown only via collapse within this single fluid model after 10,000 steps; without ablations across varied waveforms, Reynolds numbers, or alternative solvers, it remains unclear whether gradient conflict resolution is a general structural requirement or an artifact of the specific environment and reward scaling.

minor comments (1)

[Abstract] Abstract: The brute-force baseline implementation details (e.g., whether it optimizes the same multi-objective reward or only progress) are not stated, complicating direct comparison of the reported improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and propose targeted revisions to the abstract and discussion to better qualify the scope of our claims while preserving the core contribution of the hybrid CFD-MOMARL framework.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that the learned policies are 'physically consistent' and that the framework offers a 'physically grounded paradigm' rests on the untested premise that the chosen incompressible NS solver with rigid-domain pulsatile waveform faithfully reproduces dominant forces; no sensitivity analysis to compliant-wall FSI, non-Newtonian rheology, or laboratory PIV data is supplied, rendering the reported reward values and emergent formations (two-layer throttling, ratchet) potentially solver-specific artifacts.

Authors: We agree that the phrasing 'physically consistent' and 'physically grounded paradigm' can be read as implying broader physical fidelity than the chosen model supports. In the manuscript these terms denote direct coupling to the incompressible Navier-Stokes equations (as opposed to reduced-order or surrogate dynamics) within a standard rigid-wall Newtonian setup. We will revise the abstract to replace these phrases with 'consistent with the incompressible NS model' and 'simulation-based paradigm,' and we will add a limitations subsection explicitly noting the rigid-wall and Newtonian assumptions together with the absence of FSI, non-Newtonian, or PIV validation. This revision clarifies the intended scope without altering the reported results. revision: yes
Referee: [Abstract] Abstract (training results paragraph): The necessity of PCGrad is shown only via collapse within this single fluid model after 10,000 steps; without ablations across varied waveforms, Reynolds numbers, or alternative solvers, it remains unclear whether gradient conflict resolution is a general structural requirement or an artifact of the specific environment and reward scaling.

Authors: The PCGrad ablation demonstrates that, in the presence of the three conflicting objectives and the time-varying fluid forces of this environment, gradient surgery is required to avoid reward collapse. The manuscript does not claim universality across all MARL settings; the statement 'structural requirement rather than an optional refinement in this domain' is intended to be environment-specific. We will revise the abstract and the corresponding methods paragraph to read 'required to prevent collapse in this multi-objective fluid-interaction setting' and will add a short discussion paragraph acknowledging that broader ablations across waveforms and solvers would be valuable future work. revision: partial

Circularity Check

0 steps flagged

No circularity detected; results are empirical outputs from RL training

full rationale

The paper reports performance metrics and emergent behaviors obtained by running decentralized PPO with PCGrad inside an external incompressible Navier-Stokes solver, then comparing against separate brute-force baselines. No mathematical derivation chain exists that reduces a claimed prediction or first-principles result to its own inputs by construction. Claims about PCGrad being required are supported by direct training observations (collapse without it), not by self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The framework is self-contained as simulation-based optimization with external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard CFD and RL assumptions; no new free parameters or invented entities explicitly introduced in the abstract. Assessment limited by abstract-only access.

axioms (2)

standard math Incompressible Navier-Stokes equations govern the fluid flow
Used as the high-fidelity solver in the hybrid framework.
domain assumption Decentralized proximal policy optimization can learn swarm control strategies
Core to the MOMARL component.

pith-pipeline@v0.9.1-grok · 5843 in / 1195 out tokens · 80498 ms · 2026-06-30T00:27:58.119345+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

79 extracted references · 16 canonical work pages · 2 internal anchors

[1]

Life at low reynolds number,

E. M. Purcell, “Life at low reynolds number,”American Journal of Physics, vol. 45, no. 1, pp. 3–11, 01 1977. [Online]. Available: https://doi.org/10.1119/1.10903

work page doi:10.1119/1.10903 1977
[2]

Self-induced oscillations in an open water-channel with slotted walls,

P. L. Betts, “Self-induced oscillations in an open water-channel with slotted walls,”Journal of Fluid Mechanics, vol. 55, no. 3, p. 401–417, 1972

1972
[3]

Numerical simulation of blood flow in venous networks,

M. Cermatori, “Numerical simulation of blood flow in venous networks,” Ph.D. dissertation, Imperial College London, 04 2014

2014
[4]

Versatile microrobotics using simple modular subunits,

U. K. Cheang, F. Meshkati, H. Kim, K. Lee, H. C. Fu, and M. J. Kim, “Versatile microrobotics using simple modular subunits,” Scientific Reports, vol. 6, no. 1, p. 30472, 2016. [Online]. Available: https://doi.org/10.1038/srep30472

work page doi:10.1038/srep30472 2016
[5]

The Self-Propulsion of Microscopic Organisms through Liquids,

G. J. Hancock, “The Self-Propulsion of Microscopic Organisms through Liquids,”Proceedings of the Royal Society of London Series A, vol. 217, no. 1128, pp. 96–121, Mar. 1953

1953
[6]

Bio-inspired magnetic swimming microrobots for biomedical applications,

K. E. Peyer, L. Zhang, and B. J. Nelson, “Bio-inspired magnetic swimming microrobots for biomedical applications,”Nanoscale, vol. 5, pp. 1259–1272, 2013. [Online]. Available: http://dx.doi.org/10.1039/ C2NR32554C

2013
[7]

Recent process in microrobots: From propulsion to swarming for biomedical applications,

R. Wu, Y . Zhu, X. Cai, S. Wu, L. Xu, and T. Yu, “Recent process in microrobots: From propulsion to swarming for biomedical applications,”Micromachines, vol. 13, no. 9, 2022. [Online]. Available: https://www.mdpi.com/2072-666X/13/9/1473

2022
[8]

Propulsion mecha- nisms and applications of multiphysics- driven micro- and nanomotors,

X. Chang, T. Li, D. Zhou, G. Zhang, and L. LI, “Propulsion mecha- nisms and applications of multiphysics- driven micro- and nanomotors,” Chinese Science Bulletin, vol. 62, pp. 122–135, 01 2017

2017
[9]

Self-learning how to swim at low reynolds number,

A. C. H. Tsang, P. W. Tong, S. Nallan, and O. S. Pak, “Self-learning how to swim at low reynolds number,”Phys. Rev. Fluids, vol. 5, p. 074101, Jul 2020. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevFluids.5.074101

work page doi:10.1103/physrevfluids.5.074101 2020
[10]

Finite element model of a helical swimming robot in comsol,

M. A. L ´opez, C. Nuevo-Gallardo, J. E. Traver, I. Tejado, and B. M. Vinagre, “Finite element model of a helical swimming robot in comsol,” inCOMSOL Conference 2024, 2024

2024
[11]

A survey on swarm microrobotics,

L. Yang, J. Yu, S. Yang, B. Wang, B. J. Nelson, and L. Zhang, “A survey on swarm microrobotics,”IEEE Transactions on Robotics, vol. 38, no. 3, pp. 1531–1551, 2022

2022
[12]

Magnetic microrobotic swarms in fluid suspensions,

H. Chen and J. Yu, “Magnetic microrobotic swarms in fluid suspensions,”Current Robotics Reports, vol. 3, no. 3, pp. 127–137,
[13]

Available: https://doi.org/10.1007/s43154-022-00085-6

[Online]. Available: https://doi.org/10.1007/s43154-022-00085-6

work page doi:10.1007/s43154-022-00085-6
[14]

Navigation of magnetic microrobotic swarms with maintained structural integrity in fluidic flow,

J. Law, X. Du, W. Tang, J. Yu, and Y . Sun, “Navigation of magnetic microrobotic swarms with maintained structural integrity in fluidic flow,” IEEE/ASME Transactions on Mechatronics, vol. 29, no. 1, pp. 74–84, 2024

2024
[15]

Opportunities and challenges with autonomous micro aerial vehicles,

V . Kumar and N. Michael, “Opportunities and challenges with autonomous micro aerial vehicles,”Int. J. Rob. Res., vol. 31, no. 11, p. 1279–1291, Sep. 2012. [Online]. Available: https: //doi.org/10.1177/0278364912455954

work page doi:10.1177/0278364912455954 2012
[16]

Oscillatory gravity waves in flowing water,

T. Sarpkaya, “Oscillatory gravity waves in flowing water,”Transactions of the American Society of Civil Engineers, vol. 122, no. 1, pp. 564–586, 1957. [Online]. Available: https://ascelibrary.org/doi/abs/10. 1061/TACEAT.0007497

1957
[17]

Information theoretic mpc for model-based reinforcement learning,

G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou, “Information theoretic mpc for model-based reinforcement learning,” in2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 1714–1721

2017
[18]

Independent control of multiple magnetic microrobots in three dimensions,

E. Diller, J. Giltinan, and M. Sitti, “Independent control of multiple magnetic microrobots in three dimensions,”The International Journal of Robotics Research, vol. 32, no. 5, pp. 614–631, 2013

2013
[19]

Aircraft trajectory planning with collision avoidance using mixed integer linear programming,

A. Richards and J. P. How, “Aircraft trajectory planning with collision avoidance using mixed integer linear programming,” inProceedings of the 2002 American control conference (IEEE Cat. No. CH37301), vol. 3. IEEE, 2002, pp. 1936–1941

2002
[20]

Magnetic therapeutic delivery using navigable agents,

S. Martel, “Magnetic therapeutic delivery using navigable agents,” Therapeutic delivery, vol. 5, no. 2, pp. 189–204, 2014

2014
[21]

Mesbahi and M

M. Mesbahi and M. Egerstedt,Graph theoretic methods in multiagent networks. Princeton University Press, 2010

2010
[22]

Foundations of swarm robotic chemical plume tracing from a fluid dynamics perspective,

D. F. Spears, D. R. Thayer, and D. V . Zarzhitsky, “Foundations of swarm robotic chemical plume tracing from a fluid dynamics perspective,”International Journal of Intelligent Computing and Cybernetics, vol. 2, no. 4, pp. 745–785, Jan. 2009. [Online]. Available: https://doi.org/10.1108/17563780911005863

work page doi:10.1108/17563780911005863 2009
[23]

Disruption of vertical motility by shear triggers formation of thin phytoplankton layers,

W. M. Durham, J. O. Kessler, and R. Stocker, “Disruption of vertical motility by shear triggers formation of thin phytoplankton layers,” science, vol. 323, no. 5917, pp. 1067–1070, 2009

2009
[24]

Swarm robotics,

M. Dorigo, M. Birattari, and M. Brambilla, “Swarm robotics,”Scholar- pedia, vol. 9, no. 1, p. 1463, 2014

2014
[25]

R. S. Sutton, A. G. Bartoet al.,Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1

1998
[26]

Pilco: A model-based and data-efficient approach to policy search,

M. Deisenroth and C. E. Rasmussen, “Pilco: A model-based and data-efficient approach to policy search,” inProceedings of the 28th International Conference on machine learning (ICML-11), 2011, pp. 465–472

2011
[27]

Dream to Control: Learning Behaviors by Latent Imagination

D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: Learn- ing behaviors by latent imagination,”arXiv preprint arXiv:1912.01603, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1912
[28]

Model regularization for stable sample rollouts

E. Talvitie, “Model regularization for stable sample rollouts.” inUAI, 2014, pp. 780–789

2014
[29]

Multi-agent reinforce- ment learning: An overview,

L. Bus ¸oniu, R. Babu ˇska, and B. De Schutter, “Multi-agent reinforce- ment learning: An overview,”Innovations in multi-agent systems and applications-1, pp. 183–221, 2010

2010
[30]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[31]

Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,” inInternational conference on machine learning. Pmlr, 2018, pp. 1861–1870

2018
[32]

Controlled gliding and perching through deep-reinforcement-learning,

G. Novati, L. Mahadevan, and P. Koumoutsakos, “Controlled gliding and perching through deep-reinforcement-learning,”Physical Review Fluids, vol. 4, no. 9, p. 093902, 2019

2019
[33]

Scaling macroscopic aquatic locomotion,

M. Gazzola, M. Argentina, and L. Mahadevan, “Scaling macroscopic aquatic locomotion,”Nature Physics, vol. 10, no. 10, pp. 758–761, 2014. 13

2014
[34]

A comprehensive survey of multiagent reinforcement learning,

L. Busoniu, R. Babuska, and B. De Schutter, “A comprehensive survey of multiagent reinforcement learning,”IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 2, pp. 156–172, 2008

2008
[35]

Multi-agent reinforcement learning: A selective overview of theories and algorithms,

K. Zhang, Z. Yang, and T. Bas ¸ar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,”Handbook of rein- forcement learning and control, pp. 321–384, 2021

2021
[36]

Multi-agent actor-critic for mixed cooperative-competitive environ- ments,

R. Lowe, Y . I. Wu, A. Tamar, J. Harb, O. Pieter Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environ- ments,”Advances in neural information processing systems, vol. 30, 2017

2017
[37]

Counterfactual multi-agent policy gradients,

J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” inProceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018

2018
[38]

Multi-agent reinforcement learning: Independent vs. cooper- ative agents,

M. Tan, “Multi-agent reinforcement learning: Independent vs. cooper- ative agents,” inProceedings of the tenth international conference on machine learning, 1993, pp. 330–337

1993
[39]

Optimal and approxi- mate q-value functions for decentralized pomdps,

F. A. Oliehoek, M. T. Spaan, and N. Vlassis, “Optimal and approxi- mate q-value functions for decentralized pomdps,”Journal of Artificial Intelligence Research, vol. 32, pp. 289–353, 2008

2008
[40]

Monotonic value function factorisation for deep multi- agent reinforcement learning,

T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and S. Whiteson, “Monotonic value function factorisation for deep multi- agent reinforcement learning,”Journal of Machine Learning Research, vol. 21, no. 178, pp. 1–51, 2020

2020
[41]

The surprising effectiveness of ppo in cooperative multi-agent games,

C. Yu, A. Velu, E. Vinitsky, J. Gao, Y . Wang, A. Bayen, and Y . Wu, “The surprising effectiveness of ppo in cooperative multi-agent games,” Advances in neural information processing systems, vol. 35, pp. 24 611– 24 624, 2022

2022
[42]

Learning multiagent communication with backpropagation,

S. Sukhbaatar, R. Ferguset al., “Learning multiagent communication with backpropagation,”Advances in neural information processing sys- tems, vol. 29, 2016

2016
[43]

Graph convolutional reinforce- ment learning,

J. Jiang, C. Dun, T. Huang, and Z. Lu, “Graph convolutional reinforce- ment learning,”arXiv preprint arXiv:1810.09202, 2018

work page arXiv 2018
[44]

A survey of multi-objective sequential decision-making,

D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley, “A survey of multi-objective sequential decision-making,”Journal of Artificial Intelligence Research, vol. 48, pp. 67–113, 2013

2013
[45]

A practical guide to multi-objective reinforcement learning and planning,

C. F. Hayes, R. R ˘adulescu, E. Bargiacchi, J. K ¨allstr¨om, M. Macfarlane, M. Reymond, T. Verstraeten, L. M. Zintgraf, R. Dazeley, F. Heintz, E. Howley, A. A. Irissappane, P. Mannion, A. Now ´e, G. Ramos, M. Restelli, P. Vamplew, and D. M. Roijers, “A practical guide to multi-objective reinforcement learning and planning,”Autonomous Agents and Multi-Agen...

work page doi:10.1007/s10458-022-09552-y 2022
[46]

Empirical evaluation methods for multiobjective reinforcement learning algorithms,

P. Vamplew, R. Dazeley, A. Berry, R. Issabekov, and E. Dekker, “Empirical evaluation methods for multiobjective reinforcement learning algorithms,”Machine learning, vol. 84, no. 1, pp. 51–80, 2011

2011
[47]

Scalarized multi- objective reinforcement learning: Novel design techniques,

K. Van Moffaert, M. M. Drugan, and A. Now ´e, “Scalarized multi- objective reinforcement learning: Novel design techniques,” in2013 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL). IEEE, 2013, pp. 191–199

2013
[48]

Prediction- guided multi-objective reinforcement learning for continuous robot con- trol,

J. Xu, Y . Tian, P. Ma, D. Rus, S. Sueda, and W. Matusik, “Prediction- guided multi-objective reinforcement learning for continuous robot con- trol,” inInternational conference on machine learning. PMLR, 2020, pp. 10 607–10 616

2020
[49]

Multi-task learning as multi-objective opti- mization,

O. Sener and V . Koltun, “Multi-task learning as multi-objective opti- mization,”Advances in neural information processing systems, vol. 31, 2018

2018
[50]

Gra- dient surgery for multi-task learning,

T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gra- dient surgery for multi-task learning,”Advances in neural information processing systems, vol. 33, pp. 5824–5836, 2020

2020
[51]

Scalable multi-objective robot reinforcement learning through gradient conflict resolution,

H. Munn, B. Tidd, P. B ¨ohm, M. Gallagher, and D. Howard, “Scalable multi-objective robot reinforcement learning through gradient conflict resolution,”arXiv preprint arXiv:2509.14816, 2025

work page arXiv 2025
[52]

Multi-objective multi-agent decision making: a utility-based analysis and survey,

R. R ˘adulescu, P. Mannion, D. M. Roijers, and A. Now´e, “Multi-objective multi-agent decision making: a utility-based analysis and survey,” Autonomous Agents and Multi-Agent Systems, vol. 34, no. 1, p. 10,
[53]

Available: https://doi.org/10.1007/s10458-019-09433-x

[Online]. Available: https://doi.org/10.1007/s10458-019-09433-x

work page doi:10.1007/s10458-019-09433-x
[54]

Sample-efficient multi-objective learning via generalized policy improvement prioritization,

L. N. Alegre, A. L. Bazzan, D. M. Roijers, A. Now ´e, and B. C. da Silva, “Sample-efficient multi-objective learning via generalized policy improvement prioritization,”arXiv preprint arXiv:2301.07784, 2023

work page arXiv 2023
[55]

Meta-gradient reinforcement learning with an objective discovered online,

Z. Xu, H. P. van Hasselt, M. Hessel, J. Oh, S. Singh, and D. Silver, “Meta-gradient reinforcement learning with an objective discovered online,”Advances in Neural Information Processing Systems, vol. 33, pp. 15 254–15 264, 2020

2020
[56]

Pac: Assisted value factorization with counterfactual predictions in multi-agent reinforcement learning,

H. Zhou, T. Lan, and V . Aggarwal, “Pac: Assisted value factorization with counterfactual predictions in multi-agent reinforcement learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 15 757–15 769, 2022

2022
[57]

Multi-objective deep reinforcement learning for mobile edge computing,

N. Yang, J. Wen, M. Zhang, and M. Tang, “Multi-objective deep reinforcement learning for mobile edge computing,” in2023 21st in- ternational symposium on modeling and optimization in mobile, ad hoc, and wireless networks (WiOpt). IEEE, 2023, pp. 1–8

2023
[58]

Machine learning for fluid mechanics,

S. L. Brunton, B. R. Noack, and P. Koumoutsakos, “Machine learning for fluid mechanics,”Annual review of fluid mechanics, vol. 52, no. 1, pp. 477–508, 2020

2020
[59]

Accelerating deep reinforcement learn- ing strategies of flow control through a multi-environment approach,

J. Rabault and A. Kuhnle, “Accelerating deep reinforcement learn- ing strategies of flow control through a multi-environment approach,” Physics of Fluids, vol. 31, no. 9, 2019

2019
[60]

Direct shape optimization through deep reinforcement learning,

J. Viquerat, J. Rabault, A. Kuhnle, H. Ghraieb, A. Larcher, and E. Hachem, “Direct shape optimization through deep reinforcement learning,”Journal of Computational Physics, vol. 428, p. 110080, 2021

2021
[61]

Learning to control pdes with differentiable physics,

P. Holl, V . Koltun, and N. Thuerey, “Learning to control pdes with differentiable physics,”arXiv preprint arXiv:2001.07457, 2020

work page arXiv 2001
[62]

Machine learning–accelerated computational fluid dynamics,

D. Kochkov, J. A. Smith, A. Alieva, Q. Wang, M. P. Brenner, and S. Hoyer, “Machine learning–accelerated computational fluid dynamics,” Proceedings of the National Academy of Sciences, vol. 118, no. 21, p. e2101784118, 2021

2021
[63]

Artificial neural networks trained through deep reinforcement learning discover control strategies for active flow control,

J. Rabault, M. Kuchta, A. Jensen, U. R ´eglade, and N. Cerardi, “Artificial neural networks trained through deep reinforcement learning discover control strategies for active flow control,”Journal of fluid mechanics, vol. 865, pp. 281–302, 2019

2019
[64]

A review on deep reinforcement learning for fluid me- chanics,

P. Garnier, J. Viquerat, J. Rabault, A. Larcher, A. Kuhnle, and E. Hachem, “A review on deep reinforcement learning for fluid me- chanics,”Computers & Fluids, vol. 225, p. 104973, 2021

2021
[65]

Robust flow control and optimal sensor placement using deep reinforcement learning,

R. Paris, S. Beneddine, and J. Dandois, “Robust flow control and optimal sensor placement using deep reinforcement learning,”Journal of Fluid Mechanics, vol. 913, p. A25, 2021

2021
[66]

Recent advances in applying deep reinforcement learning for flow control: Perspectives and future directions,

C. Vignon, J. Rabault, and R. Vinuesa, “Recent advances in applying deep reinforcement learning for flow control: Perspectives and future directions,”Physics of fluids, vol. 35, no. 3, p. 031301, 2023

2023
[67]

Applying deep reinforcement learning to active flow control in weakly turbulent conditions,

F. Ren, J. Rabault, and H. Tang, “Applying deep reinforcement learning to active flow control in weakly turbulent conditions,”Physics of Fluids, vol. 33, no. 3, 2021

2021
[68]

Reinforcement learning of control strategies for reducing skin friction drag in a fully developed turbulent channel flow,

T. Sonoda, Z. Liu, T. Itoh, and Y . Hasegawa, “Reinforcement learning of control strategies for reducing skin friction drag in a fully developed turbulent channel flow,”Journal of Fluid Mechanics, vol. 960, p. A30, 2023

2023
[69]

Flow navi- gation by smart microswimmers via reinforcement learning,

S. Colabrese, K. Gustavsson, A. Celani, and L. Biferale, “Flow navi- gation by smart microswimmers via reinforcement learning,”Physical review letters, vol. 118, no. 15, p. 158004, 2017

2017
[70]

Zermelo’s problem: optimal point-to-point navigation in 2d turbulent flows using reinforcement learning,

L. Biferale, F. Bonaccorso, M. Buzzicotti, P. Clark Di Leoni, and K. Gustavsson, “Zermelo’s problem: optimal point-to-point navigation in 2d turbulent flows using reinforcement learning,”Chaos: An Inter- disciplinary Journal of Nonlinear Science, vol. 29, no. 10, 2019

2019
[71]

Machine learning strategies for path-planning microswimmers in turbulent flows,

J. K. Alageshan, A. K. Verma, J. Bec, and R. Pandit, “Machine learning strategies for path-planning microswimmers in turbulent flows,”Physical Review E, vol. 101, no. 4, p. 043110, 2020

2020
[72]

Learning efficient navigation in vortical flow fields,

P. Gunnarson, I. Mandralis, G. Novati, P. Koumoutsakos, and J. O. Dabiri, “Learning efficient navigation in vortical flow fields,”Nature communications, vol. 12, no. 1, p. 7143, 2021

2021
[73]

Frequency selection by feedback control in a turbulent shear flow,

V . Parezanovi´c, L. Cordier, A. Spohn, T. Duriez, B. R. Noack, J.-P. Bon- net, M. Segond, M. Abel, and S. L. Brunton, “Frequency selection by feedback control in a turbulent shear flow,”Journal of Fluid Mechanics, vol. 797, pp. 247–283, 2016

2016
[74]

Blood flow in arteries,

D. N. Ku, “Blood flow in arteries,”Annual review of fluid mechanics, vol. 29, no. 1, pp. 399–434, 1997

1997
[75]

Phiflow: Differentiable simulations for pytorch, tensorflow and jax,

P. Holl and N. Thuerey, “Phiflow: Differentiable simulations for pytorch, tensorflow and jax,” inInternational Conference on Machine Learning. PMLR, 2024

2024
[76]

Development of high-efficiency superparamagnetic drug delivery system with mpi imaging capability,

S. Bai, X.-d. Zhang, Y .-q. Zou, Y .-x. Lin, Z.-y. Liu, K.-w. Li, P. Huang, T. Yoshida, Y .-l. Liu, M.-s. Li, W. Zhang, X.-j. Wang, M. Zhang, and C. Du, “Development of high-efficiency superparamagnetic drug delivery system with mpi imaging capability,” Frontiers in Bioengineering and Biotechnology, vol. V olume 12 - 2024, 2024. [Online]. Available: https...

work page doi:10.3389/fbioe.2024.1382085 2024
[77]

Stable-baselines3: Reliable reinforcement learning implementations,

A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-baselines3: Reliable reinforcement learning implementations,”Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021. [Online]. Available: http://jmlr.org/papers/v22/ 20-1364.html 14

2021
[78]

Gradient surgery for multi-task learning, 2020

T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” 2020. [Online]. Available: https://arxiv.org/abs/2001.06782

work page arXiv 2020
[79]

J. Berman. (2025) Fluxswarm: Micro-swarm locomotion using mo-marl, github repository. GitHub repository. [Online]. Available: https://github.com/josefberman/FluxSwarm

2025

[1] [1]

Life at low reynolds number,

E. M. Purcell, “Life at low reynolds number,”American Journal of Physics, vol. 45, no. 1, pp. 3–11, 01 1977. [Online]. Available: https://doi.org/10.1119/1.10903

work page doi:10.1119/1.10903 1977

[2] [2]

Self-induced oscillations in an open water-channel with slotted walls,

P. L. Betts, “Self-induced oscillations in an open water-channel with slotted walls,”Journal of Fluid Mechanics, vol. 55, no. 3, p. 401–417, 1972

1972

[3] [3]

Numerical simulation of blood flow in venous networks,

M. Cermatori, “Numerical simulation of blood flow in venous networks,” Ph.D. dissertation, Imperial College London, 04 2014

2014

[4] [4]

Versatile microrobotics using simple modular subunits,

U. K. Cheang, F. Meshkati, H. Kim, K. Lee, H. C. Fu, and M. J. Kim, “Versatile microrobotics using simple modular subunits,” Scientific Reports, vol. 6, no. 1, p. 30472, 2016. [Online]. Available: https://doi.org/10.1038/srep30472

work page doi:10.1038/srep30472 2016

[5] [5]

The Self-Propulsion of Microscopic Organisms through Liquids,

G. J. Hancock, “The Self-Propulsion of Microscopic Organisms through Liquids,”Proceedings of the Royal Society of London Series A, vol. 217, no. 1128, pp. 96–121, Mar. 1953

1953

[6] [6]

Bio-inspired magnetic swimming microrobots for biomedical applications,

K. E. Peyer, L. Zhang, and B. J. Nelson, “Bio-inspired magnetic swimming microrobots for biomedical applications,”Nanoscale, vol. 5, pp. 1259–1272, 2013. [Online]. Available: http://dx.doi.org/10.1039/ C2NR32554C

2013

[7] [7]

Recent process in microrobots: From propulsion to swarming for biomedical applications,

R. Wu, Y . Zhu, X. Cai, S. Wu, L. Xu, and T. Yu, “Recent process in microrobots: From propulsion to swarming for biomedical applications,”Micromachines, vol. 13, no. 9, 2022. [Online]. Available: https://www.mdpi.com/2072-666X/13/9/1473

2022

[8] [8]

Propulsion mecha- nisms and applications of multiphysics- driven micro- and nanomotors,

X. Chang, T. Li, D. Zhou, G. Zhang, and L. LI, “Propulsion mecha- nisms and applications of multiphysics- driven micro- and nanomotors,” Chinese Science Bulletin, vol. 62, pp. 122–135, 01 2017

2017

[9] [9]

Self-learning how to swim at low reynolds number,

A. C. H. Tsang, P. W. Tong, S. Nallan, and O. S. Pak, “Self-learning how to swim at low reynolds number,”Phys. Rev. Fluids, vol. 5, p. 074101, Jul 2020. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevFluids.5.074101

work page doi:10.1103/physrevfluids.5.074101 2020

[10] [10]

Finite element model of a helical swimming robot in comsol,

M. A. L ´opez, C. Nuevo-Gallardo, J. E. Traver, I. Tejado, and B. M. Vinagre, “Finite element model of a helical swimming robot in comsol,” inCOMSOL Conference 2024, 2024

2024

[11] [11]

A survey on swarm microrobotics,

L. Yang, J. Yu, S. Yang, B. Wang, B. J. Nelson, and L. Zhang, “A survey on swarm microrobotics,”IEEE Transactions on Robotics, vol. 38, no. 3, pp. 1531–1551, 2022

2022

[12] [12]

Magnetic microrobotic swarms in fluid suspensions,

H. Chen and J. Yu, “Magnetic microrobotic swarms in fluid suspensions,”Current Robotics Reports, vol. 3, no. 3, pp. 127–137,

[13] [13]

Available: https://doi.org/10.1007/s43154-022-00085-6

[Online]. Available: https://doi.org/10.1007/s43154-022-00085-6

work page doi:10.1007/s43154-022-00085-6

[14] [14]

Navigation of magnetic microrobotic swarms with maintained structural integrity in fluidic flow,

J. Law, X. Du, W. Tang, J. Yu, and Y . Sun, “Navigation of magnetic microrobotic swarms with maintained structural integrity in fluidic flow,” IEEE/ASME Transactions on Mechatronics, vol. 29, no. 1, pp. 74–84, 2024

2024

[15] [15]

Opportunities and challenges with autonomous micro aerial vehicles,

V . Kumar and N. Michael, “Opportunities and challenges with autonomous micro aerial vehicles,”Int. J. Rob. Res., vol. 31, no. 11, p. 1279–1291, Sep. 2012. [Online]. Available: https: //doi.org/10.1177/0278364912455954

work page doi:10.1177/0278364912455954 2012

[16] [16]

Oscillatory gravity waves in flowing water,

T. Sarpkaya, “Oscillatory gravity waves in flowing water,”Transactions of the American Society of Civil Engineers, vol. 122, no. 1, pp. 564–586, 1957. [Online]. Available: https://ascelibrary.org/doi/abs/10. 1061/TACEAT.0007497

1957

[17] [17]

Information theoretic mpc for model-based reinforcement learning,

G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou, “Information theoretic mpc for model-based reinforcement learning,” in2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 1714–1721

2017

[18] [18]

Independent control of multiple magnetic microrobots in three dimensions,

E. Diller, J. Giltinan, and M. Sitti, “Independent control of multiple magnetic microrobots in three dimensions,”The International Journal of Robotics Research, vol. 32, no. 5, pp. 614–631, 2013

2013

[19] [19]

Aircraft trajectory planning with collision avoidance using mixed integer linear programming,

A. Richards and J. P. How, “Aircraft trajectory planning with collision avoidance using mixed integer linear programming,” inProceedings of the 2002 American control conference (IEEE Cat. No. CH37301), vol. 3. IEEE, 2002, pp. 1936–1941

2002

[20] [20]

Magnetic therapeutic delivery using navigable agents,

S. Martel, “Magnetic therapeutic delivery using navigable agents,” Therapeutic delivery, vol. 5, no. 2, pp. 189–204, 2014

2014

[21] [21]

Mesbahi and M

M. Mesbahi and M. Egerstedt,Graph theoretic methods in multiagent networks. Princeton University Press, 2010

2010

[22] [22]

Foundations of swarm robotic chemical plume tracing from a fluid dynamics perspective,

D. F. Spears, D. R. Thayer, and D. V . Zarzhitsky, “Foundations of swarm robotic chemical plume tracing from a fluid dynamics perspective,”International Journal of Intelligent Computing and Cybernetics, vol. 2, no. 4, pp. 745–785, Jan. 2009. [Online]. Available: https://doi.org/10.1108/17563780911005863

work page doi:10.1108/17563780911005863 2009

[23] [23]

Disruption of vertical motility by shear triggers formation of thin phytoplankton layers,

W. M. Durham, J. O. Kessler, and R. Stocker, “Disruption of vertical motility by shear triggers formation of thin phytoplankton layers,” science, vol. 323, no. 5917, pp. 1067–1070, 2009

2009

[24] [24]

Swarm robotics,

M. Dorigo, M. Birattari, and M. Brambilla, “Swarm robotics,”Scholar- pedia, vol. 9, no. 1, p. 1463, 2014

2014

[25] [25]

R. S. Sutton, A. G. Bartoet al.,Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1

1998

[26] [26]

Pilco: A model-based and data-efficient approach to policy search,

M. Deisenroth and C. E. Rasmussen, “Pilco: A model-based and data-efficient approach to policy search,” inProceedings of the 28th International Conference on machine learning (ICML-11), 2011, pp. 465–472

2011

[27] [27]

Dream to Control: Learning Behaviors by Latent Imagination

D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: Learn- ing behaviors by latent imagination,”arXiv preprint arXiv:1912.01603, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1912

[28] [28]

Model regularization for stable sample rollouts

E. Talvitie, “Model regularization for stable sample rollouts.” inUAI, 2014, pp. 780–789

2014

[29] [29]

Multi-agent reinforce- ment learning: An overview,

L. Bus ¸oniu, R. Babu ˇska, and B. De Schutter, “Multi-agent reinforce- ment learning: An overview,”Innovations in multi-agent systems and applications-1, pp. 183–221, 2010

2010

[30] [30]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[31] [31]

Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,” inInternational conference on machine learning. Pmlr, 2018, pp. 1861–1870

2018

[32] [32]

Controlled gliding and perching through deep-reinforcement-learning,

G. Novati, L. Mahadevan, and P. Koumoutsakos, “Controlled gliding and perching through deep-reinforcement-learning,”Physical Review Fluids, vol. 4, no. 9, p. 093902, 2019

2019

[33] [33]

Scaling macroscopic aquatic locomotion,

M. Gazzola, M. Argentina, and L. Mahadevan, “Scaling macroscopic aquatic locomotion,”Nature Physics, vol. 10, no. 10, pp. 758–761, 2014. 13

2014

[34] [34]

A comprehensive survey of multiagent reinforcement learning,

L. Busoniu, R. Babuska, and B. De Schutter, “A comprehensive survey of multiagent reinforcement learning,”IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 2, pp. 156–172, 2008

2008

[35] [35]

Multi-agent reinforcement learning: A selective overview of theories and algorithms,

K. Zhang, Z. Yang, and T. Bas ¸ar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,”Handbook of rein- forcement learning and control, pp. 321–384, 2021

2021

[36] [36]

Multi-agent actor-critic for mixed cooperative-competitive environ- ments,

R. Lowe, Y . I. Wu, A. Tamar, J. Harb, O. Pieter Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environ- ments,”Advances in neural information processing systems, vol. 30, 2017

2017

[37] [37]

Counterfactual multi-agent policy gradients,

J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” inProceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018

2018

[38] [38]

Multi-agent reinforcement learning: Independent vs. cooper- ative agents,

M. Tan, “Multi-agent reinforcement learning: Independent vs. cooper- ative agents,” inProceedings of the tenth international conference on machine learning, 1993, pp. 330–337

1993

[39] [39]

Optimal and approxi- mate q-value functions for decentralized pomdps,

F. A. Oliehoek, M. T. Spaan, and N. Vlassis, “Optimal and approxi- mate q-value functions for decentralized pomdps,”Journal of Artificial Intelligence Research, vol. 32, pp. 289–353, 2008

2008

[40] [40]

Monotonic value function factorisation for deep multi- agent reinforcement learning,

T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and S. Whiteson, “Monotonic value function factorisation for deep multi- agent reinforcement learning,”Journal of Machine Learning Research, vol. 21, no. 178, pp. 1–51, 2020

2020

[41] [41]

The surprising effectiveness of ppo in cooperative multi-agent games,

C. Yu, A. Velu, E. Vinitsky, J. Gao, Y . Wang, A. Bayen, and Y . Wu, “The surprising effectiveness of ppo in cooperative multi-agent games,” Advances in neural information processing systems, vol. 35, pp. 24 611– 24 624, 2022

2022

[42] [42]

Learning multiagent communication with backpropagation,

S. Sukhbaatar, R. Ferguset al., “Learning multiagent communication with backpropagation,”Advances in neural information processing sys- tems, vol. 29, 2016

2016

[43] [43]

Graph convolutional reinforce- ment learning,

J. Jiang, C. Dun, T. Huang, and Z. Lu, “Graph convolutional reinforce- ment learning,”arXiv preprint arXiv:1810.09202, 2018

work page arXiv 2018

[44] [44]

A survey of multi-objective sequential decision-making,

D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley, “A survey of multi-objective sequential decision-making,”Journal of Artificial Intelligence Research, vol. 48, pp. 67–113, 2013

2013

[45] [45]

A practical guide to multi-objective reinforcement learning and planning,

C. F. Hayes, R. R ˘adulescu, E. Bargiacchi, J. K ¨allstr¨om, M. Macfarlane, M. Reymond, T. Verstraeten, L. M. Zintgraf, R. Dazeley, F. Heintz, E. Howley, A. A. Irissappane, P. Mannion, A. Now ´e, G. Ramos, M. Restelli, P. Vamplew, and D. M. Roijers, “A practical guide to multi-objective reinforcement learning and planning,”Autonomous Agents and Multi-Agen...

work page doi:10.1007/s10458-022-09552-y 2022

[46] [46]

Empirical evaluation methods for multiobjective reinforcement learning algorithms,

P. Vamplew, R. Dazeley, A. Berry, R. Issabekov, and E. Dekker, “Empirical evaluation methods for multiobjective reinforcement learning algorithms,”Machine learning, vol. 84, no. 1, pp. 51–80, 2011

2011

[47] [47]

Scalarized multi- objective reinforcement learning: Novel design techniques,

K. Van Moffaert, M. M. Drugan, and A. Now ´e, “Scalarized multi- objective reinforcement learning: Novel design techniques,” in2013 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL). IEEE, 2013, pp. 191–199

2013

[48] [48]

Prediction- guided multi-objective reinforcement learning for continuous robot con- trol,

J. Xu, Y . Tian, P. Ma, D. Rus, S. Sueda, and W. Matusik, “Prediction- guided multi-objective reinforcement learning for continuous robot con- trol,” inInternational conference on machine learning. PMLR, 2020, pp. 10 607–10 616

2020

[49] [49]

Multi-task learning as multi-objective opti- mization,

O. Sener and V . Koltun, “Multi-task learning as multi-objective opti- mization,”Advances in neural information processing systems, vol. 31, 2018

2018

[50] [50]

Gra- dient surgery for multi-task learning,

T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gra- dient surgery for multi-task learning,”Advances in neural information processing systems, vol. 33, pp. 5824–5836, 2020

2020

[51] [51]

Scalable multi-objective robot reinforcement learning through gradient conflict resolution,

H. Munn, B. Tidd, P. B ¨ohm, M. Gallagher, and D. Howard, “Scalable multi-objective robot reinforcement learning through gradient conflict resolution,”arXiv preprint arXiv:2509.14816, 2025

work page arXiv 2025

[52] [52]

Multi-objective multi-agent decision making: a utility-based analysis and survey,

R. R ˘adulescu, P. Mannion, D. M. Roijers, and A. Now´e, “Multi-objective multi-agent decision making: a utility-based analysis and survey,” Autonomous Agents and Multi-Agent Systems, vol. 34, no. 1, p. 10,

[53] [53]

Available: https://doi.org/10.1007/s10458-019-09433-x

[Online]. Available: https://doi.org/10.1007/s10458-019-09433-x

work page doi:10.1007/s10458-019-09433-x

[54] [54]

Sample-efficient multi-objective learning via generalized policy improvement prioritization,

L. N. Alegre, A. L. Bazzan, D. M. Roijers, A. Now ´e, and B. C. da Silva, “Sample-efficient multi-objective learning via generalized policy improvement prioritization,”arXiv preprint arXiv:2301.07784, 2023

work page arXiv 2023

[55] [55]

Meta-gradient reinforcement learning with an objective discovered online,

Z. Xu, H. P. van Hasselt, M. Hessel, J. Oh, S. Singh, and D. Silver, “Meta-gradient reinforcement learning with an objective discovered online,”Advances in Neural Information Processing Systems, vol. 33, pp. 15 254–15 264, 2020

2020

[56] [56]

Pac: Assisted value factorization with counterfactual predictions in multi-agent reinforcement learning,

H. Zhou, T. Lan, and V . Aggarwal, “Pac: Assisted value factorization with counterfactual predictions in multi-agent reinforcement learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 15 757–15 769, 2022

2022

[57] [57]

Multi-objective deep reinforcement learning for mobile edge computing,

N. Yang, J. Wen, M. Zhang, and M. Tang, “Multi-objective deep reinforcement learning for mobile edge computing,” in2023 21st in- ternational symposium on modeling and optimization in mobile, ad hoc, and wireless networks (WiOpt). IEEE, 2023, pp. 1–8

2023

[58] [58]

Machine learning for fluid mechanics,

S. L. Brunton, B. R. Noack, and P. Koumoutsakos, “Machine learning for fluid mechanics,”Annual review of fluid mechanics, vol. 52, no. 1, pp. 477–508, 2020

2020

[59] [59]

Accelerating deep reinforcement learn- ing strategies of flow control through a multi-environment approach,

J. Rabault and A. Kuhnle, “Accelerating deep reinforcement learn- ing strategies of flow control through a multi-environment approach,” Physics of Fluids, vol. 31, no. 9, 2019

2019

[60] [60]

Direct shape optimization through deep reinforcement learning,

J. Viquerat, J. Rabault, A. Kuhnle, H. Ghraieb, A. Larcher, and E. Hachem, “Direct shape optimization through deep reinforcement learning,”Journal of Computational Physics, vol. 428, p. 110080, 2021

2021

[61] [61]

Learning to control pdes with differentiable physics,

P. Holl, V . Koltun, and N. Thuerey, “Learning to control pdes with differentiable physics,”arXiv preprint arXiv:2001.07457, 2020

work page arXiv 2001

[62] [62]

Machine learning–accelerated computational fluid dynamics,

D. Kochkov, J. A. Smith, A. Alieva, Q. Wang, M. P. Brenner, and S. Hoyer, “Machine learning–accelerated computational fluid dynamics,” Proceedings of the National Academy of Sciences, vol. 118, no. 21, p. e2101784118, 2021

2021

[63] [63]

Artificial neural networks trained through deep reinforcement learning discover control strategies for active flow control,

J. Rabault, M. Kuchta, A. Jensen, U. R ´eglade, and N. Cerardi, “Artificial neural networks trained through deep reinforcement learning discover control strategies for active flow control,”Journal of fluid mechanics, vol. 865, pp. 281–302, 2019

2019

[64] [64]

A review on deep reinforcement learning for fluid me- chanics,

P. Garnier, J. Viquerat, J. Rabault, A. Larcher, A. Kuhnle, and E. Hachem, “A review on deep reinforcement learning for fluid me- chanics,”Computers & Fluids, vol. 225, p. 104973, 2021

2021

[65] [65]

Robust flow control and optimal sensor placement using deep reinforcement learning,

R. Paris, S. Beneddine, and J. Dandois, “Robust flow control and optimal sensor placement using deep reinforcement learning,”Journal of Fluid Mechanics, vol. 913, p. A25, 2021

2021

[66] [66]

Recent advances in applying deep reinforcement learning for flow control: Perspectives and future directions,

C. Vignon, J. Rabault, and R. Vinuesa, “Recent advances in applying deep reinforcement learning for flow control: Perspectives and future directions,”Physics of fluids, vol. 35, no. 3, p. 031301, 2023

2023

[67] [67]

Applying deep reinforcement learning to active flow control in weakly turbulent conditions,

F. Ren, J. Rabault, and H. Tang, “Applying deep reinforcement learning to active flow control in weakly turbulent conditions,”Physics of Fluids, vol. 33, no. 3, 2021

2021

[68] [68]

Reinforcement learning of control strategies for reducing skin friction drag in a fully developed turbulent channel flow,

T. Sonoda, Z. Liu, T. Itoh, and Y . Hasegawa, “Reinforcement learning of control strategies for reducing skin friction drag in a fully developed turbulent channel flow,”Journal of Fluid Mechanics, vol. 960, p. A30, 2023

2023

[69] [69]

Flow navi- gation by smart microswimmers via reinforcement learning,

S. Colabrese, K. Gustavsson, A. Celani, and L. Biferale, “Flow navi- gation by smart microswimmers via reinforcement learning,”Physical review letters, vol. 118, no. 15, p. 158004, 2017

2017

[70] [70]

Zermelo’s problem: optimal point-to-point navigation in 2d turbulent flows using reinforcement learning,

L. Biferale, F. Bonaccorso, M. Buzzicotti, P. Clark Di Leoni, and K. Gustavsson, “Zermelo’s problem: optimal point-to-point navigation in 2d turbulent flows using reinforcement learning,”Chaos: An Inter- disciplinary Journal of Nonlinear Science, vol. 29, no. 10, 2019

2019

[71] [71]

Machine learning strategies for path-planning microswimmers in turbulent flows,

J. K. Alageshan, A. K. Verma, J. Bec, and R. Pandit, “Machine learning strategies for path-planning microswimmers in turbulent flows,”Physical Review E, vol. 101, no. 4, p. 043110, 2020

2020

[72] [72]

Learning efficient navigation in vortical flow fields,

P. Gunnarson, I. Mandralis, G. Novati, P. Koumoutsakos, and J. O. Dabiri, “Learning efficient navigation in vortical flow fields,”Nature communications, vol. 12, no. 1, p. 7143, 2021

2021

[73] [73]

Frequency selection by feedback control in a turbulent shear flow,

V . Parezanovi´c, L. Cordier, A. Spohn, T. Duriez, B. R. Noack, J.-P. Bon- net, M. Segond, M. Abel, and S. L. Brunton, “Frequency selection by feedback control in a turbulent shear flow,”Journal of Fluid Mechanics, vol. 797, pp. 247–283, 2016

2016

[74] [74]

Blood flow in arteries,

D. N. Ku, “Blood flow in arteries,”Annual review of fluid mechanics, vol. 29, no. 1, pp. 399–434, 1997

1997

[75] [75]

Phiflow: Differentiable simulations for pytorch, tensorflow and jax,

P. Holl and N. Thuerey, “Phiflow: Differentiable simulations for pytorch, tensorflow and jax,” inInternational Conference on Machine Learning. PMLR, 2024

2024

[76] [76]

Development of high-efficiency superparamagnetic drug delivery system with mpi imaging capability,

S. Bai, X.-d. Zhang, Y .-q. Zou, Y .-x. Lin, Z.-y. Liu, K.-w. Li, P. Huang, T. Yoshida, Y .-l. Liu, M.-s. Li, W. Zhang, X.-j. Wang, M. Zhang, and C. Du, “Development of high-efficiency superparamagnetic drug delivery system with mpi imaging capability,” Frontiers in Bioengineering and Biotechnology, vol. V olume 12 - 2024, 2024. [Online]. Available: https...

work page doi:10.3389/fbioe.2024.1382085 2024

[77] [77]

Stable-baselines3: Reliable reinforcement learning implementations,

A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-baselines3: Reliable reinforcement learning implementations,”Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021. [Online]. Available: http://jmlr.org/papers/v22/ 20-1364.html 14

2021

[78] [78]

Gradient surgery for multi-task learning, 2020

T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” 2020. [Online]. Available: https://arxiv.org/abs/2001.06782

work page arXiv 2020

[79] [79]

J. Berman. (2025) Fluxswarm: Micro-swarm locomotion using mo-marl, github repository. GitHub repository. [Online]. Available: https://github.com/josefberman/FluxSwarm

2025