Memory-Efficient Meta-Reinforcement Learning for Adaptive Safety-Critical Control in Adversarial Spacecraft Proximity Operations
Pith reviewed 2026-06-27 01:56 UTC · model grok-4.3
The pith
Mamba with PPO outperforms LSTM and GRU in meta-RL for learning safety functions in adversarial spacecraft rendezvous.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that selective state space models such as Mamba, trained with proximal policy optimization (PPO), are the strongest of the tested configurations for learning, via meta-reinforcement learning, the class-K functions that define the input-constrained control barrier function (ICCBF) recursion. Relative to long short-term memory (LSTM) and gated recurrent unit (GRU) networks trained with either PPO or soft actor-critic (SAC), they achieve superior task completion, safety maintenance, and fuel savings in both cooperative and adversarial spacecraft proximity operation scenarios.
What carries the argument
Meta-RL training of recurrent networks to parameterize class-K functions inside the ICCBF forward-invariance recursion, evaluated by task success, safety constraint satisfaction, and propellant consumption under adversarial target motion.
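To make the recursion concrete, here is a minimal Python sketch for a toy 1D double integrator (position p, velocity v, bounded acceleration |u| <= u_max) with safe set p >= 0. It assumes linear class-K functions alpha_i(r) = k_i * r, where the gains k0 and k1 stand in for the quantities a meta-RL policy would output. The toy dynamics, the linear form, and every name below are assumptions for illustration and are not taken from the paper.

```python
def iccbf_values(p, v, k0, k1, u_max):
    """Evaluate b0, b1, b2 for the toy system p_dot = v, v_dot = u, |u| <= u_max.

    Recursion sketched here:
        b_{i+1}(x) = inf_u [L_f b_i(x) + L_g b_i(x) * u] + alpha_i(b_i(x)),
    with hypothetical linear class-K functions alpha_i(r) = k_i * r.
    """
    # b0 is the original safety function h(x) = p.
    b0 = p
    # L_f b0 = v and L_g b0 = 0, so the input bound does not enter yet.
    b1 = v + k0 * b0
    # For b1 = v + k0 * p: L_f b1 = k0 * v and L_g b1 = 1,
    # so the worst-case input over |u| <= u_max contributes -u_max.
    b2 = k0 * v - u_max + k1 * b1
    return b0, b1, b2


def in_inner_safe_set(p, v, k0, k1, u_max):
    """True when the state lies in the intersection of {b_i >= 0}."""
    return all(b >= 0.0 for b in iccbf_values(p, v, k0, k1, u_max))


# Illustrative states only (metres, metres per second are assumed units).
far_and_slow = in_inner_safe_set(10.0, -1.0, k0=0.5, k1=1.0, u_max=0.2)
close_and_fast = in_inner_safe_set(1.0, -2.0, k0=0.5, k1=1.0, u_max=0.2)
```

In this toy case a larger k0 admits a faster approach at a given distance, while a smaller k0 is more conservative. That trade-off between safety margin and performance is the kind of quantity the learned class-K functions would tune.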
If this is right
- Controllers for rendezvous can maintain safety margins against uncooperative targets while using less fuel than current recurrent baselines.
- The same meta-RL pipeline can be applied to other nonlinear systems that require input-constrained safety filters.
- State-space-model-based policies reduce memory footprint during online adaptation compared with LSTM or GRU equivalents.
- Safety-critical meta-RL becomes viable for missions where the target may actively degrade the chaser's feasible set.
Where Pith is reading between the lines
- Onboard spacecraft computers with limited RAM could run these policies in real time where LSTM versions would exceed memory limits.
- The performance edge may allow shorter meta-training episodes, lowering the computational cost of adapting to new orbital regimes.
- Similar architecture comparisons could be run for other safety-filtered control problems such as autonomous underwater vehicles or aerial collision avoidance.
Load-bearing premise
The simulation environments and selected adversarial behaviors are representative enough that performance gaps seen in training will appear under real spacecraft dynamics and disturbances.
What would settle it
A hardware-in-the-loop experiment or on-orbit test in which the chaser encounters a target with unmodeled dynamics or different adversarial tactics would show whether the reported gains in completion, safety, and fuel use persist.
Original abstract
Autonomous spacecraft rendezvous and proximity operations (RPO) require controllers that guarantee safety under thrust constraints while minimizing fuel expenditure. Input-constrained control barrier functions (ICCBFs) provide a control method for nonlinear systems with actuation constraints that constructs a forward-invariant safe set. Previous work has shown that learning class-$\mathcal{K}$ functions defining the ICCBF recursion via meta-reinforcement learning (meta-RL) yields a robust, non-greedy approach to safety-critical control in RPO. This paper extends that framework by investigating the performance of three recurrent network architectures (Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Selective State Space Model (Mamba)) and two training algorithms (Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC)) to identify the best setup for tuning ICCBF class-$\mathcal{K}$ functions via meta-RL. In addition to cooperative test cases, performance is evaluated in the presence of adversarial behavior where the target spacecraft behaves in a way that worsens the safety of the chaser spacecraft. Results indicate that state space models such as Mamba, when used with PPO, achieve superior task completion, safety, and fuel savings compared to other architectures, across all cooperative and uncooperative scenarios tested.
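As a rough illustration of what a barrier-function safety filter does under actuation limits, the sketch below handles the simplest case: a scalar input u, one affine safety constraint a*u + c >= 0, and a thrust bound |u| <= u_max. In that case the filtered input that stays closest to a nominal command has a closed form. The function name, the scalar setting, and the coefficients a and c are assumptions for illustration. A multi-input spacecraft problem would normally be posed as a quadratic program, and nothing here is taken from the paper's implementation.

```python
def filter_scalar_input(u_nom, a, c, u_max):
    """Return the input closest to u_nom that satisfies
    a * u + c >= 0 and |u| <= u_max, or None if no such input exists."""
    lo, hi = -u_max, u_max
    if a > 0:
        # Constraint reads u >= -c / a.
        lo = max(lo, -c / a)
    elif a < 0:
        # Dividing by a negative number flips the inequality: u <= -c / a.
        hi = min(hi, -c / a)
    elif c < 0:
        # a == 0: the input cannot influence the constraint, and it is violated.
        return None
    if lo > hi:
        # The safety constraint and the thrust bound are incompatible.
        return None
    # Clamp the nominal command into the feasible interval.
    return min(max(u_nom, lo), hi)


# Hypothetical numbers: the nominal command brakes too little, so it is raised.
u_safe = filter_scalar_input(u_nom=-0.5, a=1.0, c=0.1, u_max=0.2)
# Here the constraint demands more thrust than is available.
u_infeasible = filter_scalar_input(u_nom=0.0, a=1.0, c=-1.0, u_max=0.2)
```

The None branch is the failure mode that input-constrained formulations are meant to rule out: the safe set is shrunk in advance so that, inside it, an admissible input satisfying the constraint should always exist.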
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript extends prior meta-RL work on learning class-K functions for Input-Constrained Control Barrier Functions (ICCBFs) in spacecraft rendezvous and proximity operations. It empirically compares three recurrent architectures (LSTM, GRU, Mamba) paired with PPO and SAC across cooperative and adversarial target behaviors, reporting that Mamba+PPO yields the highest task completion rates, safety margins, and fuel efficiency.
Significance. If the reported rankings hold under the stated experimental conditions, the results supply actionable guidance on architecture selection for meta-RL safety filters in aerospace control. The explicit inclusion of adversarial scenarios and the focus on fuel-constrained, actuation-limited dynamics add practical value beyond standard RL benchmarks.
minor comments (3)
- Abstract: the superiority claim is stated without any numerical values, trial counts, or statistical tests; adding one or two key metrics (e.g., success rate or fuel delta) would improve immediate readability.
- Section 4 (Experimental Setup): hyperparameter tables list network sizes and learning rates but omit the exact meta-RL horizon length and the number of independent seeds used for each architecture-algorithm pair; these details are needed for reproducibility.
- Figures 5-7: the performance plots lack error bars or shaded regions indicating variability across trials; adding them would strengthen the visual comparison of Mamba+PPO against the baselines.
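The variability reporting requested above could be produced along these lines. This is a minimal sketch that assumes one scalar metric per independent seed for each architecture-algorithm pair. The configuration names and all numbers are placeholders invented for illustration and are not results from the paper.

```python
import statistics


def summarize_across_seeds(values):
    """Return (mean, sample standard deviation) of one metric across seeds."""
    mean = statistics.mean(values)
    # The sample standard deviation needs at least two seeds.
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std


# Placeholder success rates per seed. NOT data from the paper.
placeholder_success = {
    "config_a": [0.8, 0.9, 1.0],
    "config_b": [0.6, 0.7, 0.8],
}

summary = {
    name: summarize_across_seeds(per_seed)
    for name, per_seed in placeholder_success.items()
}

for name, (mean, std) in summary.items():
    # mean +/- std is what an error bar or shaded band would display.
    print(f"{name}: {mean:.2f} +/- {std:.2f} over {len(placeholder_success[name])} seeds")
```

Reporting the seed count next to each mean and spread would also address the reproducibility point raised for Section 4.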
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work on meta-RL for ICCBF tuning in spacecraft RPO, including the recognition of its practical value in adversarial settings. We are pleased with the recommendation for minor revision.
Circularity Check
No significant circularity; empirical benchmark with independent results
full rationale
The manuscript is an empirical study comparing recurrent architectures (LSTM, GRU, Mamba) paired with PPO/SAC for meta-RL tuning of ICCBF class-K functions in cooperative and adversarial RPO scenarios. No derivation chain, uniqueness theorem, or fitted-parameter prediction is present that reduces reported performance metrics to quantities defined inside the same loop. The reference to prior work on the meta-RL + ICCBF framework is background context only and does not bear the load of the new architecture ranking. All claims rest on simulation metrics that are externally falsifiable and not forced by construction from the inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- meta-RL training hyperparameters and network sizes