arxiv: 2606.17414 · v1 · pith:E2XZ5WBGnew · submitted 2026-06-16 · 💻 cs.LG · math.DS

Memory-Efficient Meta-Reinforcement Learning for Adaptive Safety-Critical Control in Adversarial Spacecraft Proximity Operations

Pith reviewed 2026-06-27 01:56 UTC · model grok-4.3

classification 💻 cs.LG math.DS
keywords meta-reinforcement learning · control barrier functions · spacecraft proximity operations · state space models · adversarial scenarios · safety-critical control · proximal policy optimization · Mamba

The pith

Mamba paired with PPO outperforms LSTM and GRU policies in meta-RL for learning the class-K functions of input-constrained control barrier functions, in both cooperative and adversarial spacecraft rendezvous.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends prior meta-RL work on tuning class-K functions for input-constrained control barrier functions by comparing three recurrent architectures and two training algorithms on spacecraft proximity operations. It evaluates performance in cooperative scenarios and in uncooperative ones where the target spacecraft deliberately reduces the chaser's safety margin. Results show that selective state space models paired with proximal policy optimization deliver higher task completion rates, fewer safety violations, and lower fuel use than the alternatives across all tested conditions. A reader would care because the finding identifies a practical memory-efficient setup for adaptive safety-critical controllers that must operate under thrust limits and potential opposition.

Core claim

The paper establishes that selective state space models such as Mamba, when used with proximal policy optimization to learn the class-K functions defining the input-constrained control barrier function recursion via meta-reinforcement learning, achieve superior task completion, safety maintenance, and fuel savings relative to long short-term memory and gated recurrent unit networks trained with either proximal policy optimization or soft actor-critic, in both cooperative and adversarial spacecraft proximity operation scenarios.
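For readers unfamiliar with the recursion named in the claim, the block below writes out the ICCBF construction as it is commonly stated in the control barrier function literature. It is a sketch under assumptions: a control-affine system $\dot{x} = f(x) + g(x)u$ with input set $\mathcal{U}$ and safe set $\{x : h(x) \ge 0\}$. The symbols $b_i$, $\alpha_i$, $\mathcal{C}_i$, and $N$ are notation chosen here and may differ from the paper's.

```latex
% Assumed setting: \dot{x} = f(x) + g(x)u, \; u \in \mathcal{U}, \; \mathcal{S} = \{x : h(x) \ge 0\}
\begin{align}
b_0(x) &= h(x), \\
b_{i+1}(x) &= \inf_{u \in \mathcal{U}} \Big[ L_f b_i(x) + L_g b_i(x)\,u + \alpha_i\big(b_i(x)\big) \Big],
  \qquad i = 0, \dots, N-1, \\
\mathcal{C}_i &= \{ x : b_i(x) \ge 0 \}, \qquad
\mathcal{C}^{*} = \bigcap_{i=0}^{N} \mathcal{C}_i .
\end{align}
% b_N is an ICCBF on \mathcal{C}^{*} if some class-\mathcal{K} function \alpha_N satisfies
\begin{equation}
\sup_{u \in \mathcal{U}} \Big[ L_f b_N(x) + L_g b_N(x)\,u + \alpha_N\big(b_N(x)\big) \Big] \ge 0
\quad \text{for all } x \in \mathcal{C}^{*} .
\end{equation}
```

Each $\alpha_i$ is a class-$\mathcal{K}$ function (continuous, strictly increasing, zero at the origin). These are the functions that the meta-RL policy is said to tune; the choice of $\alpha_i$ shapes how large $\mathcal{C}^{*}$ is and how aggressively the controller may approach the boundary of the safe set.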

What carries the argument

Meta-RL training of recurrent networks to parameterize class-K functions inside the ICCBF forward-invariance recursion, evaluated by task success, safety constraint satisfaction, and propellant consumption under adversarial target motion.
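To make "parameterizing a class-K function" concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a 1D single integrator (`x' = u`), a position limit `x_max`, and a linear class-K function `alpha(r) = gain * r` whose gain comes from an unconstrained scalar (standing in for a policy output) passed through a softplus. All names (`softplus`, `class_k_linear`, `cbf_filter`, `simulate`) are hypothetical, and the example uses a plain CBF filter rather than the full ICCBF recursion.

```python
import math


def softplus(z: float) -> float:
    """Map an unconstrained scalar (e.g. a policy output) to a positive gain."""
    return math.log1p(math.exp(z))


def class_k_linear(gain: float):
    """Return alpha(r) = gain * r, which is class-K whenever gain > 0."""
    if gain <= 0.0:
        raise ValueError("gain must be positive for alpha to be class-K")
    return lambda r: gain * r


def cbf_filter(u_nom: float, x: float, x_max: float, alpha, u_max: float) -> float:
    """Closed-form safety filter for x' = u with h(x) = x_max - x.

    The CBF condition h' + alpha(h) >= 0 reduces to u <= alpha(h).
    """
    h = x_max - x
    u = min(u_nom, alpha(h))
    # Clip to the actuator limits. If alpha(h) < -u_max, clipping would break
    # the CBF condition; that conflict is what the ICCBF construction addresses.
    # It cannot occur in this toy case while h >= 0.
    return max(-u_max, min(u_max, u))


def simulate(raw_gain: float, steps: int = 500, dt: float = 0.01) -> float:
    """Drive toward the limit with u_nom = 1 and return the final position."""
    alpha = class_k_linear(softplus(raw_gain))
    x = 0.0
    for _ in range(steps):
        u = cbf_filter(u_nom=1.0, x=x, x_max=1.0, alpha=alpha, u_max=1.0)
        x += dt * u  # forward Euler step
    return x


if __name__ == "__main__":
    for raw in (-1.0, 2.0):
        print(f"raw gain {raw:+.1f} -> final x = {simulate(raw):.4f}")
```

A larger gain lets the state approach the limit more quickly while still never crossing it, which is the conservatism trade-off that a learned class-K function is meant to manage.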

If this is right

  • Controllers for rendezvous can maintain safety margins against uncooperative targets while using less fuel than current recurrent baselines.
  • The same meta-RL pipeline can be applied to other nonlinear systems that require input-constrained safety filters.
  • State-space-model-based policies reduce memory footprint during online adaptation compared with LSTM or GRU equivalents.
  • Safety-critical meta-RL becomes viable for missions where the target may actively degrade the chaser's feasible set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Onboard spacecraft computers with limited RAM could run these policies in real time where LSTM versions would exceed memory limits.
  • The performance edge may allow shorter meta-training episodes, lowering the computational cost of adapting to new orbital regimes.
  • Similar architecture comparisons could be run for other safety-filtered control problems such as autonomous underwater vehicles or aerial collision avoidance.

Load-bearing premise

The simulation environments and selected adversarial behaviors are representative enough that performance gaps seen in training will appear under real spacecraft dynamics and disturbances.

What would settle it

A hardware-in-the-loop experiment or on-orbit test in which the chaser encounters a target with unmodeled dynamics or different adversarial tactics would show whether the reported gains in completion, safety, and fuel use persist.

Original abstract

Autonomous spacecraft rendezvous and proximity operations (RPO) require controllers that guarantee safety under thrust constraints while minimizing fuel expenditure. Input-constrained control barrier functions (ICCBFs) provide a control method for nonlinear systems with actuation constraints that construct a forward-invariant safe set. Previous work has shown that learning class-$\mathcal{K}$ functions defining the ICCBF recursion via meta reinforcement learning (meta-RL) yields a robust, non-greedy approach to safety-critical control in RPO. This paper extends that framework further by investigating the performance of three recurrent network architectures (Long Short Term Memory (LSTM), Gated Recurrent Unit (GRU), Selective State Space Model (Mamba)) and two training algorithms (Proximal Policy Optimization (PPO) and Soft Actor Critic (SAC)) to identify the best setup for tuning ICCBF class-K functions via meta-RL. In addition to cooperative test cases, performance is evaluated in the presence of adversarial behavior where the target spacecraft behaves in a way that worsens the safety of the chaser spacecraft. Results indicate that state space models such as Mamba when used with PPO achieve superior task completion, safety, and fuel-savings compared to other architectures, across all cooperative and uncooperative scenarios tested.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript extends prior meta-RL work on learning class-K functions for Input-Constrained Control Barrier Functions (ICCBFs) in spacecraft rendezvous and proximity operations. It empirically compares three recurrent architectures (LSTM, GRU, Mamba) paired with PPO and SAC across cooperative and adversarial target behaviors, reporting that Mamba+PPO yields the highest task completion rates, safety margins, and fuel efficiency.

Significance. If the reported rankings hold under the stated experimental conditions, the results supply actionable guidance on architecture selection for meta-RL safety filters in aerospace control. The explicit inclusion of adversarial scenarios and the focus on fuel-constrained, actuation-limited dynamics add practical value beyond standard RL benchmarks.

minor comments (3)
  1. [Abstract] The superiority claim is stated without any numerical values, trial counts, or statistical tests; adding one or two key metrics (e.g., success rate or fuel delta) would improve immediate readability.
  2. [Section 4, Experimental Setup] The hyperparameter tables list network sizes and learning rates but omit the exact meta-RL horizon length and the number of independent seeds used for each architecture-algorithm pair; these details are needed for reproducibility.
  3. [Figures 5-7] The performance plots lack error bars or shaded regions indicating variability across trials; adding them would strengthen the visual comparison of Mamba+PPO against the baselines.
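
The variability reporting requested in comments 2 and 3 could be computed along the following lines. This is a minimal sketch under assumptions: each architecture-algorithm pair has one scalar metric per independent seed, and the mean plus or minus the standard error is an acceptable error bar. The function names and the numbers in `placeholder_results` are made up for illustration and are not results from the paper.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) over per-seed metric values."""
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate variability")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))  # sample std / sqrt(n)
    return mean, sem


def summarize(results):
    """Map each (architecture, algorithm) key to its (mean, sem) pair."""
    return {config: mean_and_sem(per_seed) for config, per_seed in results.items()}


if __name__ == "__main__":
    # Placeholder values only: NOT taken from the paper.
    placeholder_results = {
        ("arch_a", "algo_x"): [0.8, 0.9, 1.0],
        ("arch_b", "algo_x"): [0.5, 0.7, 0.9],
    }
    for config, (mean, sem) in summarize(placeholder_results).items():
        print(f"{config}: {mean:.3f} +/- {sem:.3f}")
```

With only a handful of seeds, a t-distribution interval or a bootstrap would be a more careful choice than a fixed multiple of the standard error.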

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work on meta-RL for ICCBF tuning in spacecraft RPO, including the recognition of its practical value in adversarial settings. We are pleased with the recommendation for minor revision.

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark with independent results

full rationale

The manuscript is an empirical study comparing recurrent architectures (LSTM, GRU, Mamba) paired with PPO/SAC for meta-RL tuning of ICCBF class-K functions in cooperative and adversarial RPO scenarios. No derivation chain, uniqueness theorem, or fitted-parameter prediction is present that reduces reported performance metrics to quantities defined inside the same loop. The reference to prior work on the meta-RL + ICCBF framework is background context only and does not bear the load of the new architecture ranking. All claims rest on simulation metrics that are externally falsifiable and not forced by construction from the inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The paper is an empirical ML application study. No new physical axioms or invented entities are introduced. Free parameters consist of standard RL hyperparameters and network sizes whose values are not reported in the abstract.

free parameters (1)
  • meta-RL training hyperparameters and network sizes
    Typical for any RL study; the values are not stated in the abstract and would have been fitted or chosen in the course of producing the reported ranking.

pith-pipeline@v0.9.1-grok · 5765 in / 1174 out tokens · 31805 ms · 2026-06-27T01:56:01.765944+00:00 · methodology
