pith. machine review for the scientific record.

arxiv: 2604.02438 · v2 · submitted 2026-04-02 · 💻 cs.LG

Recognition: no theorem link

Mitigating Data Scarcity in Spaceflight Applications for Offline Reinforcement Learning Using Physics-Informed Deep Generative Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:32 UTC · model grok-4.3

classification 💻 cs.LG
keywords physics-informed generative models · variational autoencoder · offline reinforcement learning · data scarcity · sim-to-real gap · planetary lander · spaceflight applications

The pith

A physics-informed split VAE learns the discrepancies between real trajectories and physics-model predictions to generate synthetic data that improves offline RL policies for planetary landing under severe data scarcity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that a mutual-information-based split variational autoencoder can produce physically consistent synthetic trajectories by capturing the differences between limited real observations and predictions from physics-based models. This augmentation then enables offline reinforcement learning controllers to achieve higher success rates on spaceflight tasks such as planetary landing, where collecting real data is prohibitively expensive. A sympathetic reader would care because the sim-to-real gap remains a core barrier to deploying learned controllers on actual spacecraft, and traditional data-generation methods often fail when real samples are too few to support reliable system identification or unconstrained generative models.

Core claim

The MI-VAE is a physics-informed generative model whose latent space is structured to separately encode physics-model predictions and real trajectory residuals through a mutual-information objective. By training on the difference between observed data and physics predictions, the model generates new samples that respect physical constraints while matching the statistical properties of the scarce real dataset. When these samples augment the training set for offline RL on a planetary lander problem, the resulting policies exhibit improved success rates, greater sample diversity, and higher statistical fidelity than policies trained with unaugmented data or data from standard VAEs.
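The abstract does not spell out the objective. Purely to fix notation, a plausible shape for a split-latent residual objective is sketched below; the residual parameterization, the sign and placement of the mutual-information term, and the weights $\beta$ and $\lambda$ are assumptions for exposition, not the paper's equation:

$$ \mathcal{L}(\theta,\phi) = \mathbb{E}_{q_\phi(z \mid r)}\big[\log p_\theta(r \mid z)\big] - \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid r) \,\|\, p(z)\big) - \lambda\, \hat{I}(z_p; z_r), \qquad r = x_{\text{real}} - \hat{x}_{\text{phys}}, $$

where $z = (z_p, z_r)$ is the split latent, $\hat{x}_{\text{phys}}$ is the physics-model prediction, and $\hat{I}(z_p; z_r)$ is an estimate of the mutual information between the two latent blocks.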

What carries the argument

The Mutual Information-based Split Variational Autoencoder (MI-VAE), a generative model that uses a split latent representation and mutual-information regularization to learn residuals between real trajectories and physics-based predictions, thereby enabling synthesis of constraint-respecting data.
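To make the machinery concrete, here is a minimal PyTorch sketch of a split-latent residual VAE. Everything in it is an illustrative assumption rather than the paper's architecture: the layer sizes, the residual parameterization r = x_real - x_phys, and the batch cross-correlation proxy standing in for the mutual-information term are all invented for exposition.

```python
# Minimal sketch of a split-latent residual VAE (hypothetical; not the paper's
# architecture). The model encodes the residual r = x_real - x_phys into two
# latent blocks, with a batch cross-correlation proxy standing in for the
# mutual-information term.
import torch
import torch.nn as nn

class SplitResidualVAE(nn.Module):
    def __init__(self, traj_dim, z_phys=8, z_res=8, hidden=128):
        super().__init__()
        self.z_phys, self.z_res = z_phys, z_res
        z = z_phys + z_res
        self.enc = nn.Sequential(nn.Linear(traj_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z))  # -> (mu, logvar)
        self.dec = nn.Sequential(nn.Linear(z, hidden), nn.ReLU(),
                                 nn.Linear(hidden, traj_dim))

    def forward(self, residual):
        mu, logvar = self.enc(residual).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.dec(z), mu, logvar

def mi_vae_style_loss(model, x_real, x_phys, beta=1.0, lam=0.1):
    residual = x_real - x_phys              # physics model as the baseline
    recon, mu, logvar = model(residual)
    rec = ((recon - residual) ** 2).sum(-1).mean()
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
    # Cheap stand-in for an MI penalty between the two latent blocks:
    # squared batch cross-correlation of the split posterior means.
    zp, zr = mu[:, :model.z_phys], mu[:, model.z_phys:]
    zp, zr = zp - zp.mean(0), zr - zr.mean(0)
    mi_proxy = (zp.T @ zr / len(zp)).pow(2).mean()
    return rec + beta * kl + lam * mi_proxy

@torch.no_grad()
def generate(model, x_phys):
    """Synthetic trajectory = physics prediction + decoded residual sample."""
    z = torch.randn(len(x_phys), model.z_phys + model.z_res)
    return x_phys + model.dec(z)
```

The design point the argument leans on is visible in `generate`: samples are decoded residuals added back onto physics predictions, so synthetic trajectories inherit the physics model's structure by construction.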

If this is right

  • Augmenting limited real datasets with MI-VAE samples produces higher statistical fidelity and sample diversity than standard VAE augmentation.
  • Offline RL policies trained on the augmented datasets achieve higher success rates on the planetary lander task.
  • The approach lowers the volume of real-world data needed to train robust controllers while still enforcing physical consistency.
  • The method offers a scalable route to narrowing the sim-to-real gap for autonomous systems in data-constrained environments such as spaceflight.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same residual-learning idea could be tested on other physical systems that possess approximate models but scarce real data, such as underwater vehicles or ground robots.
  • If the physics model contains systematic biases larger than the real-data residuals, the generated samples may reinforce rather than correct those biases.
  • Combining MI-VAE augmentation with lightweight online fine-tuning after deployment could further reduce remaining performance gaps.

Load-bearing premise

That physics-based models supply a sufficiently accurate baseline so the MI-VAE can learn meaningful corrections from only a small number of real trajectories.

What would settle it

An experiment on the planetary lander task in which offline RL policies trained on MI-VAE-augmented data show no improvement in success rate, fidelity, or diversity over policies trained on standard VAE-augmented data or real data alone would falsify the central claim.
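Operationally, the settling experiment is a three-arm comparison in which only the augmentation source varies. The harness below is a hypothetical sketch; every callable (the samplers, the offline-RL trainer, the success evaluator) is a placeholder supplied by the experimenter, not the paper's code.

```python
# Hypothetical harness for the settling experiment. Only the augmentation
# source differs across conditions; the offline-RL trainer, evaluator, and
# seeds are held fixed across arms.
import statistics

def settling_experiment(real_data, samplers, train_offline_rl,
                        evaluate_success, n_seeds=10):
    """samplers: name -> callable(real_data, seed) -> synthetic trajectories,
    e.g. {"real_only": lambda d, s: [], "vae": ..., "mi_vae": ...}."""
    results = {}
    for name, sample in samplers.items():
        rates = [evaluate_success(train_offline_rl(
                     list(real_data) + list(sample(real_data, seed)), seed))
                 for seed in range(n_seeds)]
        results[name] = (statistics.mean(rates), statistics.stdev(rates))
    return results
```

If the mi_vae arm fails to separate from the vae and real_only arms under matched seeds and training budget, the central claim falls.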

Original abstract

The deployment of reinforcement learning (RL)-based controllers on physical systems is often limited by poor generalization to real-world scenarios, known as the simulation-to-reality (sim-to-real) gap. This gap is particularly challenging in spaceflight, where real-world training data are scarce due to high cost and limited planetary exploration data. Traditional approaches, such as system identification and synthetic data generation, depend on sufficient data and often fail due to modeling assumptions or lack of physics-based constraints. We propose addressing this data scarcity by introducing physics-based learning bias in a generative model. Specifically, we develop the Mutual Information-based Split Variational Autoencoder (MI-VAE), a physics-informed VAE that learns differences between observed system trajectories and those predicted by physics-based models. The latent space of the MI-VAE enables generation of synthetic datasets that respect physical constraints. We evaluate MI-VAE on a planetary lander problem, focusing on limited real-world data and offline RL training. Results show that augmenting datasets with MI-VAE samples significantly improves downstream RL performance, outperforming standard VAEs in statistical fidelity, sample diversity, and policy success rate. This work demonstrates a scalable strategy for enhancing autonomous controller robustness in complex, data-constrained environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Mutual Information-based Split Variational Autoencoder (MI-VAE), a physics-informed generative model that learns deviations between limited real trajectories and physics-based predictions to produce synthetic data respecting physical constraints; this augmented data is then used for offline RL training on a planetary lander task, with results claiming superior statistical fidelity, sample diversity, and policy success rates compared to standard VAEs.

Significance. If the central claims hold after proper controls, the work offers a concrete mechanism for injecting physics bias into generative models to address data scarcity in sim-to-real RL transfer, which is particularly relevant for spaceflight applications where real trajectories are expensive to obtain; the approach could reduce dependence on purely data-driven augmentation while preserving physical plausibility.

major comments (2)
  1. [§5] §5 (Experiments / Results): the headline claim that MI-VAE augmentation improves downstream RL success rate over standard VAE augmentation on the planetary lander task is not supported by an ablation that removes the physics-based reconstruction loss or the mutual-information split while holding the split-VAE architecture and training protocol fixed; without this isolation, it remains possible that any sufficiently expressive generative model trained on the same limited data would yield comparable gains in fidelity, diversity, and policy performance.
  2. [Methods] Methods section: the description of the MI-VAE latent space encoding physics deviations does not include a quantitative check (e.g., via an equation or table) demonstrating that the reported improvements do not reduce to a fitted parameter by construction when the external physics model is accurate; this is load-bearing for the claim that the method meaningfully corrects for model-reality mismatch rather than simply fitting the limited real data.
minor comments (2)
  1. [§4] The abstract and §4 (Evaluation) assert performance gains but supply incomplete definitions of the statistical fidelity and sample diversity metrics; explicit formulas or references to standard measures (e.g., MMD, FID) would improve reproducibility (a minimal MMD sketch follows this list).
  2. [Methods] Notation in the MI-VAE loss (likely Eq. (3) or (4)) mixes reconstruction, KL, and mutual-information terms without a clear table of hyperparameter values used in the planetary lander experiments; adding this would aid replication.
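A concrete candidate for pinning down the fidelity metric named in minor comment 1: the standard unbiased RBF-kernel MMD² estimator between real and synthetic trajectory sets. This is a generic implementation of a measure the report mentions, not the paper's definition.

```python
# Unbiased RBF-kernel MMD^2 between real and synthetic sample sets; a generic
# implementation of one metric the report names, not the paper's definition.
import numpy as np

def mmd2_rbf(X, Y, sigma=1.0):
    """X: (n, d) real samples, Y: (m, d) synthetic samples."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    n, m = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)   # drop self-similarity terms (unbiased form)
    np.fill_diagonal(Kyy, 0.0)
    return (Kxx.sum() / (n * (n - 1))
            + Kyy.sum() / (m * (m - 1))
            - 2.0 * Kxy.mean())
```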

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important opportunities to strengthen the experimental validation and clarify the method's mechanisms. We respond to each major comment below and will incorporate the suggested revisions in the updated version.

Point-by-point responses
  1. Referee: [§5] §5 (Experiments / Results): the headline claim that MI-VAE augmentation improves downstream RL success rate over standard VAE augmentation on the planetary lander task is not supported by an ablation that removes the physics-based reconstruction loss or the mutual-information split while holding the split-VAE architecture and training protocol fixed; without this isolation, it remains possible that any sufficiently expressive generative model trained on the same limited data would yield comparable gains in fidelity, diversity, and policy performance.

    Authors: We agree that an explicit ablation isolating the physics-based reconstruction loss and mutual-information term—while holding the split-VAE architecture and training protocol fixed—would provide stronger evidence that the observed gains are attributable to the physics-informed components rather than model capacity alone. The current manuscript compares MI-VAE to a standard VAE baseline, which lacks both the split architecture and the physics loss. In the revision we will add the requested ablation: a split-VAE trained without the physics reconstruction loss and without the MI objective, using identical architecture, latent dimensions, and training protocol. This will quantify the incremental benefit of the physics bias and address the concern that any expressive generative model could produce similar results. revision: yes

  2. Referee: [Methods] Methods section: the description of the MI-VAE latent space encoding physics deviations does not include a quantitative check (e.g., via an equation or table) demonstrating that the reported improvements do not reduce to a fitted parameter by construction when the external physics model is accurate; this is load-bearing for the claim that the method meaningfully corrects for model-reality mismatch rather than simply fitting the limited real data.

    Authors: We acknowledge the need for a quantitative demonstration that MI-VAE captures genuine model-reality deviations rather than trivially fitting the scarce real data. In the revised Methods section we will add a quantitative check consisting of (i) an equation for the deviation term (real trajectory minus physics-model prediction) and (ii) a table reporting the L2 norm of this deviation across trajectories, the mutual-information value between the split latents, and a comparison of generative fidelity when the physics model is provided versus withheld. When the external physics model is accurate, the learned deviation term approaches zero; we will include a controlled experiment verifying this behavior to confirm the method addresses mismatch rather than acting as a pure data fitter. revision: yes
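One way to write the promised deviation check (our notation, not the authors'): for trajectory $i$ with physics prediction $\hat{x}^{\text{phys}}_i$,

$$ \delta_i(t) = x_i(t) - \hat{x}^{\text{phys}}_i(t), \qquad \|\delta_i\|_2 = \Big(\sum_t \|\delta_i(t)\|_2^2\Big)^{1/2}, $$

with the sanity condition that $\|\delta_i\|_2$, and with it the residual latent's divergence from its prior, should approach zero as the physics model approaches the true dynamics.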

Circularity Check

0 steps flagged

No circularity; empirical evaluation on downstream RL task uses independent physics baseline

Full rationale

The paper introduces MI-VAE as a generative model that learns residuals between limited real trajectories and predictions from an external physics-based model, then augments data for offline RL on a planetary lander task. Performance gains are reported via statistical comparisons of fidelity, diversity, and policy success rate. No equation or claim reduces a 'prediction' to a fitted input by construction, no self-citation chain bears the central result, and the physics model is treated as an independent prior rather than derived from the same data. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim rests on availability of accurate physics models and the ability of the MI-VAE latent space to produce valid augmentations; abstract introduces no explicit free parameters or new entities beyond the model itself.

axioms (1)
  • domain assumption: Physics-based models provide usable baseline trajectory predictions against which real observations can be compared.
    MI-VAE is defined to learn differences from these predictions.
invented entities (1)
  • MI-VAE latent space encoding physics deviations (no independent evidence)
    purpose: To enable generation of synthetic trajectories that respect physical constraints
    New model component introduced to solve data scarcity; no independent falsifiable evidence provided in abstract.

pith-pipeline@v0.9.0 · 5536 in / 1221 out tokens · 37160 ms · 2026-05-13T21:32:35.532543+00:00 · methodology

