pith. machine review for the scientific record.

arxiv: 2604.02438 · v2 · submitted 2026-04-02 · 💻 cs.LG

Recognition: no theorem link

Mitigating Data Scarcity in Spaceflight Applications for Offline Reinforcement Learning Using Physics-Informed Deep Generative Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:32 UTC · model grok-4.3

classification 💻 cs.LG
keywords physics-informed generative models · variational autoencoder · offline reinforcement learning · data scarcity · sim-to-real gap · planetary lander · spaceflight applications

The pith

A physics-informed split VAE learns the discrepancies between real trajectories and physics-model predictions to generate synthetic data that improves offline RL policies for planetary landing under severe data scarcity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that a mutual-information-based split variational autoencoder can produce physically consistent synthetic trajectories by capturing the differences between limited real observations and predictions from physics-based models. This augmentation then enables offline reinforcement learning controllers to achieve higher success rates on spaceflight tasks such as planetary landing, where collecting real data is prohibitively expensive. A sympathetic reader would care because the sim-to-real gap remains a core barrier to deploying learned controllers on actual spacecraft, and traditional data-generation methods often fail when real samples are too few to support reliable system identification or unconstrained generative models.

Core claim

The MI-VAE is a physics-informed generative model whose latent space is structured to separately encode physics-model predictions and real trajectory residuals through a mutual-information objective. By training on the difference between observed data and physics predictions, the model generates new samples that respect physical constraints while matching the statistical properties of the scarce real dataset. When these samples augment the training set for offline RL on a planetary lander problem, the resulting policies exhibit improved success rates, greater sample diversity, and higher statistical fidelity than policies trained with unaugmented data or data from standard VAEs.
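The abstract does not spell out the objective. Purely to fix notation, a plausible shape for a split-latent residual objective is sketched below; the residual parameterization, the sign and placement of the mutual-information term, and the weights $\beta$ and $\lambda$ are assumptions for exposition, not the paper's equation:

$$ \mathcal{L}(\theta,\phi) = \mathbb{E}_{q_\phi(z \mid r)}\big[\log p_\theta(r \mid z)\big] - \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid r) \,\|\, p(z)\big) - \lambda\, \hat{I}(z_p; z_r), \qquad r = x_{\text{real}} - \hat{x}_{\text{phys}}, $$

where $z = (z_p, z_r)$ is the split latent, $\hat{x}_{\text{phys}}$ is the physics-model prediction, and $\hat{I}(z_p; z_r)$ is an estimate of the mutual information between the two latent blocks.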

What carries the argument

The Mutual Information-based Split Variational Autoencoder (MI-VAE), a generative model that uses a split latent representation and mutual-information regularization to learn residuals between real trajectories and physics-based predictions, thereby enabling synthesis of constraint-respecting data.
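To make the machinery concrete, here is a minimal PyTorch sketch of a split-latent residual VAE. Everything in it is an illustrative assumption rather than the paper's architecture: the layer sizes, the residual parameterization r = x_real - x_phys, and the batch cross-correlation proxy standing in for the mutual-information term are all invented for exposition.

```python
# Minimal sketch of a split-latent residual VAE (hypothetical; not the paper's
# architecture). The model encodes the residual r = x_real - x_phys into two
# latent blocks, with a batch cross-correlation proxy standing in for the
# mutual-information term.
import torch
import torch.nn as nn

class SplitResidualVAE(nn.Module):
    def __init__(self, traj_dim, z_phys=8, z_res=8, hidden=128):
        super().__init__()
        self.z_phys, self.z_res = z_phys, z_res
        z = z_phys + z_res
        self.enc = nn.Sequential(nn.Linear(traj_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z))  # -> (mu, logvar)
        self.dec = nn.Sequential(nn.Linear(z, hidden), nn.ReLU(),
                                 nn.Linear(hidden, traj_dim))

    def forward(self, residual):
        mu, logvar = self.enc(residual).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.dec(z), mu, logvar

def mi_vae_style_loss(model, x_real, x_phys, beta=1.0, lam=0.1):
    residual = x_real - x_phys              # physics model as the baseline
    recon, mu, logvar = model(residual)
    rec = ((recon - residual) ** 2).sum(-1).mean()
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
    # Cheap stand-in for an MI penalty between the two latent blocks:
    # squared batch cross-correlation of the split posterior means.
    zp, zr = mu[:, :model.z_phys], mu[:, model.z_phys:]
    zp, zr = zp - zp.mean(0), zr - zr.mean(0)
    mi_proxy = (zp.T @ zr / len(zp)).pow(2).mean()
    return rec + beta * kl + lam * mi_proxy

@torch.no_grad()
def generate(model, x_phys):
    """Synthetic trajectory = physics prediction + decoded residual sample."""
    z = torch.randn(len(x_phys), model.z_phys + model.z_res)
    return x_phys + model.dec(z)
```

The design point the argument leans on is visible in `generate`: samples are decoded residuals added back onto physics predictions, so synthetic trajectories inherit the physics model's structure by construction.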

If this is right

  • Augmenting limited real datasets with MI-VAE samples produces higher statistical fidelity and sample diversity than standard VAE augmentation.
  • Offline RL policies trained on the augmented datasets achieve higher success rates on the planetary lander task.
  • The approach lowers the volume of real-world data needed to train robust controllers while still enforcing physical consistency.
  • The method offers a scalable route to narrowing the sim-to-real gap for autonomous systems in data-constrained environments such as spaceflight.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same residual-learning idea could be tested on other physical systems that possess approximate models but scarce real data, such as underwater vehicles or ground robots.
  • If the physics model contains systematic biases larger than the real-data residuals, the generated samples may reinforce rather than correct those biases.
  • Combining MI-VAE augmentation with lightweight online fine-tuning after deployment could further reduce remaining performance gaps.

Load-bearing premise

That physics-based models supply a sufficiently accurate baseline so the MI-VAE can learn meaningful corrections from only a small number of real trajectories.

What would settle it

An experiment on the planetary lander task in which offline RL policies trained on MI-VAE-augmented data show no improvement in success rate, fidelity, or diversity over policies trained on standard VAE-augmented data or real data alone would falsify the central claim.
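Operationally, the settling experiment is a three-arm comparison in which only the augmentation source varies. The harness below is a hypothetical sketch; every callable (the samplers, the offline-RL trainer, the success evaluator) is a placeholder supplied by the experimenter, not the paper's code.

```python
# Hypothetical harness for the settling experiment. Only the augmentation
# source differs across conditions; the offline-RL trainer, evaluator, and
# seeds are held fixed across arms.
import statistics

def settling_experiment(real_data, samplers, train_offline_rl,
                        evaluate_success, n_seeds=10):
    """samplers: name -> callable(real_data, seed) -> synthetic trajectories,
    e.g. {"real_only": lambda d, s: [], "vae": ..., "mi_vae": ...}."""
    results = {}
    for name, sample in samplers.items():
        rates = [evaluate_success(train_offline_rl(
                     list(real_data) + list(sample(real_data, seed)), seed))
                 for seed in range(n_seeds)]
        results[name] = (statistics.mean(rates), statistics.stdev(rates))
    return results
```

If the mi_vae arm fails to separate from the vae and real_only arms under matched seeds and training budget, the central claim falls.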

Original abstract

The deployment of reinforcement learning (RL)-based controllers on physical systems is often limited by poor generalization to real-world scenarios, known as the simulation-to-reality (sim-to-real) gap. This gap is particularly challenging in spaceflight, where real-world training data are scarce due to high cost and limited planetary exploration data. Traditional approaches, such as system identification and synthetic data generation, depend on sufficient data and often fail due to modeling assumptions or lack of physics-based constraints. We propose addressing this data scarcity by introducing physics-based learning bias in a generative model. Specifically, we develop the Mutual Information-based Split Variational Autoencoder (MI-VAE), a physics-informed VAE that learns differences between observed system trajectories and those predicted by physics-based models. The latent space of the MI-VAE enables generation of synthetic datasets that respect physical constraints. We evaluate MI-VAE on a planetary lander problem, focusing on limited real-world data and offline RL training. Results show that augmenting datasets with MI-VAE samples significantly improves downstream RL performance, outperforming standard VAEs in statistical fidelity, sample diversity, and policy success rate. This work demonstrates a scalable strategy for enhancing autonomous controller robustness in complex, data-constrained environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Mutual Information-based Split Variational Autoencoder (MI-VAE), a physics-informed generative model that learns deviations between limited real trajectories and physics-based predictions to produce synthetic data respecting physical constraints; this augmented data is then used for offline RL training on a planetary lander task, with results claiming superior statistical fidelity, sample diversity, and policy success rates compared to standard VAEs.

Significance. If the central claims hold after proper controls, the work offers a concrete mechanism for injecting physics bias into generative models to address data scarcity in sim-to-real RL transfer, which is particularly relevant for spaceflight applications where real trajectories are expensive to obtain; the approach could reduce dependence on purely data-driven augmentation while preserving physical plausibility.

major comments (2)
  1. [§5] §5 (Experiments / Results): the headline claim that MI-VAE augmentation improves downstream RL success rate over standard VAE augmentation on the planetary lander task is not supported by an ablation that removes the physics-based reconstruction loss or the mutual-information split while holding the split-VAE architecture and training protocol fixed; without this isolation, it remains possible that any sufficiently expressive generative model trained on the same limited data would yield comparable gains in fidelity, diversity, and policy performance.
  2. [Methods] Methods section: the description of the MI-VAE latent space encoding physics deviations does not include a quantitative check (e.g., via an equation or table) demonstrating that the reported improvements do not reduce to a fitted parameter by construction when the external physics model is accurate; this is load-bearing for the claim that the method meaningfully corrects for model-reality mismatch rather than simply fitting the limited real data.
minor comments (2)
  1. [§4] The abstract and §4 (Evaluation) assert performance gains but supply incomplete definitions of the statistical fidelity and sample diversity metrics; explicit formulas or references to standard measures (e.g., MMD, FID) would improve reproducibility (a minimal MMD sketch follows this list).
  2. [Methods] Notation in the MI-VAE loss (likely Eq. (3) or (4)) mixes reconstruction, KL, and mutual-information terms without a clear table of hyperparameter values used in the planetary lander experiments; adding this would aid replication.
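A concrete candidate for pinning down the fidelity metric named in minor comment 1: the standard unbiased RBF-kernel MMD² estimator between real and synthetic trajectory sets. This is a generic implementation of a measure the report mentions, not the paper's definition.

```python
# Unbiased RBF-kernel MMD^2 between real and synthetic sample sets; a generic
# implementation of one metric the report names, not the paper's definition.
import numpy as np

def mmd2_rbf(X, Y, sigma=1.0):
    """X: (n, d) real samples, Y: (m, d) synthetic samples."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    n, m = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)   # drop self-similarity terms (unbiased form)
    np.fill_diagonal(Kyy, 0.0)
    return (Kxx.sum() / (n * (n - 1))
            + Kyy.sum() / (m * (m - 1))
            - 2.0 * Kxy.mean())
```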

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important opportunities to strengthen the experimental validation and clarify the method's mechanisms. We respond to each major comment below and will incorporate the suggested revisions in the updated version.

Point-by-point responses
  1. Referee: [§5] §5 (Experiments / Results): the headline claim that MI-VAE augmentation improves downstream RL success rate over standard VAE augmentation on the planetary lander task is not supported by an ablation that removes the physics-based reconstruction loss or the mutual-information split while holding the split-VAE architecture and training protocol fixed; without this isolation, it remains possible that any sufficiently expressive generative model trained on the same limited data would yield comparable gains in fidelity, diversity, and policy performance.

    Authors: We agree that an explicit ablation isolating the physics-based reconstruction loss and mutual-information term—while holding the split-VAE architecture and training protocol fixed—would provide stronger evidence that the observed gains are attributable to the physics-informed components rather than model capacity alone. The current manuscript compares MI-VAE to a standard VAE baseline, which lacks both the split architecture and the physics loss. In the revision we will add the requested ablation: a split-VAE trained without the physics reconstruction loss and without the MI objective, using identical architecture, latent dimensions, and training protocol. This will quantify the incremental benefit of the physics bias and address the concern that any expressive generative model could produce similar results. revision: yes

  2. Referee: [Methods] Methods section: the description of the MI-VAE latent space encoding physics deviations does not include a quantitative check (e.g., via an equation or table) demonstrating that the reported improvements do not reduce to a fitted parameter by construction when the external physics model is accurate; this is load-bearing for the claim that the method meaningfully corrects for model-reality mismatch rather than simply fitting the limited real data.

    Authors: We acknowledge the need for a quantitative demonstration that MI-VAE captures genuine model-reality deviations rather than trivially fitting the scarce real data. In the revised Methods section we will add a quantitative check consisting of (i) an equation for the deviation term (real trajectory minus physics-model prediction) and (ii) a table reporting the L2 norm of this deviation across trajectories, the mutual-information value between the split latents, and a comparison of generative fidelity when the physics model is provided versus withheld. When the external physics model is accurate, the learned deviation term approaches zero; we will include a controlled experiment verifying this behavior to confirm the method addresses mismatch rather than acting as a pure data fitter. revision: yes
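One way to write the promised deviation check (our notation, not the authors'): for trajectory $i$ with physics prediction $\hat{x}^{\text{phys}}_i$,

$$ \delta_i(t) = x_i(t) - \hat{x}^{\text{phys}}_i(t), \qquad \|\delta_i\|_2 = \Big(\sum_t \|\delta_i(t)\|_2^2\Big)^{1/2}, $$

with the sanity condition that $\|\delta_i\|_2$, and with it the residual latent's divergence from its prior, should approach zero as the physics model approaches the true dynamics.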

Circularity Check

0 steps flagged

No circularity; empirical evaluation on downstream RL task uses independent physics baseline

Full rationale

The paper introduces MI-VAE as a generative model that learns residuals between limited real trajectories and predictions from an external physics-based model, then augments data for offline RL on a planetary lander task. Performance gains are reported via statistical comparisons of fidelity, diversity, and policy success rate. No equation or claim reduces a 'prediction' to a fitted input by construction, no self-citation chain bears the central result, and the physics model is treated as an independent prior rather than derived from the same data. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim rests on availability of accurate physics models and the ability of the MI-VAE latent space to produce valid augmentations; abstract introduces no explicit free parameters or new entities beyond the model itself.

axioms (1)
  • domain assumption: Physics-based models provide usable baseline trajectory predictions against which real observations can be compared.
    MI-VAE is defined to learn differences from these predictions.
invented entities (1)
  • MI-VAE latent space encoding physics deviations (no independent evidence)
    purpose: To enable generation of synthetic trajectories that respect physical constraints
    New model component introduced to solve data scarcity; no independent falsifiable evidence provided in abstract.

pith-pipeline@v0.9.0 · 5536 in / 1221 out tokens · 37160 ms · 2026-05-13T21:32:35.532543+00:00 · methodology

