pith. machine review for the scientific record.

arxiv: 2604.08685 · v1 · submitted 2026-04-09 · 💻 cs.AI

Recognition: unknown

RAMP: Hybrid DRL for Online Learning of Numeric Action Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:01 UTC · model grok-4.3

classification 💻 cs.AI
keywords numeric action models · online learning · deep reinforcement learning · automated planning · hybrid RL and planning · IPC numeric domains · PDDLGym

The pith

RAMP learns numeric action models online by interleaving deep RL policy training with model learning and planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Obtaining accurate action models for numeric planning is hard, and prior methods require offline expert traces as input. The paper introduces RAMP, which runs a deep reinforcement learning policy while learning a numeric action model from its own environment interactions and using the learned model to generate plans whenever possible. These three activities reinforce one another: the policy supplies interaction data that refines the model, and the planner supplies higher-quality trajectories that improve the policy. The resulting system is evaluated on standard IPC numeric domains after conversion to Gym environments via a new Numeric PDDLGym interface, and it records higher solvability rates and better plan quality than the pure RL baseline PPO.

Core claim

RAMP simultaneously trains a Deep Reinforcement Learning policy, learns a numeric action model from past interactions, and uses that model to plan future actions when possible. These components form a positive feedback loop in which the RL policy gathers data to refine the action model while the planner generates plans to continue training the RL policy.
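
To make the claimed interleaving concrete, the sketch below walks through one RAMP-style episode as described above. It is a minimal illustration, not the paper's code: the environment, policy, model learner, and planner interfaces (`policy.act`, `model_learner.fit`, `planner.try_solve`, and so on) are hypothetical placeholders; only the control flow follows the stated claim.

```python
# Hypothetical sketch of one episode of the described loop: act with the RL
# policy, log transitions, refit a numeric action model, and follow a plan
# computed from that model whenever the planner can produce one.
def ramp_episode(env, policy, model_learner, planner, trajectories):
    state, _ = env.reset()

    # Try to plan with the incumbent learned model first (may fail early on,
    # when the model is still empty or too inaccurate).
    plan = None
    if model_learner.has_model():
        plan = planner.try_solve(model_learner.model(), state)  # None if unsolved

    episode, done = [], False
    while not done:
        # Follow the plan when one is available, otherwise fall back to the policy.
        action = plan.pop(0) if plan else policy.act(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode.append((state, action, next_state, reward))
        state, done = next_state, terminated or truncated

    # Both learners consume the same interaction data: the policy improves from
    # the (possibly planner-generated) trajectory, and the model is refit on the
    # accumulated transitions, which is the claimed positive feedback loop.
    trajectories.append(episode)
    policy.update(episode)
    model_learner.fit(trajectories)
```

How the real system schedules planner invocations against policy actions is one of the details the referee report below asks the authors to spell out.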

What carries the argument

The positive feedback loop that couples online numeric action model learning to both a DRL policy and a planner that can be invoked on the current learned model.

If this is right

  • The RL policy receives higher-quality trajectories from the planner, accelerating policy improvement.
  • The learned numeric model becomes usable for planning without any expert traces supplied in advance.
  • Solvability and plan quality both rise relative to a pure DRL baseline on the same numeric domains.
  • Online model learning and planning can be sustained together without requiring separate offline data collection phases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same interleaving pattern may let other model-free learners bootstrap usable models in numeric control settings where no prior model exists.
  • If the loop remains stable, it offers a route to reduce manual action-model engineering for numeric planning tasks.
  • Extending the approach to domains with continuous numeric effects or partial observability would test whether the feedback remains beneficial.

Load-bearing premise

The action model learned from RL interactions stays accurate and stable enough that the plans it produces improve the RL policy rather than introducing divergence or low-quality training data.

What would settle it

A run on one of the standard IPC numeric domains in which RAMP produces lower solvability or worse plan quality than PPO because the learned numeric model is too inaccurate to support reliable planning.

Figures

Figures reproduced from arXiv: 2604.08685 by Argaman Mordoch, Roni Stern, Shahaf S. Shperberg, Yarin Benyamin.

Figure 1
Figure 1: A high-level diagram of the RAMP strategy.
Figure 2
Figure 2: The Numeric PDDLGym observation encoding.
Figure 3
Figure 3: (Left) Rolling average success rate with 95% confidence intervals. (Right) Cumulative solution length with 95% confidence intervals.
read the original abstract

Automated planning algorithms require an action model specifying the preconditions and effects of each action, but obtaining such a model is often hard. Learning action models from observations is feasible, but existing algorithms for numeric domains are offline, requiring expert traces as input. We propose the Reinforcement learning, Action Model learning, and Planning (RAMP) strategy for learning numeric planning action models online via interactions with the environment. RAMP simultaneously trains a Deep Reinforcement Learning (DRL) policy, learns a numeric action model from past interactions, and uses that model to plan future actions when possible. These components form a positive feedback loop: the RL policy gathers data to refine the action model, while the planner generates plans to continue training the RL policy. To facilitate this integration of RL and numeric planning, we developed Numeric PDDLGym, an automated framework for converting numeric planning problems to Gym environments. Experimental results on standard IPC numeric domains show that RAMP significantly outperforms PPO, a well-known DRL algorithm, in terms of solvability and plan quality.
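
The abstract introduces Numeric PDDLGym only by name, so the wrapper below is an assumption-laden illustration of what converting a numeric planning problem into a Gym environment could involve, not the framework's actual API: grounded boolean propositions and numeric fluents are flattened into a fixed-size observation vector, and discrete actions index grounded operators. The `simulator` object standing in for a PDDL state and transition oracle, and the sparse goal reward, are assumptions.

```python
import numpy as np
import gymnasium as gym

class NumericPlanningEnvSketch(gym.Env):
    """Hypothetical conversion of a numeric PDDL problem to a Gym environment."""

    def __init__(self, propositions, fluents, simulator):
        self.propositions = propositions      # ordered list of grounded predicates
        self.fluents = fluents                # ordered list of numeric fluent names
        self.simulator = simulator            # assumed PDDL state/transition oracle
        n = len(propositions) + len(fluents)
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(n,))
        self.action_space = gym.spaces.Discrete(simulator.num_grounded_actions())

    def _encode(self, state):
        # Boolean propositions as 0/1 flags, numeric fluents as raw values.
        bools = [1.0 if p in state.true_props else 0.0 for p in self.propositions]
        nums = [float(state.fluent_values[f]) for f in self.fluents]
        return np.array(bools + nums, dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        state = self.simulator.initial_state()
        return self._encode(state), {}

    def step(self, action):
        state, reached_goal = self.simulator.apply(action)
        reward = 1.0 if reached_goal else 0.0   # sparse goal reward, an assumption
        return self._encode(state), reward, reached_goal, False, {}
```

A PPO-style agent could then be trained on such an environment with any Gym-compatible RL library, which is the kind of integration the paper's experiments rely on.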

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes RAMP, a hybrid framework that interleaves deep reinforcement learning (DRL) policy training, online learning of numeric action models (preconditions and effects) from interaction data, and planning with the learned model to generate trajectories that further train the policy. It introduces Numeric PDDLGym to convert IPC numeric planning problems into Gym environments and reports that this positive-feedback loop yields higher solvability and better plan quality than PPO on standard IPC numeric domains.

Significance. If the central claim holds, the work would be significant for integrating model-based planning with model-free RL in numeric domains without requiring offline expert traces. The introduction of Numeric PDDLGym is a useful engineering contribution. However, the absence of direct empirical validation of the learned models' accuracy and stability means the reported gains could arise from the hybrid data-collection schedule rather than successful model learning, limiting the strength of the contribution.

major comments (3)
  1. [Abstract / Experimental results] The central claim that RAMP outperforms PPO because of successful online numeric action-model learning is not supported by any reported metrics on model quality (prediction error on effects, precondition accuracy, or regression fit to ground-truth transitions). Only downstream solvability and plan-quality numbers are given, so it is impossible to confirm that the planner is using an accurate model rather than benefiting from altered exploration or data bias.
  2. [Approach / Algorithm description] The positive-feedback-loop description (RL gathers data for the model; planner supplies trajectories for RL) is presented without analysis of stability or divergence risk. No discussion or experiments address whether inaccurate early models produce low-quality plans that degrade the RL policy or whether the loop can be shown to converge.
  3. [Experiments] Experimental setup lacks standard details required for reproducibility and statistical claims: number of independent runs, variance or confidence intervals on solvability/plan quality, exact IPC numeric domains and problem instances used, and how the hybrid schedule (when to plan vs. act with RL) is parameterized.
minor comments (2)
  1. [Approach] The paper should clarify the precise form of the numeric action model (e.g., linear vs. non-linear effects, how continuous parameters are handled) and the model-learning algorithm employed.
  2. [Figures / Algorithm 1] Figure captions and algorithm pseudocode would benefit from explicit notation for the three interacting components (RL policy, model learner, planner) to improve readability.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important ways to strengthen the empirical support and reproducibility of our claims. We respond to each major comment below and will incorporate the suggested changes in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract / Experimental results] The central claim that RAMP outperforms PPO because of successful online numeric action-model learning is not supported by any reported metrics on model quality (prediction error on effects, precondition accuracy, or regression fit to ground-truth transitions). Only downstream solvability and plan-quality numbers are given, so it is impossible to confirm that the planner is using an accurate model rather than benefiting from altered exploration or data bias.

    Authors: We acknowledge that direct metrics on learned model accuracy (e.g., effect prediction error and precondition accuracy) would provide stronger evidence that performance gains stem from successful model learning rather than other factors in the hybrid schedule. While the consistent outperformance of RAMP over PPO across IPC numeric domains offers indirect support for the utility of the learned models, we agree this is not conclusive. In the revision we will add explicit model-quality evaluations computed on held-out transitions from the Numeric PDDLGym environments; a minimal sketch of such an evaluation appears after these responses. revision: yes

  2. Referee: [Approach / Algorithm description] The positive-feedback-loop description (RL gathers data for the model; planner supplies trajectories for RL) is presented without analysis of stability or divergence risk. No discussion or experiments address whether inaccurate early models produce low-quality plans that degrade the RL policy or whether the loop can be shown to converge.

    Authors: The referee is correct that we did not provide a formal stability analysis or targeted experiments on early-model inaccuracy. In practice the loop remained stable across the tested domains, with performance improving rather than degrading. We will add a discussion of potential divergence risks, describe the safeguards already present in the hybrid schedule (e.g., confidence-based model usage), and report observed convergence behavior from the existing runs. revision: partial

  3. Referee: [Experiments] Experimental setup lacks standard details required for reproducibility and statistical claims: number of independent runs, variance or confidence intervals on solvability/plan quality, exact IPC numeric domains and problem instances used, and how the hybrid schedule (when to plan vs. act with RL) is parameterized.

    Authors: We apologize for these omissions. The revised manuscript will explicitly state the number of independent runs, include variance or confidence intervals for all reported metrics, list the precise IPC numeric domains and problem instances, and provide a full parameterization of the hybrid schedule (including decision thresholds and frequencies for invoking planning versus the RL policy). revision: yes
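
To illustrate the model-quality evaluation promised in response 1, here is a minimal sketch of the kind of check the referee asks for, computed on held-out transitions. The `model.precondition_holds` and `model.predict_effects` interfaces are hypothetical stand-ins, not the paper's code; the two metrics (precondition accuracy and mean numeric effect error) are the substance.

```python
import numpy as np

def evaluate_model(model, held_out_transitions):
    """Hypothetical model-quality check on held-out (state, action, next_state) triples."""
    effect_errors, precond_correct = [], []
    for state, action, next_state in held_out_transitions:
        # Precondition accuracy: the learned model should admit every action
        # that the environment actually allowed to execute.
        precond_correct.append(model.precondition_holds(state, action))

        # Effect error: mean absolute error between predicted and observed
        # numeric fluent values after applying the action.
        predicted = model.predict_effects(state, action)   # assumed dict: fluent -> value
        observed = next_state.fluent_values
        errs = [abs(predicted[f] - observed[f]) for f in predicted]
        effect_errors.append(np.mean(errs) if errs else 0.0)

    return {
        "precondition_accuracy": float(np.mean(precond_correct)),
        "mean_effect_error": float(np.mean(effect_errors)),
    }
```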

Circularity Check

0 steps flagged

No significant circularity; empirical hybrid method with external benchmarks

full rationale

The paper presents RAMP as an empirical integration of DRL policy training, online numeric action model learning from environment interactions, and planning in a positive feedback loop. No mathematical derivations, equations, or parameter fittings are described that reduce predictions to inputs by construction. Claims of outperformance rest on experimental comparisons to PPO on standard IPC numeric domains, which serve as independent external benchmarks rather than self-referential fits. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The feedback loop is a design choice whose effectiveness is tested empirically, not assumed tautologically. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Relies on standard assumptions from RL and automated planning; no free parameters or invented entities beyond the new framework are described in the abstract.

axioms (2)
  • domain assumption: Environment interactions yield data sufficient to learn usable numeric action models.
    Core premise enabling the feedback loop between RL and planning.
  • domain assumption: A planner can generate useful training signals for the RL policy when given an approximate action model.
    Required for the positive feedback loop to function.
invented entities (1)
  • Numeric PDDLGym: no independent evidence
    purpose: Automated conversion of numeric planning problems into Gym environments for RL training.
    New software framework developed to support the RAMP integration.

pith-pipeline@v0.9.0 · 5487 in / 1260 out tokens · 55472 ms · 2026-05-10T17:01:58.021732+00:00 · methodology

discussion (0)

