pith. machine review for the scientific record.

arxiv: 2601.02850 · v2 · submitted 2026-01-06 · 💻 cs.AI

Recognition: unknown

Sample-Efficient Neurosymbolic Deep Reinforcement Learning

classification 💻 cs.AI
keywords environments · learning · training · challenging · complex · deep · during · neuro-symbolic

Reinforcement Learning (RL) is a well-established framework for sequential decision-making in complex environments. However, state-of-the-art Deep RL (DRL) algorithms typically require large training datasets and often struggle to generalize beyond small-scale training scenarios, even within standard benchmarks. We propose a neuro-symbolic DRL approach that integrates background symbolic knowledge to improve sample efficiency and generalization to more challenging, unseen tasks. Partial policies defined for simple domain instances, where high performance is easily attained, are transferred as useful priors to accelerate learning in more complex settings and avoid tuning DRL parameters from scratch. To do so, partial policies are represented as logical rules, and online reasoning is performed to guide the training process through two mechanisms: (i) biasing the action distribution during exploration, and (ii) rescaling Q-values during exploitation. This neuro-symbolic integration enhances interpretability and trustworthiness while accelerating convergence, particularly in sparse-reward environments and tasks with long planning horizons. We empirically validate our methodology on challenging variants of gridworld environments, in both fully observable and partially observable settings. We show improved performance over a state-of-the-art reward machine baseline.
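The two mechanisms named in the abstract can be illustrated with a small sketch. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: the toy rule `suggested_by_rules`, the mixing weight `bias`, the multiplier `scale`, and the shift-then-multiply rescaling are all hypothetical choices, not the authors' method.

```python
import random

LEFT, RIGHT = 0, 1  # hypothetical action ids for a 1-D corridor


def suggested_by_rules(agent_x, goal_x):
    """Toy stand-in for a logical partial policy: move toward the goal."""
    if goal_x > agent_x:
        return {RIGHT}
    if goal_x < agent_x:
        return {LEFT}
    return set()  # the rule is silent when the agent is on the goal


def biased_exploration_probs(n_actions, suggested, bias=0.8):
    """Assumed scheme: mix a uniform distribution with one concentrated
    on the rule-suggested actions; `bias` is the mass given to the rules."""
    if not suggested:
        return [1.0 / n_actions] * n_actions
    probs = [(1.0 - bias) / n_actions] * n_actions
    for a in suggested:
        probs[a] += bias / len(suggested)
    return probs


def rescale_q_values(q_values, suggested, scale=1.5):
    """Assumed scheme: shift Q-values so the minimum is 0 (keeps the
    multiplier meaningful for negative values), then boost suggested actions."""
    low = min(q_values)
    return [(q - low) * (scale if a in suggested else 1.0)
            for a, q in enumerate(q_values)]


def select_action(q_values, suggested, epsilon, rng):
    """Epsilon-greedy step using both mechanisms."""
    n = len(q_values)
    if rng.random() < epsilon:
        # (i) exploration: sample from the rule-biased distribution
        probs = biased_exploration_probs(n, suggested)
        return rng.choices(range(n), weights=probs, k=1)[0]
    # (ii) exploitation: greedy action over rescaled Q-values
    scaled = rescale_q_values(q_values, suggested)
    return max(range(n), key=lambda a: scaled[a])
```

In this sketch the rules never remove an action from consideration: unsuggested actions keep nonzero exploration probability, and a sufficiently large learned Q-value can still outweigh the boost, so an imperfect rule acts as a prior that learning can override.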

This paper has not been read by Pith yet.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Sample-efficient Neuro-symbolic Proximal Policy Optimization

    cs.AI · 2026-04 · unverdicted · novelty 6.0

    H-PPO-Product and H-PPO-SymLoss achieve faster learning and higher final returns than standard PPO and Reward Machine baselines on OfficeWorld, WaterWorld, and DoorKey by transferring imperfect logical policy specific...