In offline RL, the structure of pessimism (set by dataset coverage) matters more for generalization than its amount; a symmetric overly pessimistic value function can outperform a non-symmetric mildly pessimistic one.
Contextual Markov Decision Processes
13 Pith papers cite this work. Polarity classification is still indexing.
abstract
We consider a planning problem where the dynamics and rewards of the environment depend on a hidden static parameter referred to as the context. The objective is to learn a strategy that maximizes the accumulated reward across all contexts. The new model, called Contextual Markov Decision Process (CMDP), can model a customer's behavior when interacting with a website (the learner). The customer's behavior depends on gender, age, location, device, etc. Based on that behavior, the website's objective is to determine customer characteristics and to optimize the interaction between them. Our work focuses on one basic scenario: a finite horizon with a small, known number of possible contexts. We suggest a family of algorithms with provable guarantees that learn the underlying models and the latent contexts, and optimize the CMDPs. Bounds are obtained for specific naive implementations, and extensions of the framework are discussed, laying the ground for future research.
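The setting in the abstract (a small known set of contexts, a hidden context held fixed within an episode, a finite horizon) can be illustrated with a minimal tabular sketch. This is not the paper's algorithm, only an illustration of the model; the names `TabularMDP`, `ContextualMDP`, `make_example`, and `run_episode`, the uniform draw over contexts, and the fixed start state 0 are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class TabularMDP:
    # transitions[s][a] is a list of next-state probabilities
    transitions: list
    # rewards[s][a] is the immediate reward for action a in state s
    rewards: list


class ContextualMDP:
    """Hypothetical sketch: a finite set of MDPs over shared states and
    actions, with a hidden context selecting one of them per episode."""

    def __init__(self, mdps, horizon, seed=0):
        self.mdps = mdps
        self.horizon = horizon
        self.rng = random.Random(seed)

    def reset(self):
        # Assumption: the context is drawn uniformly, then held fixed
        # (static) and never shown to the learner during the episode.
        self._context = self.rng.randrange(len(self.mdps))
        self.state = 0  # assumption: every episode starts in state 0
        self.t = 0
        return self.state

    def step(self, action):
        mdp = self.mdps[self._context]
        reward = mdp.rewards[self.state][action]
        probs = mdp.transitions[self.state][action]
        self.state = self.rng.choices(range(len(probs)), weights=probs)[0]
        self.t += 1
        done = self.t >= self.horizon  # finite horizon
        return self.state, reward, done


def make_example(seed=0):
    # Two states, two actions; action a moves deterministically to state a.
    move = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
    likes_a0 = TabularMDP(move, [[1.0, 0.0], [1.0, 0.0]])  # context 0 rewards action 0
    likes_a1 = TabularMDP(move, [[0.0, 1.0], [0.0, 1.0]])  # context 1 rewards action 1
    return ContextualMDP([likes_a0, likes_a1], horizon=5, seed=seed)


def run_episode(env, policy):
    # Roll out one episode; the policy sees only the state, not the context.
    state = env.reset()
    total, steps, done = 0.0, 0, False
    while not done:
        state, reward, done = env.step(policy(state))
        total += reward
        steps += 1
    return total, steps
```

A policy that always plays action 0 collects the full reward in one context and nothing in the other, which is why the learner has to infer the latent context from observed behavior before it can act well.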
citation-role summary
roles: background 2
citation-polarity summary
polarities: background 2
representative citing papers
Introduces signed divergence to bound generalization gaps and defines task-space complexity as the minimum source contexts needed for ε-coverage under local smoothness, with set-cover reduction and empirical validation on LQR and DRL systems.
FCGraft synthesizes code policies for embodied agents by grafting KV caches from a library of validated functions, claiming 18.31% higher success rate and 2.3x faster synthesis than prompt-level caching.
A VAE-based latent task representation enables automatic curriculum generation in CRL for non-Euclidean navigation tasks, outperforming interpolation and GAN-based methods in experiments.
Ada-Diffuser is a causal diffusion model that jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.
Extends policy graphs for decision-dependent uncertainty in MDPs and develops SDDP variants for multi-stage stochastic programs with continuous state and action spaces.
A behavior-constrained RL framework with receding-horizon credit assignment learns high-performance control policies that stay aligned with expert behavior in race car simulation.
MATE uses permutation-invariant sum-aggregated memory of transition embeddings to solve CMDPs with online adaptation and computational advantages over Transformers and RNNs.
Reinforcement learning agents can generalize better by treating context as a first-class primitive that distinguishes slow-changing external factors from fast-changing internal ones and incorporates abstract high-level descriptors.
DAC models fully decentralized cooperative MARL as a context modeling problem, using latent variables for joint policies to fix non-stationarity in value updates and relative overgeneralization in value estimation.
Contextual multi-task RL for underwater navigation uses just 1.5% of network weights for task differentiation, mostly from context-variable connections to the first hidden layer.
Hybrid RL-PID controllers track angle of attack better and show greater robustness than PID alone within a defined operational envelope for re-entry attitude control.