Hypernetworks map a forcing parameter directly to policy weights in an RL framework, enabling unified stabilization of the Kuramoto-Sivashinsky equation across regimes with KAN architectures showing strongest extrapolation.
Fasttd3: Simple, fast, and capable reinforcement learning for humanoid control
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.
ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
FastDSAC enables state-of-the-art maximum entropy RL for high-dimensional humanoid control via entropy redistribution per dimension and improved continuous value estimation.
ViTacFormer learns a cross-modal visuo-tactile latent space with autoregressive tactile prediction and an easy-to-hard curriculum, then uses the representation for imitation learning that yields ~50% higher success and the first reported 11-stage, 2.5-minute autonomous dexterous tasks.
Non-uniform replay helps most when replay volume is low; high-entropy sampling remains important, and a truncated geometric distribution delivers better sample efficiency with negligible overhead.
Dyna-SAuR learns scalable safety filters and policies from an uncertainty-aware model, cutting failures by two orders of magnitude on CartPole and MuJoCo Walker tasks.
REPPO is an on-policy RL method that combines pathwise policy gradients with relative entropy constraints to achieve stable training and high sample efficiency without replay buffers.
citing papers explorer
-
Hyperfastrl: Hypernetwork-based reinforcement learning for unified control of parametric chaotic PDEs
Hypernetworks map a forcing parameter directly to policy weights in an RL framework, enabling unified stabilization of the Kuramoto-Sivashinsky equation across regimes with KAN architectures showing strongest extrapolation.
-
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.
-
ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors
ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
-
FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control
FastDSAC enables state-of-the-art maximum entropy RL for high-dimensional humanoid control via entropy redistribution per dimension and improved continuous value estimation.
-
ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation
ViTacFormer learns a cross-modal visuo-tactile latent space with autoregressive tactile prediction and an easy-to-hard curriculum, then uses the representation for imitation learning that yields ~50% higher success and the first reported 11-stage, 2.5-minute autonomous dexterous tasks.
-
When Does Non-Uniform Replay Matter in Reinforcement Learning?
Non-uniform replay helps most when replay volume is low; high-entropy sampling remains important, and a truncated geometric distribution delivers better sample efficiency with negligible overhead.
-
Dyna-Style Safety Augmented Reinforcement Learning: Staying Safe in the Face of Uncertainty
Dyna-SAuR learns scalable safety filters and policies from an uncertainty-aware model, cutting failures by two orders of magnitude on CartPole and MuJoCo Walker tasks.
-
Relative Entropy Pathwise Policy Optimization
REPPO is an on-policy RL method that combines pathwise policy gradients with relative entropy constraints to achieve stable training and high sample efficiency without replay buffers.