Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
method 2polarities
use method 2representative citing papers
BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF experiments.
A DRL-based event-triggered controller for networked artificial pancreas systems uses blood glucose change rules to formulate control as a semi-Markov decision process, improving communication efficiency.
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
TOPPO reformulates PPO with critic balancing to address gradient ill-conditioning in multi-task RL and reports stronger mean and tail performance than SAC baselines on Meta-World+ using fewer parameters and steps.
RAFT aligns generative models by ranking samples with a reward model and fine-tuning only on the top-ranked outputs, reporting gains on reward scores and automated metrics for LLMs and diffusion models.
citing papers explorer
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing
TOPPO reformulates PPO with critic balancing to address gradient ill-conditioning in multi-task RL and reports stronger mean and tail performance than SAC baselines on Meta-World+ using fewer parameters and steps.