Langevin soft actor-critic: Efficient exploration through uncertainty-driven critic learning. arXiv preprint arXiv:2501.17827

Haque Ishfaq, Guangyuan Wang, Sami Nur Islam, Doina Precup · arXiv 2501.17827

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

cs.LG · 2026-02-05 · unverdicted · novelty 5.0 · 2 refs

PEPO is a single-step pessimistic ensemble algorithm for direct preference optimization that provably avoids over-optimization by depending only on single-policy concentrability without knowing the data distribution or learning an explicit reward model.

An Information-Theoretic Analysis of OOD Generalization in Meta-Reinforcement Learning

cs.LG · 2025-10-27 · unverdicted · novelty 5.0

The work establishes OOD generalization bounds for meta-supervised learning and meta-RL that exploit MDP structure, then analyzes a gradient-based meta-RL algorithm.

citing papers explorer

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

cs.LG · 2026-02-05 · unverdicted · none · ref 16 · 2 links

An Information-Theoretic Analysis of OOD Generalization in Meta-Reinforcement Learning

cs.LG · 2025-10-27 · unverdicted · none · ref 5
