International Conference on Machine Learning , pages=

Principled reinforcement learning with human feedback from pairwise or k-wise comparisons , author= · 2023

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

browse 5 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Variance-aware Reward Modeling with Anchor Guidance

stat.ML · 2026-05-12 · unverdicted · novelty 7.0

Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.

Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.

Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.

Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

Standard preference learning induces spurious feature reliance via mean bias and correlation leakage, creating irreducible distribution shift vulnerabilities that tie training mitigates without degrading causal learning.

From Passive Reuse to Active Reasoning: Grounding Large Language Models for Neuro-Symbolic Experience Replay

cs.AI · 2026-05-10 · unverdicted · novelty 6.0

NSER uses zero-shot LLMs to induce behavioral rules from RL trajectories, grounds them in differentiable first-order logic, and applies the symbolic structures to dynamically reweight experience replay for better sample efficiency.

citing papers explorer

Showing 5 of 5 citing papers.

Variance-aware Reward Modeling with Anchor Guidance stat.ML · 2026-05-12 · unverdicted · none · ref 14
Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.
Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability cs.LG · 2026-05-09 · unverdicted · none · ref 26
The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.
Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback cs.LG · 2026-04-21 · unverdicted · none · ref 43
Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training cs.LG · 2026-05-11 · unverdicted · none · ref 25
Standard preference learning induces spurious feature reliance via mean bias and correlation leakage, creating irreducible distribution shift vulnerabilities that tie training mitigates without degrading causal learning.
From Passive Reuse to Active Reasoning: Grounding Large Language Models for Neuro-Symbolic Experience Replay cs.AI · 2026-05-10 · unverdicted · none · ref 26
NSER uses zero-shot LLMs to induce behavioral rules from RL trajectories, grounds them in differentiable first-order logic, and applies the symbolic structures to dynamically reweight experience replay for better sample efficiency.

International Conference on Machine Learning , pages=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer