Journal of Econometrics
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Roles: method (1)
Polarities: use method (1)
citing papers explorer
- Variance-aware Reward Modeling with Anchor Guidance
  Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.
- Simultaneous Inference for Nonlinear Time Series, a Sieve M-regression Approach
  Establishes a uniform Bahadur representation for sieve M-estimators under temporal dependence and constructs valid simultaneous confidence regions using Gaussian approximation and self-convolved bootstrap.