arXiv preprint arXiv:2512.21917

Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model · 2025 · cs.LG · arXiv 2512.21917

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

Policy alignment to preference data typically assumes a known link function between observed preferences and latent rewards (e.g., Bradley-Terry model / logistic link). Misspecification of this link can bias inferred rewards and misalign learned policies. We study policy alignment under an unknown and unrestricted link function. We formulate an $f$-divergence-constrained reward maximization problem and show that realizability in a policy class induces a semiparametric single-index binary choice model, where a scalar policy-induced index captures all dependence on demonstrations and the remaining preference distribution is unrestricted. Rather than impose identifiability of structural parameters of such a model and estimate them, as in econometrics, we develop methods that directly learn policies, with the reward function implicit, analyzing error to the optimal policy and allowing for unidentifiable and nonparametric indices. We prove link-agnostic convergence guarantees in terms of generic function complexity measures and validate the methods and theory empirically. Code is available at https://github.com/causalml/spo/.
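The single-index structure described in the abstract can be illustrated with a minimal sketch. It assumes the familiar KL-regularized special case, where the implicit reward is proportional to the log-ratio of the policy to a reference policy, so the index for a response pair is a difference of log-ratios. The function names, the `beta` parameter, and the two example links are hypothetical and are not taken from the paper or its code; the paper's general $f$-divergence formulation may differ.

```python
import math

def policy_index(logp_a, logp_ref_a, logp_b, logp_ref_b, beta=1.0):
    """Scalar index for a response pair (a, b) in the assumed KL special case:
    beta times the difference of policy-to-reference log-ratios."""
    return beta * ((logp_a - logp_ref_a) - (logp_b - logp_ref_b))

def logistic_link(t):
    """Bradley-Terry / logistic link: one possible choice of link."""
    return 1.0 / (1.0 + math.exp(-t))

def probit_link(t):
    """Gaussian (probit) link: another possible choice of link."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# Hypothetical log-probabilities for two responses under the policy
# and under the reference policy.
t = policy_index(logp_a=-1.0, logp_ref_a=-2.0,
                 logp_b=-3.0, logp_ref_b=-2.5, beta=0.5)

# The same index yields different preference probabilities under different
# links, which is why fixing the wrong link can bias the fit.
p_logistic = logistic_link(t)
p_probit = probit_link(t)
```

Both links depend on the data only through the scalar `t`. The setting in the abstract leaves the link itself unspecified rather than committing to either of these.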

representative citing papers

On the Blessing of Pre-training in Weak-to-Strong Generalization

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.

Causal methods for LLM development and evaluation

cs.LG · 2026-05-25 · unverdicted · novelty 4.0

Position paper mapping causal inference opportunities across the LLM development pipeline from pretraining to evaluation to address confounding and non-stationarity.

