arXiv preprint arXiv:2310.16048 · 2023

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Variance-aware Reward Modeling with Anchor Guidance

stat.ML · 2026-05-12 · unverdicted · novelty 7.0

Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.

Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.

MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents

cs.AI · 2026-02-13 · unverdicted · novelty 5.0

MoralityGym is a new benchmark using 98 ethical dilemmas in sequential environments to evaluate hierarchical moral alignment in AI agents via Morality Chains and a Morality Metric.

Fair Agents: Balancing Multistakeholder Alignment in Multi-Agent Personalization Systems

cs.IR · 2026-05-04 · unverdicted · novelty 4.0

The authors propose a conceptual framework integrating stakeholder-LLM alignment methods, social choice-based aggregation for collective decisions, and stakeholder-centric evaluations to achieve fair multi-agent personalization.

Reinforcement Learning from Human Feedback

cs.LG · 2025-04-16 · unverdicted · novelty 2.0

The book introduces the origins, mathematical setup, and optimization stages of RLHF including reward modeling, reinforcement learning, rejection sampling, and direct alignment algorithms.

citing papers explorer

Showing 5 of 5 citing papers.

Variance-aware Reward Modeling with Anchor Guidance · stat.ML · 2026-05-12 · unverdicted · none · ref 47
Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences · cs.LG · 2026-05-08 · unverdicted · none · ref 80
MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents · cs.AI · 2026-02-13 · unverdicted · none · ref 66
Fair Agents: Balancing Multistakeholder Alignment in Multi-Agent Personalization Systems · cs.IR · 2026-05-04 · unverdicted · none · ref 21
Reinforcement Learning from Human Feedback · cs.LG · 2025-04-16 · unverdicted · none · ref 218
