Active teacher selection for reward learning

Justin Svegliato; Kyle Wray; Rachel Freedman; Stuart Russell

arxiv: 2310.15288 · v3 · submitted 2023-10-23 · 💻 cs.AI · cs.LG

Rachel Freedman, Justin Svegliato, Kyle Wray, Stuart Russell

Pith reviewed 2026-05-24 05:52 UTC · model grok-4.3

keywords reward learning · active teacher selection · Hidden Utility Bandit · human feedback · multi-teacher learning · paper recommendation · vaccine testing

The pith

The Hidden Utility Bandit framework models heterogeneous teachers so that a reward learner can actively select which teacher to query and when.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reward learning systems typically assume feedback comes from a single human teacher, but real applications gather feedback from large, heterogeneous populations. The paper proposes the Hidden Utility Bandit framework to capture differences in teacher rationality, expertise, and costliness. It develops Active Teacher Selection algorithms that choose which teacher to query and when, and reports that they outperform baselines in paper recommendation and COVID-19 vaccine testing domains. A sympathetic reader would care because this addresses a core limitation in scaling reward learning to diverse feedback sources.

Core claim

The paper claims that by formalizing the teacher selection problem with the Hidden Utility Bandit framework, which models variations among teachers, one can develop Active Teacher Selection algorithms that actively decide when and which teacher to query, leading to better performance than baselines in two real-world domains.

What carries the argument

The Hidden Utility Bandit (HUB) model, which treats teacher differences as hidden states in a bandit problem to allow tractable computation of optimal query policies.
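To make the idea of hidden utilities concrete, here is a minimal Python sketch of how feedback from teachers of differing reliability could update a belief over candidate utility functions. It assumes a Boltzmann-rational teacher answering pairwise comparisons, a common choice in reward learning that this page does not confirm as the paper's exact feedback model. The item utilities (8 and 2) come from the Figure 1 caption; the function names, the two-hypothesis setup, and the `beta` values are hypothetical.

```python
import math


def prefer_prob(u_first: float, u_second: float, beta: float) -> float:
    """Probability that a Boltzmann-rational teacher prefers the first item.

    ASSUMPTION: the Boltzmann form is illustrative, not taken from this page.
    """
    return 1.0 / (1.0 + math.exp(-beta * (u_first - u_second)))


def update_belief(belief, hypotheses, query, answer_first, beta):
    """Bayes update of a belief over candidate utility functions.

    belief:       prior probabilities, one per hypothesis
    hypotheses:   dicts mapping item name -> utility
    query:        (item_a, item_b) pair shown to the teacher
    answer_first: True if the teacher said item_a is better
    beta:         teacher rationality (higher means less noisy)
    """
    a, b = query
    unnormalized = []
    for prior, utility in zip(belief, hypotheses):
        likelihood = prefer_prob(utility[a], utility[b], beta)
        unnormalized.append(prior * (likelihood if answer_first else 1.0 - likelihood))
    total = sum(unnormalized)
    return [p / total for p in unnormalized]


# Two hypothetical candidates: the Figure 1 utilities and their swap.
hypotheses = [{"apple": 8, "banana": 2}, {"apple": 2, "banana": 8}]
prior = [0.5, 0.5]

# Both teachers answer "apple is better"; only their rationality differs.
expert = update_belief(prior, hypotheses, ("apple", "banana"), True, beta=1.0)
novice = update_belief(prior, hypotheses, ("apple", "banana"), True, beta=0.1)
```

Under these assumptions the same answer moves the belief much further when it comes from the more rational teacher, which is the kind of difference a teacher-selection policy can exploit.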

If this is right

Active Teacher Selection outperforms standard approaches that do not model teacher heterogeneity.
The framework enables applications to domains with complex trade-offs such as paper recommendation systems and vaccine testing.
Modeling teacher rationality and cost allows better balancing of reward learning progress against query expenses.
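The trade-off in the last point can be sketched as a one-step heuristic: estimate how many bits of uncertainty a teacher's answer is expected to remove, subtract that teacher's query cost, and query only if the result is positive. This is a myopic illustration under an assumed Boltzmann-rational teacher model, not the paper's ATS algorithm; `pick_teacher`, `bits_value`, and all numbers are hypothetical.

```python
import math


def entropy(belief):
    """Shannon entropy of a discrete belief, in bits."""
    return -sum(p * math.log2(p) for p in belief if p > 0)


def prefer_prob(u_a, u_b, beta):
    """ASSUMED Boltzmann-rational probability of preferring item a over b."""
    return 1.0 / (1.0 + math.exp(-beta * (u_a - u_b)))


def expected_info_gain(belief, hypotheses, query, beta):
    """Expected entropy reduction from one pairwise answer."""
    a, b = query
    likes = [prefer_prob(u[a], u[b], beta) for u in hypotheses]
    gain = entropy(belief)
    for answer_first in (True, False):
        joint = [p * (l if answer_first else 1.0 - l) for p, l in zip(belief, likes)]
        p_answer = sum(joint)
        if p_answer > 0:
            posterior = [j / p_answer for j in joint]
            gain -= p_answer * entropy(posterior)
    return gain


def pick_teacher(belief, hypotheses, query, teachers, bits_value=1.0):
    """Return the teacher with the best positive net score, or None.

    teachers:   dict mapping name -> (beta, cost)
    bits_value: hypothetical exchange rate from bits of information to utility
    """
    best_name, best_score = None, 0.0
    for name, (beta, cost) in teachers.items():
        gain = expected_info_gain(belief, hypotheses, query, beta)
        score = bits_value * gain - cost
        if score > best_score:
            best_name, best_score = name, score
    return best_name


hypotheses = [{"apple": 8, "banana": 2}, {"apple": 2, "banana": 8}]
belief = [0.5, 0.5]
query = ("apple", "banana")

cheap_expert = pick_teacher(belief, hypotheses, query,
                            {"expert": (1.0, 0.5), "novice": (0.1, 0.01)})
pricey_expert = pick_teacher(belief, hypotheses, query,
                             {"expert": (1.0, 0.96), "novice": (0.1, 0.01)})
too_costly = pick_teacher(belief, hypotheses, query,
                          {"expert": (1.0, 2.0), "novice": (0.1, 2.0)})
```

A planner that looks more than one step ahead would also weigh what the agent gives up by not pulling an arm, which this sketch deliberately ignores.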

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending this to online settings with streaming teachers could further improve adaptability in dynamic environments.
Integration with other active learning techniques might amplify the benefits in high-dimensional reward spaces.
The approach suggests that ignoring teacher variation wastes resources on suboptimal feedback sources.

Load-bearing premise

Differences in teacher rationality, expertise, and costliness can be captured by a Hidden Utility Bandit model that permits tractable optimal selection policies.

What would settle it

Running the Active Teacher Selection algorithm against baselines in the paper recommendation domain and finding no statistically significant improvement in learned rewards or efficiency would falsify the claim of outperformance.
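One way such a check could be run is a paired sign-flip permutation test over per-problem returns. The sketch below is a generic statistical recipe, not a procedure described in the paper, and the return values are hypothetical placeholders rather than results from any experiment.

```python
import random


def paired_permutation_test(a, b, n_resamples=10000, seed=0):
    """One-sided p-value for the hypothesis mean(a - b) > 0.

    Under the null, each paired difference is equally likely to have
    either sign, so we flip signs at random and count how often the
    resampled mean is at least as large as the observed mean.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if flipped >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# HYPOTHETICAL placeholder returns on 20 problems (not data from the paper).
hypothetical_baseline = [10 + 0.1 * i for i in range(20)]
hypothetical_ats = [r + 1.0 + 0.05 * ((i % 3) - 1)
                    for i, r in enumerate(hypothetical_baseline)]

p_value = paired_permutation_test(hypothetical_ats, hypothetical_baseline)
p_null = paired_permutation_test(hypothetical_baseline, hypothetical_baseline)
```

A large p-value on real paired results would be the "no statistically significant improvement" outcome described above.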

Figures

Figures reproduced from arXiv: 2310.15288 by Justin Svegliato, Kyle Wray, Rachel Freedman, Stuart Russell.

**Figure 1.** A simple Hidden Utility Bandit (HUB) with two arms and two teachers. The agent pulls the first arm, observes an apple, and receives the apple’s utility of 8 without observing it. The agent then pulls the second arm, observes a banana, and receives the banana’s utility of 2 without observing it. Because these utilities are hidden, the agent foregoes the opportunity for utility on the third timestep to ask t…

**Figure 2.** Paper recommendation as a HUB problem. Paper categories (Application, Benchmark, Theory) are items (…

**Figure 3.** Comparison of ATS, naive and random algorithms. ATS best maximizes discounted reward (a) and identifies the highest-reward arm…

**Figure 4.** Accuracy of reward learning using ATS (with specific and general teacher selection) and naive algorithms (with exploration…

**Figure 5.** Mean action frequencies for various algorithms.

**Figure 6.** COVID-19 vaccine testing as a HUB problem. Symptoms (None, Cough, Fever) are items (…

**Figure 7.** Performance of ATS with specific and general teacher selection. All data is averaged across 25 runs on 20 HUB problems, smoothed…

**Figure 8.** Performance of all algorithms and ATS action frequencies on the COVID-19 vaccine testing problem. Random Arms and ATS both…

**Figure 9.** Performance of ATS with various rollout policies. The best arm rollout policy outperforms the random arm and random action rollout…

**Figure 10.** ATS behavior and performance varies with teacher query costs. Data is averaged across 25 runs on 20 paper recommendation HUB…

Original abstract

Reward learning techniques enable machine learning systems to learn objectives from human feedback. A core limitation of these systems is their assumption that all feedback comes from a single human teacher, despite gathering feedback from large and heterogeneous populations. We propose the Hidden Utility Bandit (HUB) framework to model differences in teacher rationality, expertise, and costliness, formalizing the problem of learning from multiple teachers. We develop a variety of solution algorithms and apply them to two real-world domains: paper recommendation systems and COVID-19 vaccine testing. We find that Active Teacher Selection (ATS) algorithms outperform baselines by actively selecting when and which teacher to query. Our key contributions are 1) the HUB framework: a novel mathematical framework for modeling the teacher selection problem, 2) ATS: an active-learning based algorithmic approach that demonstrates the utility of modeling teacher heterogeneity, and 3) proof-of-concept application of the HUB framework and ATS approaches to model and solve multiple real-world problems with complex trade-offs between reward learning and optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a clean formalization for picking among multiple heterogeneous teachers in reward learning and reports that active selection beats baselines in two applied domains.

The main takeaway is that they model teacher differences in rationality, expertise, and cost with a Hidden Utility Bandit setup, then build active teacher selection algorithms that decide when and whom to query. This directly targets the single-teacher assumption that limits most reward learning work. They test the approach on paper recommendation and COVID-19 vaccine testing, where the active methods show gains over baselines by balancing information gain against query costs. That is the concrete contribution.

The framework is new as a bandit-style treatment of the multi-teacher problem, and the applications demonstrate that the model can capture real trade-offs without collapsing into prior single-teacher equations. The applications are a strength because they move beyond toy settings and show the framework handling complex objectives. The math appears set up for tractable policies, which is useful if the derivations hold.

The soft spots are that the abstract gives no pseudocode, no derivation steps, and no statistical details on the reported outperformance, so it is impossible to judge how sensitive the results are to parameter choices or how the baselines were implemented. The experiments read as proof-of-concept rather than a full empirical validation.

This paper is for researchers working on human feedback, active learning, or multi-source reward models. A reader already thinking about scaling feedback from populations would find the formalization and the two domains worth looking at. It deserves peer review because the core modeling step addresses a stated limitation with a workable structure and relevant applications, even if the details need checking.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes the Hidden Utility Bandit (HUB) framework to model heterogeneity among multiple teachers (differences in rationality, expertise, and costliness) in reward learning settings that traditionally assume a single teacher. It develops Active Teacher Selection (ATS) algorithms within this framework and evaluates them on two real-world domains—paper recommendation systems and COVID-19 vaccine testing—claiming that ATS approaches outperform baselines by actively deciding when and which teacher to query. Key contributions listed are the HUB modeling framework, the ATS algorithmic approach, and proof-of-concept applications demonstrating utility for problems with reward-learning/optimization trade-offs.

Significance. If the empirical and algorithmic claims hold, the work is significant because it directly addresses a practical limitation in reward learning: the unrealistic single-teacher assumption when feedback is collected from large, heterogeneous populations. The HUB framework supplies a structured way to capture teacher variation and enables tractable selection policies; the two domain applications provide initial evidence that the approach can handle real trade-offs. Explicit strengths include the introduction of a new modeling framework and the development of multiple solution algorithms rather than a single method.

minor comments (2)

[Abstract] Abstract: the phrase 'a variety of solution algorithms' is used without naming or characterizing them; a short enumeration of the main ATS variants (e.g., their optimization criteria or approximation methods) would improve immediate readability.
The manuscript would benefit from an explicit statement, early in the introduction or §2, of the precise assumptions under which the HUB model yields tractable optimal policies; this would clarify the scope of the 'tractable' claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work, recognition of the significance of addressing teacher heterogeneity in reward learning, and recommendation for minor revision. We are pleased that the contributions of the HUB framework, ATS algorithms, and domain applications were viewed favorably.

Circularity Check

0 steps flagged

No significant circularity identified

The paper introduces the Hidden Utility Bandit (HUB) framework as an independent modeling step to capture teacher heterogeneity in reward learning, then develops ATS algorithms and applies them to domains. No equations, fitted parameters, or derivations are shown that reduce performance claims or the framework itself to self-definitions, renamed inputs, or self-citation chains. The abstract and contributions treat HUB and ATS as novel constructs with external validation via real-world applications, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review; the central modeling choice is treated as a domain assumption rather than derived from prior literature.

axioms (1)

Domain assumption: Teacher feedback can be modeled via hidden utilities together with parameters for rationality, expertise, and query cost.
This premise is required for the Hidden Utility Bandit framework to formalize the teacher selection problem.
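As a worked illustration of what a rationality parameter can mean, one standard form is the Boltzmann-rational choice rule for a pairwise comparison. The page does not state which feedback model the paper uses, so treat the rule below as an assumption; the symbols $\beta$ (rationality) and $U$ (hidden utility) are introduced here for illustration, the utilities 8 and 2 are taken from the Figure 1 caption, and the $\beta$ values are hypothetical.

```latex
\[
P(i \succ j \mid \beta)
  = \frac{\exp\bigl(\beta\,U(i)\bigr)}{\exp\bigl(\beta\,U(i)\bigr) + \exp\bigl(\beta\,U(j)\bigr)}
  = \frac{1}{1 + \exp\bigl(-\beta\,(U(i) - U(j))\bigr)}
\]
% Worked values with U(apple) = 8 and U(banana) = 2 (Figure 1 caption);
% the beta values below are hypothetical.
\[
\beta = 1:\quad \frac{1}{1 + e^{-6}} \approx 0.998,
\qquad
\beta = 0.1:\quad \frac{1}{1 + e^{-0.6}} \approx 0.646
\]
```

Under this assumed rule, a teacher with $\beta = 1$ almost always names the higher-utility item, while a teacher with $\beta = 0.1$ does so only about two times in three.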

invented entities (1)

Hidden Utility Bandit (HUB) framework (no independent evidence)
purpose: To represent differences in teacher rationality, expertise, and costliness within a bandit formulation for reward learning.
New modeling construct introduced by the paper; no independent evidence supplied in the abstract.

pith-pipeline@v0.9.0 · 5700 in / 1297 out tokens · 59257 ms · 2026-05-24T05:52:52.966403+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reinforcement Learning from Human Feedback: A Statistical Perspective
stat.ML 2026-04 accept novelty 2.0

A statistical survey of RLHF for LLM alignment that connects preference learning and policy optimization to models like Bradley-Terry-Luce while reviewing methods, extensions, and open challenges.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. arXiv preprint arXiv:2204.05862.

[2] Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphael Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jac...

[3] Social choice for AI alignment: Dealing with diverse human feedback
Vincent Conitzer, Rachel Freedman, Jobst Heitzig, Wesley H Holliday, Bob M Jacobs, Nathan Lambert, Milan Mossé, Eric Pacuit, Stuart Russell, Hailey Schoelkopf, et al. arXiv preprint arXiv:2404.10271.

[4] Choice set misspecification in reward inference
Rachel Freedman, Rohin Shah, and Anca Dragan. arXiv preprint arXiv:2101.07691.

[5] On the sensitivity of reward inference to misspecified human models
Joey Hong, Kush Bhatia, and Anca Dragan. arXiv preprint arXiv:2212.04717.

[6] PEBBLE: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training
Kimin Lee, Laura Smith, and Pieter Abbeel. arXiv preprint arXiv:2106.05091, June

[7] Understanding learned reward functions
Eric J Michaud, Adam Gleave, and Stuart Russell. arXiv preprint arXiv:2012.05862.

[8] The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
Alexander Pan, Kush Bhatia, and Jacob Steinhardt. arXiv preprint arXiv:2201.03544.

[9] Misspecification in inverse reinforcement learning
Joar Skalse and Alessandro Abate. arXiv preprint arXiv:2212.03201.

[10] Defining and characterizing reward hacking
Joar Skalse, Nikolaus HR Howe, Dmitrii Krasheninnikov, and David Krueger. arXiv preprint arXiv:2209.13085.

[11] The problem with metrics is a fundamental problem for AI
Rachel Thomas and David Uminsky. arXiv preprint arXiv:2002.08512.

[12] A Further Related Work. Inverse Reinforcement Learning. Inverse reinforcement learning (IRL) is a reward learning technique in which the agent infers a reward function given behavioral samples from an optimal policy [Ng and Russell, 2000; Abbeel and Ng, 2004] or a noisy teacher [Ziebart, 2010]. It is similar to RLHF in that reward information comes from a t...

[13] Cooperative Inverse Reinforcement Learning. Cooperative inverse reinforcement learning (CIRL) extends the IRL framework to allow collaboration between the agent and the teacher [Hadfield-Menell et al., 2016; Malik et al., 2018]. HUB problems can be viewed as a specific class of CIRL games in which there are multiple teachers, but they can only act (by provi...
