Deep Active Learning with Adaptive Acquisition

Fred A. Hamprecht; Manuel Haussmann; Melih Kandemir

arXiv: 1906.11471 · v1 · pith:QF6NFXXEnew · submitted 2019-06-27 · stat.ML · cs.LG

Pith reviewed 2026-05-25 14:51 UTC · model grok-4.3

keywords: active learning · acquisition function · reinforcement learning · Bayesian neural network · policy network · adaptive acquisition · model selection

The pith

A Bayesian policy network learns acquisition functions from reinforcement feedback during active learning rounds instead of fixing a heuristic beforehand.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Active learning normally requires picking an acquisition function heuristic in advance, yet no single choice proves best across datasets and its quality is only known after the labeling budget is spent. The paper reframes the acquisition function as a trainable predictor that receives reinforcement signals from the results of each labeling round. To cope with scarce labels, the system begins with a bootstrap heuristic that safely discards points where all methods agree and then trains a policy to adjust only the top-ranked candidates. Experiments on three benchmark datasets show the trained system either discovers a new superior function or automatically converges to the strongest existing heuristic for that particular data distribution.

Core claim

The acquisition function is defined as a learning predictor and trained by reinforcement feedback collected from each labeling round. As active learning is a scarce data regime, we bootstrap from a well-known heuristic that filters the bulk of data points on which all heuristics would agree, and learn a policy to warp the top portion of this ranking in the most beneficial way for the character of a specific data distribution. Our system consists of a Bayesian neural net, the predictor, a bootstrap acquisition function, a probabilistic state definition, and another Bayesian policy network that can effectively incorporate this input distribution. We observe on three benchmark data sets that our method always manages to either invent a new superior acquisition function or to adapt itself to the a priori unknown best performing heuristic for each specific data set.
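The bootstrap-then-warp idea in this claim can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: predictive entropy is assumed as the bootstrap heuristic, and `bootstrap_top_k`, `policy_pick`, the linear softmax policy, and the toy state are hypothetical stand-ins for the paper's Bayesian policy network and probabilistic state.

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of each row of class probabilities (n_points x n_classes)."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def bootstrap_top_k(probs, k):
    """Indices of the k highest-entropy pool points, most uncertain first.

    Assumption: entropy plays the role of the bootstrap heuristic that
    discards the bulk of the pool.
    """
    scores = predictive_entropy(probs)
    return np.argsort(-scores)[:k]

def policy_pick(features, weights, rng):
    """Sample one candidate from a softmax over linear policy scores.

    Hypothetical policy: a plain linear scorer, used here only to show
    where a learned re-ranking of the top candidates would plug in.
    """
    logits = features @ weights
    logits = logits - logits.max()  # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(p), p=p), p

rng = np.random.default_rng(0)
pool_probs = rng.dirichlet(np.ones(3), size=100)  # stand-in predictive probabilities
candidates = bootstrap_top_k(pool_probs, k=10)    # bootstrap keeps the top 10
state = pool_probs[candidates]                    # toy state for the policy
weights = np.zeros(3)                             # untrained policy: uniform choice
choice, p = policy_pick(state, weights, rng)
query_index = candidates[choice]                  # pool point to send for labeling
```

With zero weights the policy is uniform over the bootstrap's top candidates, so the learned component only changes which of the already shortlisted points gets queried.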

What carries the argument

A Bayesian policy network that receives a probabilistic state from the current labeled set and learns to warp the ranking produced by a bootstrap acquisition heuristic.

If this is right

Acquisition functions no longer need to be chosen before any performance feedback is available.
The same pipeline can discover dataset-specific strategies without exhaustive pre-testing of heuristics.
Model selection for the acquisition step becomes part of the active learning loop rather than a separate validation procedure.
Bootstrapping from points where heuristics agree allows learning to begin even when labeled data is extremely scarce.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reinforcement adaptation idea could be tested on other selection tasks that currently rely on fixed heuristics, such as hyperparameter search or data augmentation choice.
If the policy network generalizes across domains, active learning pipelines could be deployed with less manual tuning for new data distributions.
A natural next measurement would be to quantify how many labeling rounds are needed before the learned policy reliably surpasses the bootstrap baseline.
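The last measurement could be computed along these lines. This is only a sketch: the reliability criterion (paired mean difference across seeds exceeding `z` standard errors, and staying that way for all later rounds) and the function name `first_reliable_round` are assumptions made for illustration, and the accuracy values are toy numbers, not results from the paper.

```python
import numpy as np

def first_reliable_round(learned, baseline, z=1.96):
    """First round from which the learned policy stays ahead of the baseline.

    learned, baseline: arrays of shape (n_seeds, n_rounds) with test accuracy.
    A round counts as 'ahead' when the paired mean difference minus z
    standard errors is positive. Returns None if no such round exists.
    """
    diff = np.asarray(learned) - np.asarray(baseline)
    mean = diff.mean(axis=0)
    sem = diff.std(axis=0, ddof=1) / np.sqrt(diff.shape[0])
    ahead = mean - z * sem > 0
    for t in range(len(ahead)):
        if ahead[t:].all():
            return t
    return None

# Toy accuracies for 3 seeds and 4 labeling rounds (made-up numbers).
baseline = np.array([[0.50, 0.60, 0.70, 0.75],
                     [0.52, 0.61, 0.69, 0.76],
                     [0.49, 0.59, 0.71, 0.74]])
gain = np.array([[-0.01, 0.00, 0.03, 0.04],
                 [0.01, -0.01, 0.02, 0.05],
                 [0.00, 0.01, 0.04, 0.03]])
learned = baseline + gain
round_index = first_reliable_round(learned, baseline)
```

In this toy data the learned curve only pulls clear of the baseline in the later rounds, which is the kind of lag the measurement is meant to expose.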

Load-bearing premise

The reinforcement feedback collected after each labeling round supplies a sufficiently strong and unbiased training signal for the policy network to improve on the bootstrap heuristic despite limited data.
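To make the premise concrete, here is a minimal sketch of how one labeling round could yield a single policy update. It assumes a REINFORCE-style score-function gradient and a reward equal to the change in held-out accuracy; both are illustrative choices, not confirmed details of the paper, and the linear softmax policy is a deterministic stand-in for the Bayesian policy network.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(weights, features, action, reward, baseline, lr=0.1):
    """One score-function gradient step for a linear softmax policy."""
    p = softmax(features @ weights)
    # Gradient of log pi(action) w.r.t. weights:
    # features of the chosen candidate minus the expected features.
    grad_log_pi = features[action] - p @ features
    return weights + lr * (reward - baseline) * grad_log_pi

features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])  # toy features for three candidates
weights = np.zeros(2)

# Hypothetical reward: held-out accuracy rose from 0.70 to 0.74
# after labeling candidate 1.
reward = 0.74 - 0.70
new_weights = reinforce_step(weights, features, action=1,
                             reward=reward, baseline=0.0)

p_before = softmax(features @ weights)[1]
p_after = softmax(features @ new_weights)[1]
```

Each round contributes one such noisy reward, which is why the strength of the signal matters: a positive reward nudges the policy toward the chosen candidate, but only by an amount proportional to a small accuracy change.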

What would settle it

Apply the method to the three reported benchmark datasets. The claim is refuted if, on at least one of them, the method neither invents a new superior acquisition function nor matches the best fixed heuristic.

Figures

Figures reproduced from arXiv: 1906.11471 by Fred A. Hamprecht, Manuel Haussmann, Melih Kandemir.

**Figure 1.** The proposed pipeline. The standard active learning pipeline is summarized as the interplay between three parts.

Original abstract

Model selection is treated as a standard performance boosting step in many machine learning applications. Once all other properties of a learning problem are fixed, the model is selected by grid search on a held-out validation set. This is strictly inapplicable to active learning. Within the standardized workflow, the acquisition function is chosen among available heuristics a priori, and its success is observed only after the labeling budget is already exhausted. More importantly, none of the earlier studies report a unique consistently successful acquisition heuristic to the extent to stand out as the unique best choice. We present a method to break this vicious circle by defining the acquisition function as a learning predictor and training it by reinforcement feedback collected from each labeling round. As active learning is a scarce data regime, we bootstrap from a well-known heuristic that filters the bulk of data points on which all heuristics would agree, and learn a policy to warp the top portion of this ranking in the most beneficial way for the character of a specific data distribution. Our system consists of a Bayesian neural net, the predictor, a bootstrap acquisition function, a probabilistic state definition, and another Bayesian policy network that can effectively incorporate this input distribution. We observe on three benchmark data sets that our method always manages to either invent a new superior acquisition function or to adapt itself to the a priori unknown best performing heuristic for each specific data set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The core idea of training an acquisition policy via RL on labeling feedback, bootstrapped from a heuristic, is a reasonable extension but the scarce-data signal looks too weak to support the claim of reliably inventing or adapting to superior functions.

The paper's main move is to treat the acquisition function as a learnable policy trained with RL feedback from actual labeling rounds, bootstrapped from a standard heuristic to handle the bulk of the data. This is a direct attempt to make acquisition adaptive to the dataset instead of picking one heuristic upfront. What stands out is the recognition that different heuristics win on different data, and the method tries to either beat them or pick the right one automatically. The Bayesian neural net setup for the policy is a reasonable way to handle uncertainty in this setting.

The experiments are reported only at a high level: positive results on three benchmarks, but no numbers, no baselines listed, no ablations, and no mention of how many labeling rounds or how the RL signal is shaped. That makes it hard to judge if the policy learning is actually doing the work or if the bootstrap is carrying most of the load.

The stress test concern is real here. In active learning the labeled set grows slowly, so the reward signal for the policy comes from a small number of decisions. The top-ranked points after the bootstrap are exactly the ones where heuristics disagree, and those disagreements are noisy. It's not obvious that this gives enough clean signal to train a policy network reliably, especially without showing that removing the policy hurts performance.

This is aimed at active learning researchers who want to move beyond fixed heuristics. If the full paper has proper controls and statistical tests, it could be worth a review. Otherwise the claim that it 'always manages to invent a new superior function' needs stronger backing than the abstract gives. I would send it to review to see the details, but with the expectation that the scarce-data RL part will need careful scrutiny.

Referee Report

2 major / 1 minor

Summary. The paper proposes treating acquisition function selection in active learning as a learning problem solved by a Bayesian neural network predictor trained via reinforcement feedback from each labeling round. It bootstraps from a standard heuristic to filter the bulk of points and uses a Bayesian policy network to adaptively warp the top-ranked portion according to the data distribution. The central empirical claim is that on three benchmark datasets the method either invents a new superior acquisition function or adapts to the a priori unknown best heuristic for that specific dataset.

Significance. If the result holds with proper controls, the approach would address a long-standing practical difficulty in active learning: the lack of a consistently superior acquisition heuristic across datasets and the impossibility of validating the choice on held-out data before the labeling budget is spent. The combination of bootstrapping, probabilistic state representation, and policy learning from actual labeling outcomes is a concrete attempt to make acquisition adaptive rather than fixed a priori. Credit is due for framing the problem explicitly as a scarce-data reinforcement-learning task and for attempting to close the loop with real labeling feedback rather than simulated rewards.

major comments (2)

[Abstract] Abstract and experimental section: the claim that the method 'always manages to either invent a new superior acquisition function or to adapt itself to the a priori unknown best performing heuristic' rests on outcomes from three datasets, yet the manuscript supplies no ablation that isolates the contribution of the Bayesian policy network from the bootstrap heuristic alone. Without such a control it is impossible to determine whether observed gains are attributable to successful policy learning or simply to the initial ranking filter.
[Methods] The central modeling assumption (reinforcement feedback collected after each labeling round supplies a sufficiently strong and unbiased training signal for the policy network) is load-bearing for the entire adaptive claim. In a scarce-data regime the bootstrap already removes the bulk of agreed-upon points, leaving only a small top-ranked subset whose warping must be learned from limited, noisy rewards; no analysis or sensitivity experiment is presented that quantifies whether this signal is adequate to outperform the bootstrap baseline.

minor comments (1)

[Abstract] The abstract refers to 'three benchmark data sets' without naming them or reporting any quantitative metrics, variance, or statistical tests; these details belong in the main text even if space-constrained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight important gaps in experimental validation that we will address. Below we respond point-by-point to the major comments.

Referee: [Abstract] Abstract and experimental section: the claim that the method 'always manages to either invent a new superior acquisition function or to adapt itself to the a priori unknown best performing heuristic' rests on outcomes from three datasets, yet the manuscript supplies no ablation that isolates the contribution of the Bayesian policy network from the bootstrap heuristic alone. Without such a control it is impossible to determine whether observed gains are attributable to successful policy learning or simply to the initial ranking filter.

Authors: We agree that the current experiments do not isolate the policy network's contribution from the bootstrap filter. The manuscript presents the full pipeline (bootstrap + policy) and reports gains relative to standard heuristics, but does not include a direct comparison against the bootstrap alone. In the revised version we will add this ablation on the three benchmark datasets, reporting performance of the bootstrap heuristic with and without the learned warping step. This will allow readers to assess whether the adaptive component provides additional benefit beyond the initial ranking filter.
revision: yes

Referee: [Methods] The central modeling assumption (reinforcement feedback collected after each labeling round supplies a sufficiently strong and unbiased training signal for the policy network) is load-bearing for the entire adaptive claim. In a scarce-data regime the bootstrap already removes the bulk of agreed-upon points, leaving only a small top-ranked subset whose warping must be learned from limited, noisy rewards; no analysis or sensitivity experiment is presented that quantifies whether this signal is adequate to outperform the bootstrap baseline.

Authors: The referee correctly identifies that the manuscript lacks a sensitivity study on the quality and quantity of the reinforcement signal. We will add experiments that vary the number of labeling rounds used for policy updates and introduce controlled label noise to test robustness of the learned policy in the scarce-data setting. These results will be included in a new subsection of the experimental evaluation.
revision: yes
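The controlled label noise mentioned in this response could be injected along the following lines. This is a sketch only: symmetric noise (each flipped label moves to a different class chosen uniformly at random) is an assumed noise model, and `flip_labels` is a hypothetical helper, not something described in the paper or the rebuttal.

```python
import numpy as np

def flip_labels(labels, n_classes, noise_rate, rng):
    """Flip each label to a different class with probability noise_rate.

    Returns the noisy labels and a boolean mask marking flipped entries.
    """
    labels = np.asarray(labels)
    flip = rng.random(labels.shape[0]) < noise_rate
    # An offset in 1..n_classes-1, taken modulo n_classes, always lands
    # on a class different from the original one.
    offsets = rng.integers(1, n_classes, size=labels.shape[0])
    noisy = np.where(flip, (labels + offsets) % n_classes, labels)
    return noisy, flip

rng = np.random.default_rng(0)
clean = rng.integers(0, 10, size=1000)  # toy labels for a 10-class problem
noisy, flipped = flip_labels(clean, n_classes=10, noise_rate=0.2, rng=rng)
```

Running the active learning loop at several values of `noise_rate` would then show how quickly the learned policy degrades relative to the bootstrap baseline.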

Circularity Check

0 steps flagged

No significant circularity detected

The paper defines its acquisition function via a Bayesian policy network trained on external reinforcement feedback collected after each labeling round. This feedback originates from actual model performance on held-out data points and is independent of the internal definitions or bootstrap heuristic. The bootstrap component is a fixed, known heuristic that is explicitly separated from the learned warping policy; the central claim of adaptation or invention is presented as an empirical outcome on three benchmarks rather than a quantity forced by construction, self-citation, or renaming of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the effectiveness of the bootstrap heuristic for initial filtering and on the assumption that RL feedback from labeling rounds can train a superior policy in low-data settings; no free parameters or invented entities are described in the abstract.

axioms (2)

domain assumption: A well-known heuristic can reliably filter the bulk of data points on which all heuristics agree. Used as the bootstrap step before policy learning.
ad hoc to paper: Reinforcement feedback from labeling rounds supplies a usable training signal for the policy network. Central premise that allows the acquisition function to be learned rather than fixed.

