Provable Accelerated Bayesian Optimization with Knowledge Transfer

Boxin Zhao; Chong Liu; Haitao Lin; Mladen Kolar

arxiv: 2511.03125 · v2 · submitted 2025-11-05 · 📊 stat.ML · cs.LG

Provable Accelerated Bayesian Optimization with Knowledge Transfer

Haitao Lin , Boxin Zhao , Mladen Kolar , Chong Liu This is my paper

Pith reviewed 2026-05-18 01:47 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords Bayesian optimizationknowledge transferregret boundsdifference functionreproducing kernel Hilbert spaceinformation gainhyperparameter optimization

0 comments

The pith

Bayesian optimization on a target task can be accelerated by building uncertainty quantification directly on the difference from source tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes DeltaBO to transfer knowledge from source tasks into Bayesian optimization on a target task. It does this by modeling the difference function between source and target, and constructing uncertainty estimates only around that difference even when the two functions live in different reproducing kernel Hilbert spaces. Under mild assumptions the resulting regret scales as the square root of T times (T/N plus the information gain of the difference), which improves over standard BO whenever the source data volume N is large and the tasks are similar enough to make the difference simple. This bound is verified both theoretically and on hyperparameter tuning benchmarks plus synthetic test functions.

Core claim

DeltaBO builds a novel uncertainty-quantification approach on the difference function δ between the source and target functions, which are allowed to belong to different Reproducing Kernel Hilbert Spaces. Under mild assumptions, the regret of DeltaBO is of order O~(sqrt(T (T/N + γ_δ))), where N denotes the number of evaluations from source tasks and typically N ≫ T. In many applications, source and target tasks are similar, which implies that γ_δ can be much smaller than γ_f of the target alone.

What carries the argument

Uncertainty quantification constructed specifically on the difference function δ between source and target functions, allowing each to belong to its own RKHS.

If this is right

With abundant source evaluations N much larger than target evaluations T, the regret improves beyond the non-transfer rate O~(sqrt(T γ_f)).
When tasks are similar the information gain γ_δ of the difference becomes small, directly tightening the regret bound.
The method works even if source and target functions belong to entirely different RKHSs, removing a common restriction in prior transfer BO.
Empirical results on hyperparameter tuning and synthetic functions confirm both the regret scaling and the practical speedup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same difference-function idea could be tested in sequential decision settings outside BO, such as contextual bandits or reinforcement learning with shared dynamics.
One could define a task-similarity metric directly from the estimated γ_δ and use it to decide whether to transfer from a given source.
Extending the construction to multiple heterogeneous sources would require combining several difference functions while preserving the regret form.

Load-bearing premise

Source and target functions may live in different reproducing kernel Hilbert spaces, yet their difference must have information gain much smaller than the target function whenever the tasks are similar.

What would settle it

Run DeltaBO on a pair of deliberately dissimilar source and target tasks where the information gain of the difference is measured to be comparable to that of the target alone, then check whether the observed regret fails to improve beyond the standard non-transfer bound.

Figures

Figures reproduced from arXiv: 2511.03125 by Boxin Zhao, Chong Liu, Haitao Lin, Mladen Kolar.

**Figure 2.** Figure 2: Average regrets of all compared algorithms. [PITH_FULL_IMAGE:figures/full_fig_p024_2.png] view at source ↗

read the original abstract

We study how to accelerate Bayesian optimization (BO) on a target task by transferring historical knowledge from related source tasks. Existing work on BO with knowledge transfer either lacks theoretical guarantees or achieves the same regret as BO in the non-transfer setting, $\widetilde{O}(\sqrt{T \gamma_f})$, where $T$ is the number of evaluations of the target function and $\gamma_f$ denotes its information gain. In this paper, we propose the DeltaBO algorithm, which builds a novel uncertainty-quantification approach on the difference function $\delta$ between the source and target functions, which are allowed to belong to different Reproducing Kernel Hilbert Spaces (RKHSs). Under mild assumptions, we prove that the regret of DeltaBO is of order $\widetilde{O}(\sqrt{T (T/N + \gamma_\delta)})$, where $N$ denotes the number of evaluations from source tasks and typically $N \gg T$. In many applications, source and target tasks are similar, which implies that $\gamma_\delta$ can be much smaller than $\gamma_f$. Empirical studies on both real-world hyperparameter-tuning tasks and synthetic functions show that DeltaBO outperforms other baseline methods and also verify our theoretical claims. Our code is available on GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DeltaBO gets a cleaner regret bound for transfer BO by putting uncertainty on the difference function, which can shrink when tasks align, but the proof needs a close look.

read the letter

The main point is that DeltaBO models uncertainty directly on the difference delta between source and target functions, even when those functions sit in separate RKHSs. This leads to a regret of order O~(sqrt(T (T/N + gamma_delta))), where the T/N term accounts for limited source data and gamma_delta gets small if the tasks are similar. That is the actual advance over earlier transfer BO work, which either had no guarantees or just recovered the standard non-transfer bound.

Referee Report

0 major / 4 minor

Summary. The paper proposes DeltaBO, a Bayesian optimization method that accelerates target-task optimization by transferring knowledge from source tasks via a novel uncertainty model on the difference function δ (with source and target functions permitted to lie in distinct RKHSs). Under mild assumptions on information gain, it proves a regret bound of order O~(sqrt(T (T/N + γ_δ))), where N is the number of source evaluations (typically N ≫ T) and γ_δ is the information gain of δ; this improves on the standard non-transfer bound O~(sqrt(T γ_f)) when tasks are similar. Empirical results on hyperparameter tuning and synthetic functions are provided, along with open-source code.

Significance. If the central regret bound holds, the work is significant: it supplies the first explicit improvement in regret rate for knowledge-transfer BO by exploiting reduced complexity of the difference function, rather than merely matching the non-transfer rate. The allowance for distinct RKHSs increases modeling flexibility, the T/N term correctly reflects finite source data, and the open code supports reproducibility and verification.

minor comments (4)

Abstract and §1: the phrase 'mild assumptions' is used repeatedly but never enumerated in one place; a short bullet list of the precise conditions on kernels, information gains, and task similarity would improve readability.
§2 (Related Work): several recent transfer-BO papers (post-2022) are omitted; adding them would better situate the novelty claim.
Notation: γ_δ is introduced in the abstract and theorem statements but its precise definition (as information gain of the difference posterior) appears only later; moving the definition to §3.1 would prevent forward references.
Figure 3 and Table 2: axis labels and legend entries are slightly compressed; increasing font size or adding a short textual explanation of the plotted quantities would aid clarity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our manuscript, recognition of its significance in providing the first explicit regret improvement for knowledge-transfer BO, and recommendation for minor revision. We appreciate the acknowledgment of our modeling flexibility with distinct RKHSs, the T/N term, and the open-source code.

Circularity Check

0 steps flagged

No significant circularity in regret bound derivation

full rationale

The paper introduces DeltaBO by constructing a novel uncertainty model on the difference function δ (allowed to lie in a distinct RKHS from the target), then derives the stated regret bound directly from this modeling choice combined with standard information-gain arguments from the BO literature. The T/N term arises from finite source data and γ_δ from the reduced complexity of δ under task similarity; neither quantity is obtained by fitting to the target regret nor by renaming a self-citation. The derivation chain is therefore self-contained and does not reduce any claimed result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard RKHS assumptions for Bayesian optimization plus the new modeling of the difference function; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Mild assumptions allowing source and target functions to belong to different RKHSs with the difference having information gain gamma_delta.
Invoked to prove the regret bound as stated in the abstract.

pith-pipeline@v0.9.0 · 5749 in / 1275 out tokens · 40953 ms · 2026-05-18T01:47:45.648504+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

regret of DeltaBO is of order O~(sqrt(T (T/N + gamma_delta))) ... difference function delta between the source and target functions, which are allowed to belong to different Reproducing Kernel Hilbert Spaces
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

novel uncertainty-quantification approach built on the difference function

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Regime-Conditioned Evaluation in Multi-Context Bayesian Optimization
cs.LG 2026-05 unverdicted novelty 7.0

The Portable Regime Score PRS=(B/|A|)(1-rho) captures and predicts acquisition function performance reversals in transfer Bayesian optimization, enabling a RegimePlanner that adapts and beats fixed baselines.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 1 Pith paper

[1]

identity

Activation function (string, “identity”, “logistic”, “tanh”, or “relu”)

work page
[2]

Strength of the L2 regularization term (float,[10−6,10 −2])

work page
[3]

Initial learning rate (float,[10−6,10 −2])

work page
[4]

Maximum number of iterations (integer,[100,300])

work page
[5]

Whether to shuffle samples in each iteration (bool, True or False)

work page
[6]

Exponential decay rate for the first moment vector (float,(0,1))

work page
[7]

Exponential decay rate for the second moment vector (float,(0,1))

work page
[8]

Classification with Gradient Boosting

Maximum number of epochs without tolerance improvement (integer,[1,10]). Classification with Gradient Boosting

work page
[9]

logloss” or “exponential

Loss function (string, “logloss” or “exponential”)

work page
[10]

Learning rate (float,(0,1))

work page
[11]

Number of estimators (integer,[20,200])

work page
[12]

Fraction of samples used for fitting base learners (float,(0,1))

work page
[13]

friedman_mse

Criterion to measure split quality (string, “friedman_mse” or “squared_error”)

work page
[14]

Minimum number of samples required to split an internal node (integer,[2,10])

work page
[15]

Minimum number of samples required to be at a leaf node (integer,[1,10])

work page
[16]

Minimum weighted fraction of the total sum of weights (float,(0,0.5))

work page
[17]

Maximum depth of regression estimators (integer,[1,10])

work page
[18]

sqrt” or “log2

Number of features considered for best split (float, “sqrt” or “log2”)

work page
[19]

C.2 Synthetic Experimental Settings For Gaussian kernel functions, the lengthscale of all kernels is 0.1, with observation noise for both source (σ2

Maximum number of leaf nodes in best-first fashion (integer,[2,10]). C.2 Synthetic Experimental Settings For Gaussian kernel functions, the lengthscale of all kernels is 0.1, with observation noise for both source (σ2

work page
[20]

For Bohachevsky functions, lengthscale of source, target, and difference kernel is 1.6, 0.8, and 1.0 withσ2 0 = 0.24, σ2 = 0.06and τ 2 = 0.32

and target (σ2) being 0.01, andτ 2 = 0.32. For Bohachevsky functions, lengthscale of source, target, and difference kernel is 1.6, 0.8, and 1.0 withσ2 0 = 0.24, σ2 = 0.06and τ 2 = 0.32. In assumption-satisfied setting, σ2 0 = 0.1and σ2 = 0.01. The target GP in the baseline algorithms is modeled using a Matérn kernel with lengthscale 1.0. C.3 Additional Ex...

work page

[1] [1]

identity

Activation function (string, “identity”, “logistic”, “tanh”, or “relu”)

work page

[2] [2]

Strength of the L2 regularization term (float,[10−6,10 −2])

work page

[3] [3]

Initial learning rate (float,[10−6,10 −2])

work page

[4] [4]

Maximum number of iterations (integer,[100,300])

work page

[5] [5]

Whether to shuffle samples in each iteration (bool, True or False)

work page

[6] [6]

Exponential decay rate for the first moment vector (float,(0,1))

work page

[7] [7]

Exponential decay rate for the second moment vector (float,(0,1))

work page

[8] [8]

Classification with Gradient Boosting

Maximum number of epochs without tolerance improvement (integer,[1,10]). Classification with Gradient Boosting

work page

[9] [9]

logloss” or “exponential

Loss function (string, “logloss” or “exponential”)

work page

[10] [10]

Learning rate (float,(0,1))

work page

[11] [11]

Number of estimators (integer,[20,200])

work page

[12] [12]

Fraction of samples used for fitting base learners (float,(0,1))

work page

[13] [13]

friedman_mse

Criterion to measure split quality (string, “friedman_mse” or “squared_error”)

work page

[14] [14]

Minimum number of samples required to split an internal node (integer,[2,10])

work page

[15] [15]

Minimum number of samples required to be at a leaf node (integer,[1,10])

work page

[16] [16]

Minimum weighted fraction of the total sum of weights (float,(0,0.5))

work page

[17] [17]

Maximum depth of regression estimators (integer,[1,10])

work page

[18] [18]

sqrt” or “log2

Number of features considered for best split (float, “sqrt” or “log2”)

work page

[19] [19]

C.2 Synthetic Experimental Settings For Gaussian kernel functions, the lengthscale of all kernels is 0.1, with observation noise for both source (σ2

Maximum number of leaf nodes in best-first fashion (integer,[2,10]). C.2 Synthetic Experimental Settings For Gaussian kernel functions, the lengthscale of all kernels is 0.1, with observation noise for both source (σ2

work page

[20] [20]

For Bohachevsky functions, lengthscale of source, target, and difference kernel is 1.6, 0.8, and 1.0 withσ2 0 = 0.24, σ2 = 0.06and τ 2 = 0.32

and target (σ2) being 0.01, andτ 2 = 0.32. For Bohachevsky functions, lengthscale of source, target, and difference kernel is 1.6, 0.8, and 1.0 withσ2 0 = 0.24, σ2 = 0.06and τ 2 = 0.32. In assumption-satisfied setting, σ2 0 = 0.1and σ2 = 0.01. The target GP in the baseline algorithms is modeled using a Matérn kernel with lengthscale 1.0. C.3 Additional Ex...

work page