In-Context Multi-Objective Optimization

Conor Hassan; Daolang Huang; Julien Martinelli; Samuel Kaski; Xinyu Zhang

arxiv: 2512.11114 · v2 · submitted 2025-12-11 · 💻 cs.LG · cs.AI· stat.ML

In-Context Multi-Objective Optimization

Xinyu Zhang , Conor Hassan , Julien Martinelli , Daolang Huang , Samuel Kaski This is my paper

Pith reviewed 2026-05-16 22:48 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.ML

keywords multi-objective optimizationtransformer policyin-context learningBayesian optimizationhypervolume improvementblack-box optimizationamortized inferencereinforcement learning

0 comments

The pith

A pretrained transformer proposes next designs for any multi-objective black-box problem in one forward pass without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TAMO, a transformer architecture pretrained across synthetic and real optimization tasks to serve as a universal policy for multi-objective black-box optimization. It is trained with reinforcement learning to maximize cumulative hypervolume improvement while conditioning on the full history of queries and evaluations. This setup lets the model approximate the Pareto frontier directly at test time. The result is fast proposals that match or exceed the quality of traditional methods under tight evaluation budgets, without needing to fit surrogates or tune acquisition functions for each new problem. The approach removes per-task engineering and points toward plug-and-play optimizers for expensive design tasks.

Core claim

TAMO is a fully amortized universal policy for multi-objective black-box optimization. It uses a transformer that accepts varying input and objective dimensions, pretrained via reinforcement learning to maximize cumulative hypervolume improvement over complete trajectories by conditioning on the entire query history. At test time the pretrained model generates the next design candidate with a single forward pass, without any retraining or fine-tuning on the target problem.

What carries the argument

TAMO, a transformer policy pretrained with reinforcement learning to maximize cumulative hypervolume improvement while conditioning on query history for in-context Pareto approximation.

If this is right

Eliminates surrogate model fitting and acquisition engineering for each new multi-objective task.
Cuts proposal generation time by 50-1000x while preserving or improving Pareto quality under limited evaluations.
Supports optimization loops that require parallel or real-time decisions without refitting overhead.
Handles problems with different numbers of decision variables and objectives through its architecture.
Enables transfer to new domains such as drug design or autonomous systems without task-specific retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same in-context training approach could be applied to single-objective or constrained optimization problems.
Expanding the pretraining corpus with more diverse real-world tasks might further strengthen generalization.
The policy could be combined with existing foundation models to create end-to-end design pipelines.
Deployment in time-critical settings like online control would become feasible due to the low per-step cost.

Load-bearing premise

A single pretrained transformer policy trained on synthetic and real tasks will generalize to new unseen multi-objective problems without any retraining or fine-tuning at test time.

What would settle it

Evaluating TAMO on a novel multi-objective benchmark and observing that its hypervolume after a fixed number of evaluations falls substantially below that of a standard method such as ParEGO or NSGA-II.

Figures

Figures reproduced from arXiv: 2512.11114 by Conor Hassan, Daolang Huang, Julien Martinelli, Samuel Kaski, Xinyu Zhang.

**Figure 1.** Figure 1: Comparison of multi-objective optimization workflows. (Top left) Previous methods like traditional MOBO or acquisition-only amortized BOFormer (Hung et al., 2025) are bottlenecked by a slow process of fitting a GP surrogate. (Top right) TAMO is fully amortized: a dimension-agnostic transformer policy is trained once, offline, on diverse synthetic tasks, and at deployment maps the history to the next query … view at source ↗

**Figure 2.** Figure 2: Dimension-agnostic embedder for a single observation. (II) Transformer encoder–decoder. We stack B := B1 + B2 transformer layers and split them into two phases. For the first B1 layers, the observed tokens interact. The history (or context) tokens undergo self-attention to produce Eˆh (or Eˆc ), capturing intra-set structure. The query (or target) tokens then use crossattention with the keys/values provi… view at source ↗

**Figure 3.** Figure 3: Synthetic and real-world multi-objective benchmarks: simple regret (top) and cumulative inference time (bottom) vs. oracle calls (mean ± 95% CIs over 30 runs). TAMO achieves competitive regret while cutting proposal time by 50×–1000×. with success under scalar observations (Volpp et al., 2020; Chen et al., 2022; Yang et al., 2023; Maraval et al., 2023; Song et al., 2024; Huang et al., 2024), and even bina… view at source ↗

**Figure 4.** Figure 4: Out-of-distribution evaluations. (a) Dimensionality: simple regret (top) and cumulative inference time (bottom) on tasks whose input/output dimensions are unseen at pretraining. (b) Decoupled observations: regret vs. cumulative cost when, at step t, the optimizer may observe both objectives at cost 2 (dark blue) or only one at cost 1 (cyan). Curves show means with 95% confidence intervals over 60 runs (GP-… view at source ↗

**Figure 5.** Figure 5: Effect of batch size on synthetic problems: simple regret for TAMO with q ∈ {1, 2, 5, 10}. Curves show means with 95% CIs over 30 runs. Smaller q converges fastest; larger q incurs a mild slowdown, compatible with wall-clock savings for parallel evaluations. improves over BOFormer (which amortizes only the acquisition) but trails conventional MOBO baselines in regret. Across all cases, TAMO retains orders-… view at source ↗

read the original abstract

Balancing competing objectives is omnipresent across disciplines, from drug design to autonomous systems. Multi-objective Bayesian optimization is a promising solution for such expensive, black-box problems: it fits probabilistic surrogates and selects new designs via an acquisition function that balances exploration and exploitation. In practice, it requires tailored choices of surrogate and acquisition that rarely transfer to the next problem, is myopic when multi-step planning is often required, and adds refitting overhead, particularly in parallel or time-sensitive loops. We present TAMO, a fully amortized, universal policy for multi-objective black-box optimization. TAMO uses a transformer architecture that operates across varying input and objective dimensions, enabling pretraining on diverse corpora and transfer to new problems without retraining: at test time, the pretrained model proposes the next design with a single forward pass. We pretrain the policy with reinforcement learning to maximize cumulative hypervolume improvement over full trajectories, conditioning on the entire query history to approximate the Pareto frontier. Across synthetic benchmarks and real tasks, TAMO produces fast proposals, reducing proposal time by 50-1000x versus alternatives while matching or improving Pareto quality under tight evaluation budgets. These results show that transformers can perform multi-objective optimization entirely in-context, eliminating per-task surrogate fitting and acquisition engineering, and open a path to foundation-style, plug-and-play optimizers for scientific discovery workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TAMO introduces a pretrained transformer policy for amortized multi-objective optimization that skips per-task fitting, but its generalization to truly new problems rests on unverified distribution assumptions.

read the letter

TAMO is a transformer policy trained end-to-end with RL to maximize cumulative hypervolume over full trajectories. At test time it proposes the next point in one forward pass, handling variable input and objective dimensions without refitting a surrogate or choosing an acquisition function for each new task. That removes the usual per-problem overhead in multi-objective black-box optimization. The variable-dimension architecture and the non-myopic RL objective are the concrete novelties relative to prior Bayesian optimization work. On the synthetic and real benchmarks shown, the method delivers the reported 50-1000x speed-up in proposal time while keeping Pareto quality competitive under limited evaluation budgets. That practical speed advantage is the clearest win if the numbers hold. The soft spot is generalization. The claim that a single pretrained policy transfers to unseen problems without fine-tuning depends on the test tasks lying outside the pretraining distribution in ways that matter for Pareto approximation. The experiments need to demonstrate that the reported benchmarks actually enforce such a shift rather than staying inside the training support; without that detail the speed gains could be explained by interpolation. Minor additional gaps are the lack of reported ablations on training corpus diversity and statistical significance tests on the quality comparisons. This paper is for people working on amortized or foundation-style methods for expensive optimization, especially in drug design and engineering loops. A reader who already follows transformer applications to sequential decision problems will see the most value. It deserves a serious referee because the core idea is technically coherent and the empirical direction is worth checking in detail.

Referee Report

2 major / 2 minor

Summary. The paper introduces TAMO, a transformer-based policy pretrained via reinforcement learning on diverse synthetic and real tasks to perform in-context multi-objective black-box optimization. At test time the model proposes the next design via a single forward pass without retraining or surrogate fitting, conditioning on query history to maximize cumulative hypervolume improvement and thereby approximate the Pareto frontier. Empirical claims include 50-1000x reductions in proposal time versus standard methods while matching or improving Pareto quality under tight evaluation budgets.

Significance. If the generalization results hold, the work would demonstrate that a single pretrained transformer can serve as a plug-and-play optimizer for multi-objective problems, removing per-task surrogate fitting, acquisition engineering, and refitting overhead. This amortized approach could accelerate scientific workflows in domains such as drug design and autonomous systems where rapid Pareto approximation under limited budgets is required.

major comments (2)

[§4 (Experiments)] The central generalization claim (abstract and §4) rests on performance on 'new' test tasks, yet no quantitative metrics or analysis of distribution shift (e.g., objective correlation structure, input dimensionality ranges, or noise characteristics) between the pretraining corpus and the reported benchmarks are provided. Without this, it is impossible to distinguish true out-of-distribution in-context optimization from interpolation within the training support.
[§3.2] §3.2 (RL training objective): the cumulative hypervolume improvement reward is defined over full trajectories, but the manuscript provides no ablation or sensitivity analysis on how the reward scales with varying numbers of objectives or input dimensions, nor on whether the transformer’s positional encodings and attention masks correctly handle these variable sizes during both pretraining and test-time inference.

minor comments (2)

[Figures/Tables] Figure 2 and Table 1: axis labels and legend entries for hypervolume trajectories are too small for readability; consider increasing font size and adding error bands with explicit statistical significance tests.
[§3.1] The description of the transformer architecture in §3.1 uses non-standard notation for the conditioning on history length; a short appendix table mapping symbols to tensor shapes would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript to strengthen the presentation of generalization and training details.

read point-by-point responses

Referee: [§4 (Experiments)] The central generalization claim (abstract and §4) rests on performance on 'new' test tasks, yet no quantitative metrics or analysis of distribution shift (e.g., objective correlation structure, input dimensionality ranges, or noise characteristics) between the pretraining corpus and the reported benchmarks are provided. Without this, it is impossible to distinguish true out-of-distribution in-context optimization from interpolation within the training support.

Authors: We agree that explicit quantitative metrics on distribution shift would better support the generalization claims. In the revision we will add a new subsection (or appendix) reporting statistics on objective correlation matrices, input dimension ranges, and noise characteristics for both the pretraining corpus and each test benchmark. We will also include a brief discussion of how the synthetic task generator was designed to produce diverse correlation structures and dimensionality ranges, thereby providing evidence that the reported benchmarks lie outside the bulk of the training support. revision: yes
Referee: [§3.2] §3.2 (RL training objective): the cumulative hypervolume improvement reward is defined over full trajectories, but the manuscript provides no ablation or sensitivity analysis on how the reward scales with varying numbers of objectives or input dimensions, nor on whether the transformer’s positional encodings and attention masks correctly handle these variable sizes during both pretraining and test-time inference.

Authors: We acknowledge the value of such ablations. In the revised version we will add sensitivity plots showing cumulative hypervolume improvement for 2–5 objectives and for input dimensions ranging from 5 to 50. We will also clarify in §3.2 that variable-length sequences are handled via padding to a fixed maximum length together with causal attention masks that ignore padded tokens; positional encodings are applied only to the actual query tokens. Empirical verification that these mechanisms function correctly across the tested ranges will be included in the new ablation section. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results rest on external benchmarks

full rationale

The paper presents TAMO as a pretrained transformer policy trained via RL to maximize cumulative hypervolume improvement over full trajectories, with claims supported by empirical speedups (50-1000x) and Pareto quality on synthetic and real tasks. No derivation chain reduces a prediction to its inputs by construction, no self-citation load-bears a uniqueness theorem, and no fitted parameter is renamed as a prediction. The method is validated against external benchmarks rather than internal self-consistency loops, making the central claims self-contained and falsifiable outside the training distribution.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entities

Review based solely on abstract; full details of architecture, training corpus, and RL formulation unavailable. The approach assumes standard transformer and RL components plus the novel claim that in-context conditioning suffices for generalization.

free parameters (2)

transformer architecture hyperparameters
Number of layers, attention heads, and embedding dimensions chosen during pretraining; not specified in abstract.
RL training hyperparameters
Reward scaling, discount factor, and trajectory length used to maximize cumulative hypervolume improvement; not reported.

axioms (2)

domain assumption The hypervolume improvement metric correctly captures progress toward the Pareto frontier across varying objective dimensions.
Invoked implicitly when the policy is trained to maximize cumulative hypervolume.
domain assumption A single forward pass on query history is sufficient to approximate optimal multi-step planning.
Core premise of the amortized policy replacing myopic acquisition functions.

invented entities (1)

TAMO policy no independent evidence
purpose: Universal amortized optimizer that conditions on entire query history to propose next designs.
New transformer-based policy introduced in the paper.

pith-pipeline@v0.9.0 · 5547 in / 1457 out tokens · 36418 ms · 2026-05-16T22:48:07.422704+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages

[1]

Fantasizing with dual gps in bayesian optimization and active learning.arXiv preprint arXiv:2211.01053,

Paul E Chang, Prakhar Verma, ST John, Victor Picheny, Henry Moss, and Arno Solin. Fantasizing with dual gps in bayesian optimization and active learning.arXiv preprint arXiv:2211.01053,

work page arXiv
[2]

doi: https://doi.org/10.1016/ j.patter.2023.100678

ISSN 2666-3899. doi: https://doi.org/10.1016/ j.patter.2023.100678. URL https://www.sciencedirect.com/science/article/ pii/S2666389923000016. Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, and S. M. Ali Eslami. Conditional Neural Processes. InInternational Conference on Machi...

work page arXiv 2023
[3]

Aline: Joint amortization for bayesian inference and active data acquisition.arXiv preprint arXiv:2506.07259,

Daolang Huang, Xinyi Wen, Ayush Bharti, Samuel Kaski, and Luigi Acerbi. Aline: Joint amortization for bayesian inference and active data acquisition.arXiv preprint arXiv:2506.07259,

work page arXiv
[4]

Reinforced in-context black-box optimization.arXiv preprint arXiv:2402.17423, 2024a

Lei Song, Chenxiao Gao, Ke Xue, Chenyang Wu, Dong Li, Jianye Hao, Zongzhang Zhang, and Chao Qian. Reinforced in-context black-box optimization.arXiv preprint arXiv:2402.17423,

work page arXiv
[5]

Mongoose: Path-wise smooth bayesian optimisation via meta-learning.arXiv preprint arXiv:2302.11533,

Adam X Yang, Laurence Aitchison, and Henry B Moss. Mongoose: Path-wise smooth bayesian optimisation via meta-learning.arXiv preprint arXiv:2302.11533,

work page arXiv
[6]

22:end if 23:end for Algorithm S2TAMOTest-Time Algorithm Require: Pre-trainedTAMOmodel, new task τtest, query budget T , initial history set Dh 0 :={x h, yh} (with random samples if empty), 1:D h ← D h 0 ▷Initialize the history set 2:P ← {y h}▷Initialize the Pareto set 3:fort= 1, . . . , Tdo 4:x t ∼π θ(· | D h, t, T)▷Sample the next query location 5:y t ←...

work page 2020
[7]

Mean with 95% confidence intervals computed across 30 runs with random initial observations

17 E Additional Experiments 0 25 50 75 100 0.0 0.2 0.4 Simple Regret GP-DX2-DY1 0 10 20 30 Forrester 0 10 20 30 Branin 0 25 50 75 100 EggHolder 0 25 50 75 100 Oracle Calls 10 1 100 101 102 Cumulative Inference Time (s) 0 10 20 30 Oracle Calls 0 10 20 30 Oracle Calls 0 25 50 75 100 Oracle Calls TAMO Random qEI Figure S2: Simple regret and inference time on...

work page 2048

[1] [1]

Fantasizing with dual gps in bayesian optimization and active learning.arXiv preprint arXiv:2211.01053,

Paul E Chang, Prakhar Verma, ST John, Victor Picheny, Henry Moss, and Arno Solin. Fantasizing with dual gps in bayesian optimization and active learning.arXiv preprint arXiv:2211.01053,

work page arXiv

[2] [2]

doi: https://doi.org/10.1016/ j.patter.2023.100678

ISSN 2666-3899. doi: https://doi.org/10.1016/ j.patter.2023.100678. URL https://www.sciencedirect.com/science/article/ pii/S2666389923000016. Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, and S. M. Ali Eslami. Conditional Neural Processes. InInternational Conference on Machi...

work page arXiv 2023

[3] [3]

Aline: Joint amortization for bayesian inference and active data acquisition.arXiv preprint arXiv:2506.07259,

Daolang Huang, Xinyi Wen, Ayush Bharti, Samuel Kaski, and Luigi Acerbi. Aline: Joint amortization for bayesian inference and active data acquisition.arXiv preprint arXiv:2506.07259,

work page arXiv

[4] [4]

Reinforced in-context black-box optimization.arXiv preprint arXiv:2402.17423, 2024a

Lei Song, Chenxiao Gao, Ke Xue, Chenyang Wu, Dong Li, Jianye Hao, Zongzhang Zhang, and Chao Qian. Reinforced in-context black-box optimization.arXiv preprint arXiv:2402.17423,

work page arXiv

[5] [5]

Mongoose: Path-wise smooth bayesian optimisation via meta-learning.arXiv preprint arXiv:2302.11533,

Adam X Yang, Laurence Aitchison, and Henry B Moss. Mongoose: Path-wise smooth bayesian optimisation via meta-learning.arXiv preprint arXiv:2302.11533,

work page arXiv

[6] [6]

22:end if 23:end for Algorithm S2TAMOTest-Time Algorithm Require: Pre-trainedTAMOmodel, new task τtest, query budget T , initial history set Dh 0 :={x h, yh} (with random samples if empty), 1:D h ← D h 0 ▷Initialize the history set 2:P ← {y h}▷Initialize the Pareto set 3:fort= 1, . . . , Tdo 4:x t ∼π θ(· | D h, t, T)▷Sample the next query location 5:y t ←...

work page 2020

[7] [7]

Mean with 95% confidence intervals computed across 30 runs with random initial observations

17 E Additional Experiments 0 25 50 75 100 0.0 0.2 0.4 Simple Regret GP-DX2-DY1 0 10 20 30 Forrester 0 10 20 30 Branin 0 25 50 75 100 EggHolder 0 25 50 75 100 Oracle Calls 10 1 100 101 102 Cumulative Inference Time (s) 0 10 20 30 Oracle Calls 0 10 20 30 Oracle Calls 0 25 50 75 100 Oracle Calls TAMO Random qEI Figure S2: Simple regret and inference time on...

work page 2048