SURF: Steering the Scalarization Weight to Uniformly Traverse the Pareto Front

Chentong Huang; Lisha Chen; Liuyuan Jiang

arxiv: 2605.20619 · v1 · pith:TZDM3KSQnew · submitted 2026-05-20 · 💻 cs.LG · math.OC· stat.ML

SURF: Steering the Scalarization Weight to Uniformly Traverse the Pareto Front

Liuyuan Jiang , Chentong Huang , Lisha Chen This is my paper

Pith reviewed 2026-05-21 06:38 UTC · model grok-4.3

classification 💻 cs.LG math.OCstat.ML

keywords multi-objective optimizationPareto frontscalarizationuniform samplingarc-lengthbanditsreinforcement learning

0 comments

The pith

Inverting the arc-length cumulative distribution function of the scalarization path produces weights that uniformly cover the Pareto front.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explains that simply sampling scalarization weights uniformly often leaves gaps in the Pareto front because the solutions move along the front at varying speeds. By measuring the arc length along this path and building its cumulative distribution function, the authors derive a way to choose weights that space solutions evenly. They introduce SURF, which for some problems like bi-objective bandits has closed-form solutions and for others rebuilds the distribution iteratively while sampling. Theory shows linear convergence to a limit set by finite samples, and tests on bandits, environments, and language model alignment confirm better uniformity than standard approaches.

Core claim

The central claim is that the scalarization path traces the Pareto front with non-uniform speed, inducing an arc-length CDF; inverting this CDF map yields a principled rule for selecting weights that produce uniform PF coverage. SURF implements this rule, deriving closed forms where possible and alternating reconstruction and sampling otherwise.

What carries the argument

The arc-length cumulative distribution function (CDF) of the scalarization path, which tracks how far along the Pareto front the solutions have traveled as the weight changes; inverting it corrects for the uneven traversal speed to select weights evenly.

If this is right

Closed-form expressions exist for the CDF map in structured problems like bi-objective bandits.
SURF converges linearly to an unavoidable finite-sampling floor under provable conditions.
Empirical results show more uniform coverage on bandits, multi-objective gymnasium environments, and multi-objective LLM alignment tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method might extend to other multi-objective algorithms that use scalarization but currently ignore path geometry.
In very high-dimensional objective spaces, repeatedly reconstructing the CDF could add significant computational overhead.
This geometric view suggests designing new scalarization functions that inherently traverse the front more uniformly.

Load-bearing premise

The scalarization path traces the Pareto front continuously enough to define a meaningful arc-length and to reconstruct its cumulative distribution function from samples.

What would settle it

Running the method on a simple bi-objective problem with a known curved Pareto front and checking whether the achieved solution spacing is significantly more uniform than uniform weight sampling would confirm or refute the benefit of the inversion rule.

Figures

Figures reproduced from arXiv: 2605.20619 by Chentong Huang, Lisha Chen, Liuyuan Jiang.

**Figure 1.** Figure 1: LS results for multiarm bandit (see Appendix F.1 for the detailed settings). Uniform weights produce uneven points on PF (stars), whereas PF-aware sampling yields a uniform placement (colored dots). Formally, let u denote the decision variable in the feasible set U ⊆ R d , MOO seeks to minimize M objectives hm : R d → R: min u∈U n h(u) := [h1(u), · · · , hM(u)] ⊤ o . (1) We consider optimizing the above … view at source ↗

**Figure 2.** Figure 2: SURF overview: bridging w and PF points via CDF Φ yields two rules. Rule 1 directly samples {wn} N n=0 when Φ is known (Section 2). Rule 2 handles the general case by iteratively approximating Φ (Algorithm 1). We answer this by proposing Sampling Uniformly along the Pareto Front (SURF; [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Comparison of approaches for obtaining diverse PF solutions, including NBI method [ [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Original parameter space U and hidden space X = g(U). Bijection g maps non-convex structures in U to a convex space X . Modern machine-learning objectives are highly non-convex in their original parameterization. Nevertheless, many problems admit an equivalent hidden representation where the objective becomes convex or strongly convex [72, 31]. Based on this viewpoint, we impose the following assumption. … view at source ↗

**Figure 5.** Figure 5: Error vs. number of pulls T in bandit. Error bars show standard deviations across 100 trials. This closed form makes the PF geometry explicit and provides a tractable testbed for the geometry behind SURF, as visualized in [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: PFs obtained by SURF and baselines on DST. DST. DST [89, 101] is a classical bi-objective RL benchmark, where the agent trades off treasure value and travel time. We evaluate SURF Algorithm 1 with 30 outer iterations and α = 0.3 [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Theoretical analysis roadmap for SURF. The solid-line blocks summarize the main results [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: DST environment. Problem Setup. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p050_8.png] view at source ↗

**Figure 9.** Figure 9: Trajectory distributions induced by policies obtained via our SURF Algorithm 1, [PITH_FULL_IMAGE:figures/full_fig_p052_9.png] view at source ↗

**Figure 10.** Figure 10: Convergence behavior of the SURF Algorithm 1 under different parameter choices. Left: [PITH_FULL_IMAGE:figures/full_fig_p053_10.png] view at source ↗

**Figure 12.** Figure 12: ∥Φt − Φ∥∞ vs. N ∈ {4, 6, . . . , 20} on DST. spectrum. Overall, the SURF Algorithm 1 method provides a more gradual progression of policies along the Pareto trade-off [PITH_FULL_IMAGE:figures/full_fig_p053_12.png] view at source ↗

read the original abstract

Scalarization is widely used in multi-objective optimization owing to its simplicity and scalability. In many applications, the goal is to generate solutions that represent diverse user preferences, ideally with uniform coverage of the Pareto front (PF). However, uniformly sampling scalarization weights usually induces non-uniform coverage of the PF. We explain this mismatch through a geometric analysis of the scalarization path. As the scalarization weight varies, the corresponding solutions trace the PF with a generally non-uniform traversal speed. This speed induces an arc-length cumulative distribution function (CDF); inverting this CDF map yields a principled rule for selecting weights that produce uniform PF coverage. Building on this insight, we propose SURF (Sampling Uniformly along the PaReto Front). For structured problems, including bi-objective bandits, we derive closed-form expressions for this CDF map and the resulting PF-aware weight sampling rule. For general problems, SURF alternates between CDF reconstruction and weight sampling. Theoretically, we show that under provable conditions, SURF converges linearly to an unavoidable finite-sampling floor. Empirically, experiments on bandits, multi-objective-gymnasium, and multi-objective LLM alignment demonstrate that SURF efficiently achieves more uniform PF coverage than baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SURF gives a geometric rule for sampling scalarization weights via arc-length CDF inversion to improve uniformity on the Pareto front, but it only uniformizes the reachable convex subset.

read the letter

The main thing here is that SURF derives a weight sampling distribution by inverting the arc-length CDF of the scalarization path so that solutions cover the Pareto front more evenly than uniform weights do. For bi-objective bandits they supply closed forms, and for general cases they alternate between reconstructing the CDF and sampling weights. They also prove linear convergence to a finite sampling floor under their conditions, and the experiments on bandits, multi-objective gym tasks, and LLM alignment show clearer uniformity gains than the baselines they compare against. That part is straightforward and useful where it applies. The soft spot is reachability. Weighted-sum scalarization only hits supported points on the convex hull, so any non-convex segments of the front are never visited no matter how the weights are chosen. The arc-length CDF can only be inverted over the part the path actually traces. The paper's provable conditions do not appear to secure coverage of the full front, and it is unclear whether the tested problems simply have convex fronts or whether the authors intend the method only for convex cases. If the latter, that scope should be stated earlier and more explicitly. The alternating reconstruction step for general problems also looks workable but adds some overhead that the experiments do not fully quantify. This paper is for researchers who already use scalarization in multi-objective ML settings and want a more principled default for weight selection. Readers working on bandits or preference alignment will see the most direct value. It has a focused technical idea, concrete derivations in special cases, and supporting experiments, so it deserves a serious referee rather than a desk reject. The review can clarify the convexity assumption and ask for tighter statements on when the uniformity guarantee holds.

Referee Report

2 major / 2 minor

Summary. The paper claims that uniformly sampling scalarization weights in multi-objective optimization produces non-uniform Pareto front (PF) coverage because the scalarization path traverses the PF at varying speeds; inverting the arc-length CDF of this path yields a weight-sampling rule (SURF) that achieves uniform coverage. For structured cases such as bi-objective bandits the authors derive closed-form CDF maps and sampling rules; for general problems SURF alternates between CDF reconstruction and weight sampling. They prove linear convergence to a finite-sampling floor under stated conditions and report empirical improvements over baselines on bandits, multi-objective gymnasium environments, and LLM alignment tasks.

Significance. If the geometric analysis and reachability assumptions hold, the work supplies a principled, geometry-driven method for uniform PF sampling that improves on ad-hoc weight selection. The closed-form derivations for bandits and the linear-convergence guarantee are concrete strengths; the empirical results across three distinct domains further support practical utility for applications that require diverse, preference-representing solutions.

major comments (2)

[Abstract / geometric analysis] Abstract and geometric-analysis section: the central claim that CDF inversion produces uniform coverage of the full PF rests on the unstated premise that the scalarization path reaches every point on the PF. Standard weighted-sum scalarization recovers only supported points on the convex hull; non-convex segments are unreachable. This assumption is load-bearing for the uniform-coverage guarantee and must be explicitly stated, justified, or restricted to convex fronts.
[Theoretical analysis] Theoretical-convergence paragraph: the linear convergence result is stated to hold 'under provable conditions,' yet the precise conditions (e.g., Lipschitz continuity of the PF, strict convexity, or bounded curvature) are not enumerated. Without an explicit list or a theorem statement that isolates these conditions, it is impossible to verify applicability to the general alternating-reconstruction case.

minor comments (2)

Notation for the arc-length parameter and the CDF inversion operator should be introduced once with a clear diagram or equation reference rather than re-defined inline in multiple places.
The description of the alternating reconstruction procedure for general problems would benefit from a short pseudocode block or numbered steps to clarify the loop termination criterion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment below, providing clarifications on the scope of our claims and indicating the revisions we will make to improve the manuscript.

read point-by-point responses

Referee: [Abstract / geometric analysis] Abstract and geometric-analysis section: the central claim that CDF inversion produces uniform coverage of the full PF rests on the unstated premise that the scalarization path reaches every point on the PF. Standard weighted-sum scalarization recovers only supported points on the convex hull; non-convex segments are unreachable. This assumption is load-bearing for the uniform-coverage guarantee and must be explicitly stated, justified, or restricted to convex fronts.

Authors: We agree that weighted-sum scalarization recovers only supported solutions on the convex hull and cannot reach non-convex segments of the Pareto front. Our geometric analysis of the scalarization path and the SURF sampling rule are developed specifically for the reachable (supported) portion of the front. We will revise the abstract and geometric-analysis section to state this premise explicitly, restrict the uniform-coverage guarantee to supported points, and note that alternative scalarizations would be required for unsupported regions. This change clarifies the scope without affecting the core technical contributions. revision: yes
Referee: [Theoretical analysis] Theoretical-convergence paragraph: the linear convergence result is stated to hold 'under provable conditions,' yet the precise conditions (e.g., Lipschitz continuity of the PF, strict convexity, or bounded curvature) are not enumerated. Without an explicit list or a theorem statement that isolates these conditions, it is impossible to verify applicability to the general alternating-reconstruction case.

Authors: We appreciate the request for greater precision. The current manuscript refers to 'provable conditions' without enumerating them in the main text. In the revision we will insert an explicit theorem statement that lists the required assumptions, including Lipschitz continuity of the Pareto-front mapping, bounded curvature ensuring a well-defined arc-length, and the technical conditions on the CDF reconstruction step for the general alternating procedure. This will make the applicability of the linear-convergence result verifiable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation rests on independent geometric analysis

full rationale

The paper derives the SURF weight-sampling rule by first performing a geometric analysis of the scalarization path's traversal speed along the PF, which induces an arc-length CDF; inverting that CDF then supplies the weight rule. This chain is presented as following from the path geometry rather than from any fitted parameter, self-definition, or prior self-citation that encodes the target result. Closed-form CDF expressions for bi-objective bandits and the alternating reconstruction procedure for general cases are derived directly from the same geometric model. Theoretical linear convergence is shown under explicitly stated provable conditions, and empirical comparisons are external to the derivation. No step reduces by construction to its own inputs, and the central claim retains independent content from the geometric premise.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the geometric characterization of the scalarization path and the invertibility of its arc-length CDF; no explicit free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption The scalarization path traces the Pareto front with a traversal speed that admits a well-defined arc-length cumulative distribution function.
This premise underpins the CDF inversion step described in the abstract.

pith-pipeline@v0.9.0 · 5752 in / 1243 out tokens · 40177 ms · 2026-05-21T06:38:48.077103+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

As the scalarization weight varies, the corresponding solutions trace the PF with a generally non-uniform traversal speed. This speed induces an arc-length cumulative distribution function (CDF); inverting this CDF map yields a principled rule for selecting weights that produce uniform PF coverage.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Under Assumption 1, ... every Pareto-optimal point can be obtained by (3) with some w ∈ Δ^M.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

117 extracted references · 117 canonical work pages · 3 internal anchors

[1]

Optimistic linear support and successor features as a basis for optimal policy transfer

Lucas Nunes Alegre, Ana Bazzan, and Bruno C Da Silva. Optimistic linear support and successor features as a basis for optimal policy transfer. InInternational conference on machine learning, pages 394–413. PMLR, 2022

work page 2022
[2]

Routledge, 2021

Eitan Altman.Constrained Markov decision processes. Routledge, 2021

work page 2021
[3]

System level synthesis.Annual Reviews in Control, 47:364–393, 2019

James Anderson, John C Doyle, Steven H Low, and Nikolai Matni. System level synthesis.Annual Reviews in Control, 47:364–393, 2019

work page 2019
[4]

Finite-time analysis of the multiarmed bandit problem

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256, 2002

work page 2002
[5]

Learning all optimal policies with multiple criteria

Leon Barrett and Srini Narayanan. Learning all optimal policies with multiple criteria. InProceedings of the 25th international conference on Machine learning, pages 41–47, 2008

work page 2008
[6]

The pareto tracer for general inequality constrained multi-objective optimization problems.Mathematical and Computational Applications, 25(4):80, 2020

Fernanda Beltrán, Oliver Cuate, and Oliver Schütze. The pareto tracer for general inequality constrained multi-objective optimization problems.Mathematical and Computational Applications, 25(4):80, 2020

work page 2020
[7]

Occupation measures for controlled markov processes: Characteriza- tion and optimality.The Annals of Probability, pages 1531–1562, 1996

Abhay G Bhatt and Vivek S Borkar. Occupation measures for controlled markov processes: Characteriza- tion and optimality.The Annals of Probability, pages 1531–1562, 1996

work page 1996
[8]

John Wiley & Sons, 2017

Patrick Billingsley.Probability and measure. John Wiley & Sons, 2017

work page 2017
[9]

The averaged hausdorff distances in multi-objective optimization: A review.Mathematics, 7(10):894, 2019

José M Bogoya, Andrés Vargas, and Oliver Schütze. The averaged hausdorff distances in multi-objective optimization: A review.Mathematics, 7(10):894, 2019

work page 2019
[10]

Tracing locally pareto- optimal points by numerical integration.SIAM Journal on Control and Optimization, 59(5):3302–3328, 2021

Matthias Bolten, Onur Tanil Doganay, Hanno Gottschalk, and Kathrin Klamroth. Tracing locally pareto- optimal points by numerical integration.SIAM Journal on Control and Optimization, 59(5):3302–3328, 2021

work page 2021
[11]

A tutorial on geometric programming.Optimization and engineering, 8(1):67–127, 2007

Stephen Boyd, Seung-Jean Kim, Lieven Vandenberghe, and Arash Hassibi. A tutorial on geometric programming.Optimization and engineering, 8(1):67–127, 2007

work page 2007
[12]

Pure exploration in multi-armed bandits problems

Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In International conference on Algorithmic learning theory, pages 23–37. Springer, 2009

work page 2009
[13]

Moving mesh generation using the parabolic monge–ampère equation

Chris J Budd and JF Williams. Moving mesh generation using the parabolic monge–ampère equation. SIAM Journal on Scientific Computing, 31(5):3438–3465, 2009

work page 2009
[14]

Courier Dover Publications, 2008

Vira Chankong and Yacov Y Haimes.Multiobjective decision making: theory and methodology. Courier Dover Publications, 2008

work page 2008
[15]

Three-way trade-off in multi-objective learning: Optimization, generalization and conflict-avoidance.J

Lisha Chen, Heshan Fernando, Yiming Ying, and Tianyi Chen. Three-way trade-off in multi-objective learning: Optimization, generalization and conflict-avoidance.J. Machine Learn. Res., 25(193):1–53, 5 2024

work page 2024
[16]

Ferero: A flexible framework for preference- guided multi-objective learning.Advances in Neural Information Processing Systems, 37:18758–18805, 2024

Lisha Chen, AFM Saif, Yanning Shen, and Tianyi Chen. Ferero: A flexible framework for preference- guided multi-objective learning.Advances in Neural Information Processing Systems, 37:18758–18805, 2024. 10

work page 2024
[17]

Efficient first-order optimization on the Pareto set for multi-objective learning under preference guidance

Lisha Chen, Quan Xiao, Ellen Hidemi Fukuda, Xinyi Chen, Kun Yuan, and Tianyi Chen. Efficient first-order optimization on the Pareto set for multi-objective learning under preference guidance. In International Conference on Machine Learning (ICML), as spotlight, 2025

work page 2025
[18]

Pareto merging: Multi-objective optimization for preference-aware model merging.arXiv preprint arXiv:2408.12105, 2024

Weiyu Chen and James Kwok. Pareto merging: Multi-objective optimization for preference-aware model merging.arXiv preprint arXiv:2408.12105, 2024

work page arXiv 2024
[19]

Gradient-based multi-objective deep learning: Algorithms, theories, applications, and beyond.arXiv preprint arXiv:2501.10945, 2025

Weiyu Chen, Xiaoyuan Zhang, Baijiong Lin, Xi Lin, Han Zhao, Qingfu Zhang, and James T Kwok. Gradient-based multi-objective deep learning: Algorithms, theories, applications, and beyond.arXiv preprint arXiv:2501.10945, 2025

work page arXiv 2025
[20]

Efficient algorithms for a class of stochastic hidden convex optimization and its applications in network revenue management.Operations Research, 73(2):704–719, 2025

Xin Chen, Niao He, Yifan Hu, and Zikun Ye. Efficient algorithms for a class of stochastic hidden convex optimization and its applications in network revenue management.Operations Research, 73(2):704–719, 2025

work page 2025
[21]

Network revenue management with online inverse batch gradient descent method.Production and Operations Management, 32(7):2123–2137, 2023

Yiwei Chen and Cong Shi. Network revenue management with online inverse batch gradient descent method.Production and Operations Management, 32(7):2123–2137, 2023

work page 2023
[22]

Multi-objective neural bandits with random scalarization

Ji Cheng, Bo Xue, Chengyu Lu, Ziqiang Cui, and Qingfu Zhang. Multi-objective neural bandits with random scalarization. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 4914–4922, 2025

work page 2025
[23]

Safe RLHF: Safe Reinforcement Learning from Human Feedback

Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback.arXiv preprint arXiv:2310.12773, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Normal-boundary intersection: A new method for generating the pareto surface in nonlinear multicriteria optimization problems.SIAM journal on optimization, 8(3):631–657, 1998

Indraneel Das and John E Dennis. Normal-boundary intersection: A new method for generating the pareto surface in nonlinear multicriteria optimization problems.SIAM journal on optimization, 8(3):631–657, 1998

work page 1998
[25]

Designing multi-objective multi-armed bandits algorithms: A study

Madalina M Drugan and Ann Nowe. Designing multi-objective multi-armed bandits algorithms: A study. InThe 2013 international joint conference on neural networks (IJCNN), pages 1–8. IEEE, 2013

work page 2013
[26]

Scalarization based pareto optimal set of arms identification algorithms

Madalina M Drugan and Ann Nowé. Scalarization based pareto optimal set of arms identification algorithms. In2014 International Joint Conference on Neural Networks (IJCNN), pages 2690–2697. IEEE, 2014

work page 2014
[27]

Efficiency of minimizing compositions of convex functions and smooth maps.Mathematical Programming, 178(1):503–558, 2019

Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of minimizing compositions of convex functions and smooth maps.Mathematical Programming, 178(1):503–558, 2019

work page 2019
[28]

Springer, 2005

Matthias Ehrgott.Multicriteria optimization. Springer, 2005

work page 2005
[29]

Trade-off analysis approach for interactive nonlinear multiobjective optimization.OR spectrum, 34(4):803–816, 2012

Petri Eskelinen and Kaisa Miettinen. Trade-off analysis approach for interactive nonlinear multiobjective optimization.OR spectrum, 34(4):803–816, 2012

work page 2012
[30]

Cambridge university press Cambridge, UK, 2002

Brian S Everitt and Anders Skrondal.The Cambridge dictionary of statistics, volume 140. Cambridge university press Cambridge, UK, 2002

work page 2002
[31]

Stochastic optimization under hidden convexity.SIAM Journal on Optimization, 35(4):2544–2571, 2025

Ilyas Fatkhullin, Niao He, and Yifan Hu. Stochastic optimization under hidden convexity.SIAM Journal on Optimization, 35(4):2544–2571, 2025

work page 2025
[32]

Alegre, Ann Nowé, Ana L

Florian Felten, Lucas N. Alegre, Ann Nowé, Ana L. C. Bazzan, El Ghazali Talbi, Grégoire Danoy, and Bruno C. da Silva. A toolkit for reliable benchmarking and research in multi-objective reinforcement learning. InProceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), 2023

work page 2023
[33]

Multi-objective reinforcement learning based on decomposition: A taxonomy and framework.Journal of Artificial Intelligence Research, 79:679–723, 2024

Florian Felten, El-Ghazali Talbi, and Grégoire Danoy. Multi-objective reinforcement learning based on decomposition: A taxonomy and framework.Journal of Artificial Intelligence Research, 79:679–723, 2024

work page 2024
[34]

Mitigating gradient bias in multi-objective learning: A provably convergent stochastic approach

Heshan Fernando, Han Shen, Miao Liu, Subhajit Chaudhury, Keerthiram Murugesan, and Tianyi Chen. Mitigating gradient bias in multi-objective learning: A provably convergent stochastic approach. InProc. Int. Conf. Learn. Representations, Kigali, Rwanda, May 2023

work page 2023
[35]

Sensitivity analysis for nonlinear programming using penalty methods.Mathematical programming, 10(1):287–311, 1976

Anthony V Fiacco. Sensitivity analysis for nonlinear programming using penalty methods.Mathematical programming, 10(1):287–311, 1976

work page 1976
[36]

Monotone piecewise cubic interpolation.SIAM Journal on Numerical Analysis, 17(2):238–246, 1980

Frederick N Fritsch and Ralph E Carlson. Monotone piecewise cubic interpolation.SIAM Journal on Numerical Analysis, 17(2):238–246, 1980. 11

work page 1980
[37]

Sequential learning of the pareto front for multi-objective bandits

Aurélien Garivier and Wouter M Koolen. Sequential learning of the pareto front for multi-objective bandits. InInternational Conference on Artificial Intelligence and Statistics, pages 3583–3591. PMLR, 2024

work page 2024
[38]

A theory of regularized markov decision processes

Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized markov decision processes. InInternational conference on machine learning, pages 2160–2169. PMLR, 2019

work page 2019
[39]

User’s guide for snopt version 7: software for large-scale nonlinear programming

Philip E Gill, Walter Murray, and Michael A Saunders. User’s guide for snopt version 7: software for large-scale nonlinear programming. Technical report, Systems Optimization Laboratory, Stanford University, 2006

work page 2006
[40]

Pearson education india, 2009

Rafael C Gonzalez.Digital image processing. Pearson education india, 2009

work page 2009
[41]

Controllable preference optimization: Toward controllable multi-objective alignment

Yiju Guo, Ganqu Cui, Lifan Yuan, Ning Ding, Zexu Sun, Bowen Sun, Huimin Chen, Ruobing Xie, Jie Zhou, Yankai Lin, et al. Controllable preference optimization: Toward controllable multi-objective alignment. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1437–1454, 2024

work page 2024
[42]

Topology of pareto sets of strongly convex problems.SIAM Journal on Optimization, 30(3):2659–2686, 2020

Naoki Hamada, Kenta Hayano, Shunsuke Ichiki, Yutaro Kabata, and Hiroshi Teramoto. Topology of pareto sets of strongly convex problems.SIAM Journal on Optimization, 30(3):2659–2686, 2020

work page 2020
[43]

Cambridge university press, 1952

Godfrey Harold Hardy, John Edensor Littlewood, and George Pólya.Inequalities. Cambridge university press, 1952

work page 1952
[44]

Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

work page 2022
[45]

Springer Science & Business Media, 2010

Weizhang Huang and Robert D Russell.Adaptive moving mesh methods, volume 174. Springer Science & Business Media, 2010

work page 2010
[46]

SIAM, 2008

Kazufumi Ito and Karl Kunisch.Lagrange multiplier approach to variational problems and applications. SIAM, 2008

work page 2008
[47]

Scalarization in multi objective optimization

Johannes Jahn. Scalarization in multi objective optimization. InMathematics of multi objective optimiza- tion, pages 45–88. Springer, 1985

work page 1985
[48]

On the complexity of best-arm identification in multi-armed bandit models.The Journal of Machine Learning Research, 17(1):1–42, 2016

Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models.The Journal of Machine Learning Research, 17(1):1–42, 2016

work page 2016
[49]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[50]

Gradient-adaptive policy optimization: Towards multi-objective alignment of large language models

Chengao Li, Hanyu Zhang, Yunkun Xu, Hongyan Xue, Xiang Ao, and Qing He. Gradient-adaptive policy optimization: Towards multi-objective alignment of large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11214–11232, 2025

work page 2025
[51]

Multi-objective large language model alignment with hierarchical experts.arXiv preprint arXiv:2505.20925, 2025

Zhuo Li, Guodong Du, Weiyang Guo, Yigeng Zhou, Xiucheng Li, Wenya Wang, Fangming Liu, Yequan Wang, Deheng Ye, Min Zhang, et al. Multi-objective large language model alignment with hierarchical experts.arXiv preprint arXiv:2505.20925, 2025

work page arXiv 2025
[52]

Parm: Multi-objective test-time alignment via preference-aware autoregressive reward model.arXiv preprint arXiv:2505.06274, 2025

Baijiong Lin, Weisen Jiang, Yuancheng Xu, Hao Chen, and Ying-Cong Chen. Parm: Multi-objective test-time alignment via preference-aware autoregressive reward model.arXiv preprint arXiv:2505.06274, 2025

work page arXiv 2025
[53]

Policy-regularized offline multi-objective reinforcement learning.arXiv preprint arXiv:2401.02244, 2024

Qian Lin, Chao Yu, Zongkai Liu, and Zifan Wu. Policy-regularized offline multi-objective reinforcement learning.arXiv preprint arXiv:2401.02244, 2024

work page arXiv 2024
[54]

Smooth Tchebycheff scalarization for multi-objective optimization.arXiv preprint arXiv:2402.19078, 2024

Xi Lin, Xiaoyuan Zhang, Zhiyuan Yang, Fei Liu, Zhenkun Wang, and Qingfu Zhang. Smooth Tchebycheff scalarization for multi-objective optimization.arXiv preprint arXiv:2402.19078, 2024

work page arXiv 2024
[55]

Pareto multi-task learning

Xi Lin, Hui-Ling Zhen, Zhenhua Li, Qing-Fu Zhang, and Sam Kwong. Pareto multi-task learning. In Proc. Advances Neural Inf. Process. Syst., Vancouver, Canada, December 2019

work page 2019
[56]

Conflict-Averse Gradient Descent for Multi-task Learning

Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-Averse Gradient Descent for Multi-task Learning. InProc. Advances Neural Inf. Process. Syst., virtual, December 2021

work page 2021
[57]

Dora: Weight-decomposed low-rank adaptation

Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. InForty-first International Conference on Machine Learning, 2024. 12

work page 2024
[58]

The stochastic multi-gradient algorithm for multi-objective opti- mization and its application to supervised machine learning.Annals of Operations Research, pages 1–30, 2021

Suyun Liu and Luis Nunes Vicente. The stochastic multi-gradient algorithm for multi-objective opti- mization and its application to supervised machine learning.Annals of Operations Research, pages 1–30, 2021

work page 2021
[59]

Singular continuation: generating piecewise linear approximations to pareto sets via global analysis.SIAM Journal on Optimization, 21(2):463–490, 2011

Alberto Lovison. Singular continuation: generating piecewise linear approximations to pareto sets via global analysis.SIAM Journal on Optimization, 21(2):463–490, 2011

work page 2011
[60]

Multi-task learning with user preferences: Gradient descent with controlled ascent in Pareto optimization

Debabrata Mahapatra and Vaibhav Rajan. Multi-task learning with user preferences: Gradient descent with controlled ascent in Pareto optimization. InProc. Int. Conf. Machine Learn., virtual, 2020

work page 2020
[61]

The weighted sum method for multi-objective optimization: new insights.Structural and multidisciplinary optimization, 41(6):853–862, 2010

R Timothy Marler and Jasbir S Arora. The weighted sum method for multi-objective optimization: new insights.Structural and multidisciplinary optimization, 41(6):853–862, 2010

work page 2010
[62]

On continuation methods for non-linear bi-objective optimization: towards a certified interval-based approach.Journal of Global Optimization, 64(1):3–16, 2016

Brice Martin, Alexandre Goldsztejn, Laurent Granvilliers, and Christophe Jermann. On continuation methods for non-linear bi-objective optimization: towards a certified interval-based approach.Journal of Global Optimization, 64(1):3–16, 2016

work page 2016
[63]

Springer US, Boston, MA, 1998

Kaisa Miettinen.Nonlinear Multiobjective Optimization, volume 12. Springer US, Boston, MA, 1998

work page 1998
[64]

A multi-objective/multi-task learning framework induced by Pareto stationarity

Michinari Momma, Chaosheng Dong, and Jia Liu. A multi-objective/multi-task learning framework induced by Pareto stationarity. InProc. Int. Conf. Machine Learn., Baltimore, MD, USA, 2022

work page 2022
[65]

Efficient memory-based learning for robot control

Andrew William Moore. Efficient memory-based learning for robot control. Technical report, University of Cambridge, 1990

work page 1990
[66]

Optimal rates of convergence for entropy regularization in discounted markov decision processes.Information and Inference: A Journal of the IMA, 15(1), 2026

Johannes Müller and Semih Cayci. Optimal rates of convergence for entropy regularization in discounted markov decision processes.Information and Inference: A Journal of the IMA, 15(1), 2026

work page 2026
[67]

Learning the pareto front with hypernet- works

Aviv Navon, Aviv Shamsian, Ethan Fetaya, and Gal Chechik. Learning the pareto front with hypernet- works. InProc. Int. Conf. Learn. Representations, virtual, April 2020

work page 2020
[68]

Cubic regularization of newton method and its global performance

Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical programming, 108(1):177–205, 2006

work page 2006
[69]

Asynchronous distributed solution of large scale nonlinear inversion problems.Applied numerical mathematics, 30(1):31–40, 1999

Victor Pereyra. Asynchronous distributed solution of large scale nonlinear inversion problems.Applied numerical mathematics, 30(1):31–40, 1999

work page 1999
[70]

Fast computation of equispaced pareto manifolds and pareto fronts for multiobjective optimization problems.Mathematics and Computers in Simulation, 79(6):1935–1947, 2009

Victor Pereyra. Fast computation of equispaced pareto manifolds and pareto fronts for multiobjective optimization problems.Mathematics and Computers in Simulation, 79(6):1935–1947, 2009

work page 1935
[71]

Equispaced pareto front construction for constrained bi-objective optimization.Mathematical and Computer Modelling, 57(9-10):2122–2131, 2013

Victor Pereyra, Michael Saunders, and José Castillo. Equispaced pareto front construction for constrained bi-objective optimization.Mathematical and Computer Modelling, 57(9-10):2122–2131, 2013

work page 2013
[72]

Functional bilevel optimization for machine learning

Ieva Petrulionyte, Julien Mairal, and Michael Arbel. Functional bilevel optimization for machine learning. Advances in Neural Information Processing Systems, 37:14016–14065, 2024

work page 2024
[73]

Traversing pareto optimal policies: Provably efficient multi-objective reinforcement learning.arXiv preprint arXiv:2407.17466, 2024

Shuang Qiu, Dake Zhang, Rui Yang, Boxiang Lyu, and Tong Zhang. Traversing pareto optimal policies: Provably efficient multi-objective reinforcement learning.arXiv preprint arXiv:2407.17466, 2024

work page arXiv 2024
[74]

Qwen2.5 technical report, 2025

Qwen Team, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao L...

work page 2025
[75]

Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

work page 2023
[76]

Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023

Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023

work page 2023
[77]

Pareto conditioned networks.arXiv preprint arXiv:2204.05036, 2022

Mathieu Reymond, Eugenio Bargiacchi, and Ann Nowé. Pareto conditioned networks.arXiv preprint arXiv:2204.05036, 2022. 13

work page arXiv 2022
[78]

Multi-objective reinforcement learning for the expected utility of the return

Diederik M Roijers, Denis Steckelmacher, and Ann Nowé. Multi-objective reinforcement learning for the expected utility of the return. InProceedings of the Adaptive and Learning Agents workshop at FAIM, volume 2018, 2018

work page 2018
[79]

A survey of multi-objective sequential decision-making.Journal of Artificial Intelligence Research, 48:67–113, 2013

Diederik M Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making.Journal of Artificial Intelligence Research, 48:67–113, 2013

work page 2013
[80]

Optimization on pareto sets: On a theory of multi-objective optimization.arXiv preprint arXiv:2308.02145, 2023

Abhishek Roy, Geelon So, and Yi-An Ma. Optimization on pareto sets: On a theory of multi-objective optimization.arXiv preprint arXiv:2308.02145, 2023

work page arXiv 2023

Showing first 80 references.

[1] [1]

Optimistic linear support and successor features as a basis for optimal policy transfer

Lucas Nunes Alegre, Ana Bazzan, and Bruno C Da Silva. Optimistic linear support and successor features as a basis for optimal policy transfer. InInternational conference on machine learning, pages 394–413. PMLR, 2022

work page 2022

[2] [2]

Routledge, 2021

Eitan Altman.Constrained Markov decision processes. Routledge, 2021

work page 2021

[3] [3]

System level synthesis.Annual Reviews in Control, 47:364–393, 2019

James Anderson, John C Doyle, Steven H Low, and Nikolai Matni. System level synthesis.Annual Reviews in Control, 47:364–393, 2019

work page 2019

[4] [4]

Finite-time analysis of the multiarmed bandit problem

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256, 2002

work page 2002

[5] [5]

Learning all optimal policies with multiple criteria

Leon Barrett and Srini Narayanan. Learning all optimal policies with multiple criteria. InProceedings of the 25th international conference on Machine learning, pages 41–47, 2008

work page 2008

[6] [6]

The pareto tracer for general inequality constrained multi-objective optimization problems.Mathematical and Computational Applications, 25(4):80, 2020

Fernanda Beltrán, Oliver Cuate, and Oliver Schütze. The pareto tracer for general inequality constrained multi-objective optimization problems.Mathematical and Computational Applications, 25(4):80, 2020

work page 2020

[7] [7]

Occupation measures for controlled markov processes: Characteriza- tion and optimality.The Annals of Probability, pages 1531–1562, 1996

Abhay G Bhatt and Vivek S Borkar. Occupation measures for controlled markov processes: Characteriza- tion and optimality.The Annals of Probability, pages 1531–1562, 1996

work page 1996

[8] [8]

John Wiley & Sons, 2017

Patrick Billingsley.Probability and measure. John Wiley & Sons, 2017

work page 2017

[9] [9]

The averaged hausdorff distances in multi-objective optimization: A review.Mathematics, 7(10):894, 2019

José M Bogoya, Andrés Vargas, and Oliver Schütze. The averaged hausdorff distances in multi-objective optimization: A review.Mathematics, 7(10):894, 2019

work page 2019

[10] [10]

Tracing locally pareto- optimal points by numerical integration.SIAM Journal on Control and Optimization, 59(5):3302–3328, 2021

Matthias Bolten, Onur Tanil Doganay, Hanno Gottschalk, and Kathrin Klamroth. Tracing locally pareto- optimal points by numerical integration.SIAM Journal on Control and Optimization, 59(5):3302–3328, 2021

work page 2021

[11] [11]

A tutorial on geometric programming.Optimization and engineering, 8(1):67–127, 2007

Stephen Boyd, Seung-Jean Kim, Lieven Vandenberghe, and Arash Hassibi. A tutorial on geometric programming.Optimization and engineering, 8(1):67–127, 2007

work page 2007

[12] [12]

Pure exploration in multi-armed bandits problems

Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In International conference on Algorithmic learning theory, pages 23–37. Springer, 2009

work page 2009

[13] [13]

Moving mesh generation using the parabolic monge–ampère equation

Chris J Budd and JF Williams. Moving mesh generation using the parabolic monge–ampère equation. SIAM Journal on Scientific Computing, 31(5):3438–3465, 2009

work page 2009

[14] [14]

Courier Dover Publications, 2008

Vira Chankong and Yacov Y Haimes.Multiobjective decision making: theory and methodology. Courier Dover Publications, 2008

work page 2008

[15] [15]

Three-way trade-off in multi-objective learning: Optimization, generalization and conflict-avoidance.J

Lisha Chen, Heshan Fernando, Yiming Ying, and Tianyi Chen. Three-way trade-off in multi-objective learning: Optimization, generalization and conflict-avoidance.J. Machine Learn. Res., 25(193):1–53, 5 2024

work page 2024

[16] [16]

Ferero: A flexible framework for preference- guided multi-objective learning.Advances in Neural Information Processing Systems, 37:18758–18805, 2024

Lisha Chen, AFM Saif, Yanning Shen, and Tianyi Chen. Ferero: A flexible framework for preference- guided multi-objective learning.Advances in Neural Information Processing Systems, 37:18758–18805, 2024. 10

work page 2024

[17] [17]

Efficient first-order optimization on the Pareto set for multi-objective learning under preference guidance

Lisha Chen, Quan Xiao, Ellen Hidemi Fukuda, Xinyi Chen, Kun Yuan, and Tianyi Chen. Efficient first-order optimization on the Pareto set for multi-objective learning under preference guidance. In International Conference on Machine Learning (ICML), as spotlight, 2025

work page 2025

[18] [18]

Pareto merging: Multi-objective optimization for preference-aware model merging.arXiv preprint arXiv:2408.12105, 2024

Weiyu Chen and James Kwok. Pareto merging: Multi-objective optimization for preference-aware model merging.arXiv preprint arXiv:2408.12105, 2024

work page arXiv 2024

[19] [19]

Gradient-based multi-objective deep learning: Algorithms, theories, applications, and beyond.arXiv preprint arXiv:2501.10945, 2025

Weiyu Chen, Xiaoyuan Zhang, Baijiong Lin, Xi Lin, Han Zhao, Qingfu Zhang, and James T Kwok. Gradient-based multi-objective deep learning: Algorithms, theories, applications, and beyond.arXiv preprint arXiv:2501.10945, 2025

work page arXiv 2025

[20] [20]

Efficient algorithms for a class of stochastic hidden convex optimization and its applications in network revenue management.Operations Research, 73(2):704–719, 2025

Xin Chen, Niao He, Yifan Hu, and Zikun Ye. Efficient algorithms for a class of stochastic hidden convex optimization and its applications in network revenue management.Operations Research, 73(2):704–719, 2025

work page 2025

[21] [21]

Network revenue management with online inverse batch gradient descent method.Production and Operations Management, 32(7):2123–2137, 2023

Yiwei Chen and Cong Shi. Network revenue management with online inverse batch gradient descent method.Production and Operations Management, 32(7):2123–2137, 2023

work page 2023

[22] [22]

Multi-objective neural bandits with random scalarization

Ji Cheng, Bo Xue, Chengyu Lu, Ziqiang Cui, and Qingfu Zhang. Multi-objective neural bandits with random scalarization. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 4914–4922, 2025

work page 2025

[23] [23]

Safe RLHF: Safe Reinforcement Learning from Human Feedback

Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback.arXiv preprint arXiv:2310.12773, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[24] [24]

Normal-boundary intersection: A new method for generating the pareto surface in nonlinear multicriteria optimization problems.SIAM journal on optimization, 8(3):631–657, 1998

Indraneel Das and John E Dennis. Normal-boundary intersection: A new method for generating the pareto surface in nonlinear multicriteria optimization problems.SIAM journal on optimization, 8(3):631–657, 1998

work page 1998

[25] [25]

Designing multi-objective multi-armed bandits algorithms: A study

Madalina M Drugan and Ann Nowe. Designing multi-objective multi-armed bandits algorithms: A study. InThe 2013 international joint conference on neural networks (IJCNN), pages 1–8. IEEE, 2013

work page 2013

[26] [26]

Scalarization based pareto optimal set of arms identification algorithms

Madalina M Drugan and Ann Nowé. Scalarization based pareto optimal set of arms identification algorithms. In2014 International Joint Conference on Neural Networks (IJCNN), pages 2690–2697. IEEE, 2014

work page 2014

[27] [27]

Efficiency of minimizing compositions of convex functions and smooth maps.Mathematical Programming, 178(1):503–558, 2019

Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of minimizing compositions of convex functions and smooth maps.Mathematical Programming, 178(1):503–558, 2019

work page 2019

[28] [28]

Springer, 2005

Matthias Ehrgott.Multicriteria optimization. Springer, 2005

work page 2005

[29] [29]

Trade-off analysis approach for interactive nonlinear multiobjective optimization.OR spectrum, 34(4):803–816, 2012

Petri Eskelinen and Kaisa Miettinen. Trade-off analysis approach for interactive nonlinear multiobjective optimization.OR spectrum, 34(4):803–816, 2012

work page 2012

[30] [30]

Cambridge university press Cambridge, UK, 2002

Brian S Everitt and Anders Skrondal.The Cambridge dictionary of statistics, volume 140. Cambridge university press Cambridge, UK, 2002

work page 2002

[31] [31]

Stochastic optimization under hidden convexity.SIAM Journal on Optimization, 35(4):2544–2571, 2025

Ilyas Fatkhullin, Niao He, and Yifan Hu. Stochastic optimization under hidden convexity.SIAM Journal on Optimization, 35(4):2544–2571, 2025

work page 2025

[32] [32]

Alegre, Ann Nowé, Ana L

Florian Felten, Lucas N. Alegre, Ann Nowé, Ana L. C. Bazzan, El Ghazali Talbi, Grégoire Danoy, and Bruno C. da Silva. A toolkit for reliable benchmarking and research in multi-objective reinforcement learning. InProceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), 2023

work page 2023

[33] [33]

Multi-objective reinforcement learning based on decomposition: A taxonomy and framework.Journal of Artificial Intelligence Research, 79:679–723, 2024

Florian Felten, El-Ghazali Talbi, and Grégoire Danoy. Multi-objective reinforcement learning based on decomposition: A taxonomy and framework.Journal of Artificial Intelligence Research, 79:679–723, 2024

work page 2024

[34] [34]

Mitigating gradient bias in multi-objective learning: A provably convergent stochastic approach

Heshan Fernando, Han Shen, Miao Liu, Subhajit Chaudhury, Keerthiram Murugesan, and Tianyi Chen. Mitigating gradient bias in multi-objective learning: A provably convergent stochastic approach. InProc. Int. Conf. Learn. Representations, Kigali, Rwanda, May 2023

work page 2023

[35] [35]

Sensitivity analysis for nonlinear programming using penalty methods.Mathematical programming, 10(1):287–311, 1976

Anthony V Fiacco. Sensitivity analysis for nonlinear programming using penalty methods.Mathematical programming, 10(1):287–311, 1976

work page 1976

[36] [36]

Monotone piecewise cubic interpolation.SIAM Journal on Numerical Analysis, 17(2):238–246, 1980

Frederick N Fritsch and Ralph E Carlson. Monotone piecewise cubic interpolation.SIAM Journal on Numerical Analysis, 17(2):238–246, 1980. 11

work page 1980

[37] [37]

Sequential learning of the pareto front for multi-objective bandits

Aurélien Garivier and Wouter M Koolen. Sequential learning of the pareto front for multi-objective bandits. InInternational Conference on Artificial Intelligence and Statistics, pages 3583–3591. PMLR, 2024

work page 2024

[38] [38]

A theory of regularized markov decision processes

Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized markov decision processes. InInternational conference on machine learning, pages 2160–2169. PMLR, 2019

work page 2019

[39] [39]

User’s guide for snopt version 7: software for large-scale nonlinear programming

Philip E Gill, Walter Murray, and Michael A Saunders. User’s guide for snopt version 7: software for large-scale nonlinear programming. Technical report, Systems Optimization Laboratory, Stanford University, 2006

work page 2006

[40] [40]

Pearson education india, 2009

Rafael C Gonzalez.Digital image processing. Pearson education india, 2009

work page 2009

[41] [41]

Controllable preference optimization: Toward controllable multi-objective alignment

Yiju Guo, Ganqu Cui, Lifan Yuan, Ning Ding, Zexu Sun, Bowen Sun, Huimin Chen, Ruobing Xie, Jie Zhou, Yankai Lin, et al. Controllable preference optimization: Toward controllable multi-objective alignment. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1437–1454, 2024

work page 2024

[42] [42]

Topology of pareto sets of strongly convex problems.SIAM Journal on Optimization, 30(3):2659–2686, 2020

Naoki Hamada, Kenta Hayano, Shunsuke Ichiki, Yutaro Kabata, and Hiroshi Teramoto. Topology of pareto sets of strongly convex problems.SIAM Journal on Optimization, 30(3):2659–2686, 2020

work page 2020

[43] [43]

Cambridge university press, 1952

Godfrey Harold Hardy, John Edensor Littlewood, and George Pólya.Inequalities. Cambridge university press, 1952

work page 1952

[44] [44]

Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

work page 2022

[45] [45]

Springer Science & Business Media, 2010

Weizhang Huang and Robert D Russell.Adaptive moving mesh methods, volume 174. Springer Science & Business Media, 2010

work page 2010

[46] [46]

SIAM, 2008

Kazufumi Ito and Karl Kunisch.Lagrange multiplier approach to variational problems and applications. SIAM, 2008

work page 2008

[47] [47]

Scalarization in multi objective optimization

Johannes Jahn. Scalarization in multi objective optimization. InMathematics of multi objective optimiza- tion, pages 45–88. Springer, 1985

work page 1985

[48] [48]

On the complexity of best-arm identification in multi-armed bandit models.The Journal of Machine Learning Research, 17(1):1–42, 2016

Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models.The Journal of Machine Learning Research, 17(1):1–42, 2016

work page 2016

[49] [49]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[50] [50]

Gradient-adaptive policy optimization: Towards multi-objective alignment of large language models

Chengao Li, Hanyu Zhang, Yunkun Xu, Hongyan Xue, Xiang Ao, and Qing He. Gradient-adaptive policy optimization: Towards multi-objective alignment of large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11214–11232, 2025

work page 2025

[51] [51]

Multi-objective large language model alignment with hierarchical experts.arXiv preprint arXiv:2505.20925, 2025

Zhuo Li, Guodong Du, Weiyang Guo, Yigeng Zhou, Xiucheng Li, Wenya Wang, Fangming Liu, Yequan Wang, Deheng Ye, Min Zhang, et al. Multi-objective large language model alignment with hierarchical experts.arXiv preprint arXiv:2505.20925, 2025

work page arXiv 2025

[52] [52]

Parm: Multi-objective test-time alignment via preference-aware autoregressive reward model.arXiv preprint arXiv:2505.06274, 2025

Baijiong Lin, Weisen Jiang, Yuancheng Xu, Hao Chen, and Ying-Cong Chen. Parm: Multi-objective test-time alignment via preference-aware autoregressive reward model.arXiv preprint arXiv:2505.06274, 2025

work page arXiv 2025

[53] [53]

Policy-regularized offline multi-objective reinforcement learning.arXiv preprint arXiv:2401.02244, 2024

Qian Lin, Chao Yu, Zongkai Liu, and Zifan Wu. Policy-regularized offline multi-objective reinforcement learning.arXiv preprint arXiv:2401.02244, 2024

work page arXiv 2024

[54] [54]

Smooth Tchebycheff scalarization for multi-objective optimization.arXiv preprint arXiv:2402.19078, 2024

Xi Lin, Xiaoyuan Zhang, Zhiyuan Yang, Fei Liu, Zhenkun Wang, and Qingfu Zhang. Smooth Tchebycheff scalarization for multi-objective optimization.arXiv preprint arXiv:2402.19078, 2024

work page arXiv 2024

[55] [55]

Pareto multi-task learning

Xi Lin, Hui-Ling Zhen, Zhenhua Li, Qing-Fu Zhang, and Sam Kwong. Pareto multi-task learning. In Proc. Advances Neural Inf. Process. Syst., Vancouver, Canada, December 2019

work page 2019

[56] [56]

Conflict-Averse Gradient Descent for Multi-task Learning

Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-Averse Gradient Descent for Multi-task Learning. InProc. Advances Neural Inf. Process. Syst., virtual, December 2021

work page 2021

[57] [57]

Dora: Weight-decomposed low-rank adaptation

Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. InForty-first International Conference on Machine Learning, 2024. 12

work page 2024

[58] [58]

The stochastic multi-gradient algorithm for multi-objective opti- mization and its application to supervised machine learning.Annals of Operations Research, pages 1–30, 2021

Suyun Liu and Luis Nunes Vicente. The stochastic multi-gradient algorithm for multi-objective opti- mization and its application to supervised machine learning.Annals of Operations Research, pages 1–30, 2021

work page 2021

[59] [59]

Singular continuation: generating piecewise linear approximations to pareto sets via global analysis.SIAM Journal on Optimization, 21(2):463–490, 2011

Alberto Lovison. Singular continuation: generating piecewise linear approximations to pareto sets via global analysis.SIAM Journal on Optimization, 21(2):463–490, 2011

work page 2011

[60] [60]

Multi-task learning with user preferences: Gradient descent with controlled ascent in Pareto optimization

Debabrata Mahapatra and Vaibhav Rajan. Multi-task learning with user preferences: Gradient descent with controlled ascent in Pareto optimization. InProc. Int. Conf. Machine Learn., virtual, 2020

work page 2020

[61] [61]

The weighted sum method for multi-objective optimization: new insights.Structural and multidisciplinary optimization, 41(6):853–862, 2010

R Timothy Marler and Jasbir S Arora. The weighted sum method for multi-objective optimization: new insights.Structural and multidisciplinary optimization, 41(6):853–862, 2010

work page 2010

[62] [62]

On continuation methods for non-linear bi-objective optimization: towards a certified interval-based approach.Journal of Global Optimization, 64(1):3–16, 2016

Brice Martin, Alexandre Goldsztejn, Laurent Granvilliers, and Christophe Jermann. On continuation methods for non-linear bi-objective optimization: towards a certified interval-based approach.Journal of Global Optimization, 64(1):3–16, 2016

work page 2016

[63] [63]

Springer US, Boston, MA, 1998

Kaisa Miettinen.Nonlinear Multiobjective Optimization, volume 12. Springer US, Boston, MA, 1998

work page 1998

[64] [64]

A multi-objective/multi-task learning framework induced by Pareto stationarity

Michinari Momma, Chaosheng Dong, and Jia Liu. A multi-objective/multi-task learning framework induced by Pareto stationarity. InProc. Int. Conf. Machine Learn., Baltimore, MD, USA, 2022

work page 2022

[65] [65]

Efficient memory-based learning for robot control

Andrew William Moore. Efficient memory-based learning for robot control. Technical report, University of Cambridge, 1990

work page 1990

[66] [66]

Optimal rates of convergence for entropy regularization in discounted markov decision processes.Information and Inference: A Journal of the IMA, 15(1), 2026

Johannes Müller and Semih Cayci. Optimal rates of convergence for entropy regularization in discounted markov decision processes.Information and Inference: A Journal of the IMA, 15(1), 2026

work page 2026

[67] [67]

Learning the pareto front with hypernet- works

Aviv Navon, Aviv Shamsian, Ethan Fetaya, and Gal Chechik. Learning the pareto front with hypernet- works. InProc. Int. Conf. Learn. Representations, virtual, April 2020

work page 2020

[68] [68]

Cubic regularization of newton method and its global performance

Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical programming, 108(1):177–205, 2006

work page 2006

[69] [69]

Asynchronous distributed solution of large scale nonlinear inversion problems.Applied numerical mathematics, 30(1):31–40, 1999

Victor Pereyra. Asynchronous distributed solution of large scale nonlinear inversion problems.Applied numerical mathematics, 30(1):31–40, 1999

work page 1999

[70] [70]

Fast computation of equispaced pareto manifolds and pareto fronts for multiobjective optimization problems.Mathematics and Computers in Simulation, 79(6):1935–1947, 2009

Victor Pereyra. Fast computation of equispaced pareto manifolds and pareto fronts for multiobjective optimization problems.Mathematics and Computers in Simulation, 79(6):1935–1947, 2009

work page 1935

[71] [71]

Equispaced pareto front construction for constrained bi-objective optimization.Mathematical and Computer Modelling, 57(9-10):2122–2131, 2013

Victor Pereyra, Michael Saunders, and José Castillo. Equispaced pareto front construction for constrained bi-objective optimization.Mathematical and Computer Modelling, 57(9-10):2122–2131, 2013

work page 2013

[72] [72]

Functional bilevel optimization for machine learning

Ieva Petrulionyte, Julien Mairal, and Michael Arbel. Functional bilevel optimization for machine learning. Advances in Neural Information Processing Systems, 37:14016–14065, 2024

work page 2024

[73] [73]

Traversing pareto optimal policies: Provably efficient multi-objective reinforcement learning.arXiv preprint arXiv:2407.17466, 2024

Shuang Qiu, Dake Zhang, Rui Yang, Boxiang Lyu, and Tong Zhang. Traversing pareto optimal policies: Provably efficient multi-objective reinforcement learning.arXiv preprint arXiv:2407.17466, 2024

work page arXiv 2024

[74] [74]

Qwen2.5 technical report, 2025

Qwen Team, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao L...

work page 2025

[75] [75]

Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

work page 2023

[76] [76]

Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023

Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023

work page 2023

[77] [77]

Pareto conditioned networks.arXiv preprint arXiv:2204.05036, 2022

Mathieu Reymond, Eugenio Bargiacchi, and Ann Nowé. Pareto conditioned networks.arXiv preprint arXiv:2204.05036, 2022. 13

work page arXiv 2022

[78] [78]

Multi-objective reinforcement learning for the expected utility of the return

Diederik M Roijers, Denis Steckelmacher, and Ann Nowé. Multi-objective reinforcement learning for the expected utility of the return. InProceedings of the Adaptive and Learning Agents workshop at FAIM, volume 2018, 2018

work page 2018

[79] [79]

A survey of multi-objective sequential decision-making.Journal of Artificial Intelligence Research, 48:67–113, 2013

Diederik M Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making.Journal of Artificial Intelligence Research, 48:67–113, 2013

work page 2013

[80] [80]

Optimization on pareto sets: On a theory of multi-objective optimization.arXiv preprint arXiv:2308.02145, 2023

Abhishek Roy, Geelon So, and Yi-An Ma. Optimization on pareto sets: On a theory of multi-objective optimization.arXiv preprint arXiv:2308.02145, 2023

work page arXiv 2023