pith. sign in

arxiv: 2605.20619 · v1 · pith:TZDM3KSQnew · submitted 2026-05-20 · 💻 cs.LG · math.OC· stat.ML

SURF: Steering the Scalarization Weight to Uniformly Traverse the Pareto Front

Pith reviewed 2026-05-21 06:38 UTC · model grok-4.3

classification 💻 cs.LG math.OCstat.ML
keywords multi-objective optimizationPareto frontscalarizationuniform samplingarc-lengthbanditsreinforcement learning
0
0 comments X

The pith

Inverting the arc-length cumulative distribution function of the scalarization path produces weights that uniformly cover the Pareto front.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explains that simply sampling scalarization weights uniformly often leaves gaps in the Pareto front because the solutions move along the front at varying speeds. By measuring the arc length along this path and building its cumulative distribution function, the authors derive a way to choose weights that space solutions evenly. They introduce SURF, which for some problems like bi-objective bandits has closed-form solutions and for others rebuilds the distribution iteratively while sampling. Theory shows linear convergence to a limit set by finite samples, and tests on bandits, environments, and language model alignment confirm better uniformity than standard approaches.

Core claim

The central claim is that the scalarization path traces the Pareto front with non-uniform speed, inducing an arc-length CDF; inverting this CDF map yields a principled rule for selecting weights that produce uniform PF coverage. SURF implements this rule, deriving closed forms where possible and alternating reconstruction and sampling otherwise.

What carries the argument

The arc-length cumulative distribution function (CDF) of the scalarization path, which tracks how far along the Pareto front the solutions have traveled as the weight changes; inverting it corrects for the uneven traversal speed to select weights evenly.

If this is right

  • Closed-form expressions exist for the CDF map in structured problems like bi-objective bandits.
  • SURF converges linearly to an unavoidable finite-sampling floor under provable conditions.
  • Empirical results show more uniform coverage on bandits, multi-objective gymnasium environments, and multi-objective LLM alignment tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method might extend to other multi-objective algorithms that use scalarization but currently ignore path geometry.
  • In very high-dimensional objective spaces, repeatedly reconstructing the CDF could add significant computational overhead.
  • This geometric view suggests designing new scalarization functions that inherently traverse the front more uniformly.

Load-bearing premise

The scalarization path traces the Pareto front continuously enough to define a meaningful arc-length and to reconstruct its cumulative distribution function from samples.

What would settle it

Running the method on a simple bi-objective problem with a known curved Pareto front and checking whether the achieved solution spacing is significantly more uniform than uniform weight sampling would confirm or refute the benefit of the inversion rule.

Figures

Figures reproduced from arXiv: 2605.20619 by Chentong Huang, Lisha Chen, Liuyuan Jiang.

Figure 1
Figure 1. Figure 1: LS results for multi￾arm bandit (see Appendix F.1 for the detailed settings). Uniform weights produce uneven points on PF (stars), whereas PF-aware sampling yields a uniform place￾ment (colored dots). Formally, let u denote the decision variable in the feasible set U ⊆ R d , MOO seeks to minimize M objectives hm : R d → R: min u∈U n h(u) := [h1(u), · · · , hM(u)] ⊤ o . (1) We consider optimizing the above … view at source ↗
Figure 2
Figure 2. Figure 2: SURF overview: bridging w and PF points via CDF Φ yields two rules. Rule 1 directly samples {wn} N n=0 when Φ is known (Section 2). Rule 2 handles the general case by iteratively approximating Φ (Algorithm 1). We answer this by proposing Sampling Uniformly along the Pareto Front (SURF; [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of approaches for obtaining diverse PF solutions, including NBI method [ [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Original parameter space U and hidden space X = g(U). Bijection g maps non-convex struc￾tures in U to a convex space X . Modern machine-learning objectives are highly non-convex in their original parameterization. Nevertheless, many problems admit an equivalent hidden representation where the objective becomes convex or strongly convex [72, 31]. Based on this viewpoint, we impose the following assumption. … view at source ↗
Figure 5
Figure 5. Figure 5: Error vs. number of pulls T in bandit. Error bars show stan￾dard deviations across 100 trials. This closed form makes the PF geometry explicit and provides a tractable testbed for the geometry behind SURF, as visualized in [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: PFs obtained by SURF and baselines on DST. DST. DST [89, 101] is a classical bi-objective RL benchmark, where the agent trades off treasure value and travel time. We evaluate SURF Algorithm 1 with 30 outer iterations and α = 0.3 [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Theoretical analysis roadmap for SURF. The solid-line blocks summarize the main results [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: DST environment. Problem Setup. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p050_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Trajectory distributions induced by policies obtained via our SURF Algorithm 1, [PITH_FULL_IMAGE:figures/full_fig_p052_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Convergence behavior of the SURF Algorithm 1 under different parameter choices. Left: [PITH_FULL_IMAGE:figures/full_fig_p053_10.png] view at source ↗
Figure 12
Figure 12. Figure 12: ∥Φt − Φ∥∞ vs. N ∈ {4, 6, . . . , 20} on DST. spectrum. Overall, the SURF Algorithm 1 method provides a more gradual progression of policies along the Pareto trade-off [PITH_FULL_IMAGE:figures/full_fig_p053_12.png] view at source ↗
read the original abstract

Scalarization is widely used in multi-objective optimization owing to its simplicity and scalability. In many applications, the goal is to generate solutions that represent diverse user preferences, ideally with uniform coverage of the Pareto front (PF). However, uniformly sampling scalarization weights usually induces non-uniform coverage of the PF. We explain this mismatch through a geometric analysis of the scalarization path. As the scalarization weight varies, the corresponding solutions trace the PF with a generally non-uniform traversal speed. This speed induces an arc-length cumulative distribution function (CDF); inverting this CDF map yields a principled rule for selecting weights that produce uniform PF coverage. Building on this insight, we propose SURF (Sampling Uniformly along the PaReto Front). For structured problems, including bi-objective bandits, we derive closed-form expressions for this CDF map and the resulting PF-aware weight sampling rule. For general problems, SURF alternates between CDF reconstruction and weight sampling. Theoretically, we show that under provable conditions, SURF converges linearly to an unavoidable finite-sampling floor. Empirically, experiments on bandits, multi-objective-gymnasium, and multi-objective LLM alignment demonstrate that SURF efficiently achieves more uniform PF coverage than baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that uniformly sampling scalarization weights in multi-objective optimization produces non-uniform Pareto front (PF) coverage because the scalarization path traverses the PF at varying speeds; inverting the arc-length CDF of this path yields a weight-sampling rule (SURF) that achieves uniform coverage. For structured cases such as bi-objective bandits the authors derive closed-form CDF maps and sampling rules; for general problems SURF alternates between CDF reconstruction and weight sampling. They prove linear convergence to a finite-sampling floor under stated conditions and report empirical improvements over baselines on bandits, multi-objective gymnasium environments, and LLM alignment tasks.

Significance. If the geometric analysis and reachability assumptions hold, the work supplies a principled, geometry-driven method for uniform PF sampling that improves on ad-hoc weight selection. The closed-form derivations for bandits and the linear-convergence guarantee are concrete strengths; the empirical results across three distinct domains further support practical utility for applications that require diverse, preference-representing solutions.

major comments (2)
  1. [Abstract / geometric analysis] Abstract and geometric-analysis section: the central claim that CDF inversion produces uniform coverage of the full PF rests on the unstated premise that the scalarization path reaches every point on the PF. Standard weighted-sum scalarization recovers only supported points on the convex hull; non-convex segments are unreachable. This assumption is load-bearing for the uniform-coverage guarantee and must be explicitly stated, justified, or restricted to convex fronts.
  2. [Theoretical analysis] Theoretical-convergence paragraph: the linear convergence result is stated to hold 'under provable conditions,' yet the precise conditions (e.g., Lipschitz continuity of the PF, strict convexity, or bounded curvature) are not enumerated. Without an explicit list or a theorem statement that isolates these conditions, it is impossible to verify applicability to the general alternating-reconstruction case.
minor comments (2)
  1. Notation for the arc-length parameter and the CDF inversion operator should be introduced once with a clear diagram or equation reference rather than re-defined inline in multiple places.
  2. The description of the alternating reconstruction procedure for general problems would benefit from a short pseudocode block or numbered steps to clarify the loop termination criterion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment below, providing clarifications on the scope of our claims and indicating the revisions we will make to improve the manuscript.

read point-by-point responses
  1. Referee: [Abstract / geometric analysis] Abstract and geometric-analysis section: the central claim that CDF inversion produces uniform coverage of the full PF rests on the unstated premise that the scalarization path reaches every point on the PF. Standard weighted-sum scalarization recovers only supported points on the convex hull; non-convex segments are unreachable. This assumption is load-bearing for the uniform-coverage guarantee and must be explicitly stated, justified, or restricted to convex fronts.

    Authors: We agree that weighted-sum scalarization recovers only supported solutions on the convex hull and cannot reach non-convex segments of the Pareto front. Our geometric analysis of the scalarization path and the SURF sampling rule are developed specifically for the reachable (supported) portion of the front. We will revise the abstract and geometric-analysis section to state this premise explicitly, restrict the uniform-coverage guarantee to supported points, and note that alternative scalarizations would be required for unsupported regions. This change clarifies the scope without affecting the core technical contributions. revision: yes

  2. Referee: [Theoretical analysis] Theoretical-convergence paragraph: the linear convergence result is stated to hold 'under provable conditions,' yet the precise conditions (e.g., Lipschitz continuity of the PF, strict convexity, or bounded curvature) are not enumerated. Without an explicit list or a theorem statement that isolates these conditions, it is impossible to verify applicability to the general alternating-reconstruction case.

    Authors: We appreciate the request for greater precision. The current manuscript refers to 'provable conditions' without enumerating them in the main text. In the revision we will insert an explicit theorem statement that lists the required assumptions, including Lipschitz continuity of the Pareto-front mapping, bounded curvature ensuring a well-defined arc-length, and the technical conditions on the CDF reconstruction step for the general alternating procedure. This will make the applicability of the linear-convergence result verifiable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation rests on independent geometric analysis

full rationale

The paper derives the SURF weight-sampling rule by first performing a geometric analysis of the scalarization path's traversal speed along the PF, which induces an arc-length CDF; inverting that CDF then supplies the weight rule. This chain is presented as following from the path geometry rather than from any fitted parameter, self-definition, or prior self-citation that encodes the target result. Closed-form CDF expressions for bi-objective bandits and the alternating reconstruction procedure for general cases are derived directly from the same geometric model. Theoretical linear convergence is shown under explicitly stated provable conditions, and empirical comparisons are external to the derivation. No step reduces by construction to its own inputs, and the central claim retains independent content from the geometric premise.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the geometric characterization of the scalarization path and the invertibility of its arc-length CDF; no explicit free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption The scalarization path traces the Pareto front with a traversal speed that admits a well-defined arc-length cumulative distribution function.
    This premise underpins the CDF inversion step described in the abstract.

pith-pipeline@v0.9.0 · 5752 in / 1243 out tokens · 40177 ms · 2026-05-21T06:38:48.077103+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

117 extracted references · 117 canonical work pages · 3 internal anchors

  1. [1]

    Optimistic linear support and successor features as a basis for optimal policy transfer

    Lucas Nunes Alegre, Ana Bazzan, and Bruno C Da Silva. Optimistic linear support and successor features as a basis for optimal policy transfer. InInternational conference on machine learning, pages 394–413. PMLR, 2022

  2. [2]

    Routledge, 2021

    Eitan Altman.Constrained Markov decision processes. Routledge, 2021

  3. [3]

    System level synthesis.Annual Reviews in Control, 47:364–393, 2019

    James Anderson, John C Doyle, Steven H Low, and Nikolai Matni. System level synthesis.Annual Reviews in Control, 47:364–393, 2019

  4. [4]

    Finite-time analysis of the multiarmed bandit problem

    Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256, 2002

  5. [5]

    Learning all optimal policies with multiple criteria

    Leon Barrett and Srini Narayanan. Learning all optimal policies with multiple criteria. InProceedings of the 25th international conference on Machine learning, pages 41–47, 2008

  6. [6]

    The pareto tracer for general inequality constrained multi-objective optimization problems.Mathematical and Computational Applications, 25(4):80, 2020

    Fernanda Beltrán, Oliver Cuate, and Oliver Schütze. The pareto tracer for general inequality constrained multi-objective optimization problems.Mathematical and Computational Applications, 25(4):80, 2020

  7. [7]

    Occupation measures for controlled markov processes: Characteriza- tion and optimality.The Annals of Probability, pages 1531–1562, 1996

    Abhay G Bhatt and Vivek S Borkar. Occupation measures for controlled markov processes: Characteriza- tion and optimality.The Annals of Probability, pages 1531–1562, 1996

  8. [8]

    John Wiley & Sons, 2017

    Patrick Billingsley.Probability and measure. John Wiley & Sons, 2017

  9. [9]

    The averaged hausdorff distances in multi-objective optimization: A review.Mathematics, 7(10):894, 2019

    José M Bogoya, Andrés Vargas, and Oliver Schütze. The averaged hausdorff distances in multi-objective optimization: A review.Mathematics, 7(10):894, 2019

  10. [10]

    Tracing locally pareto- optimal points by numerical integration.SIAM Journal on Control and Optimization, 59(5):3302–3328, 2021

    Matthias Bolten, Onur Tanil Doganay, Hanno Gottschalk, and Kathrin Klamroth. Tracing locally pareto- optimal points by numerical integration.SIAM Journal on Control and Optimization, 59(5):3302–3328, 2021

  11. [11]

    A tutorial on geometric programming.Optimization and engineering, 8(1):67–127, 2007

    Stephen Boyd, Seung-Jean Kim, Lieven Vandenberghe, and Arash Hassibi. A tutorial on geometric programming.Optimization and engineering, 8(1):67–127, 2007

  12. [12]

    Pure exploration in multi-armed bandits problems

    Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In International conference on Algorithmic learning theory, pages 23–37. Springer, 2009

  13. [13]

    Moving mesh generation using the parabolic monge–ampère equation

    Chris J Budd and JF Williams. Moving mesh generation using the parabolic monge–ampère equation. SIAM Journal on Scientific Computing, 31(5):3438–3465, 2009

  14. [14]

    Courier Dover Publications, 2008

    Vira Chankong and Yacov Y Haimes.Multiobjective decision making: theory and methodology. Courier Dover Publications, 2008

  15. [15]

    Three-way trade-off in multi-objective learning: Optimization, generalization and conflict-avoidance.J

    Lisha Chen, Heshan Fernando, Yiming Ying, and Tianyi Chen. Three-way trade-off in multi-objective learning: Optimization, generalization and conflict-avoidance.J. Machine Learn. Res., 25(193):1–53, 5 2024

  16. [16]

    Ferero: A flexible framework for preference- guided multi-objective learning.Advances in Neural Information Processing Systems, 37:18758–18805, 2024

    Lisha Chen, AFM Saif, Yanning Shen, and Tianyi Chen. Ferero: A flexible framework for preference- guided multi-objective learning.Advances in Neural Information Processing Systems, 37:18758–18805, 2024. 10

  17. [17]

    Efficient first-order optimization on the Pareto set for multi-objective learning under preference guidance

    Lisha Chen, Quan Xiao, Ellen Hidemi Fukuda, Xinyi Chen, Kun Yuan, and Tianyi Chen. Efficient first-order optimization on the Pareto set for multi-objective learning under preference guidance. In International Conference on Machine Learning (ICML), as spotlight, 2025

  18. [18]

    Pareto merging: Multi-objective optimization for preference-aware model merging.arXiv preprint arXiv:2408.12105, 2024

    Weiyu Chen and James Kwok. Pareto merging: Multi-objective optimization for preference-aware model merging.arXiv preprint arXiv:2408.12105, 2024

  19. [19]

    Gradient-based multi-objective deep learning: Algorithms, theories, applications, and beyond.arXiv preprint arXiv:2501.10945, 2025

    Weiyu Chen, Xiaoyuan Zhang, Baijiong Lin, Xi Lin, Han Zhao, Qingfu Zhang, and James T Kwok. Gradient-based multi-objective deep learning: Algorithms, theories, applications, and beyond.arXiv preprint arXiv:2501.10945, 2025

  20. [20]

    Efficient algorithms for a class of stochastic hidden convex optimization and its applications in network revenue management.Operations Research, 73(2):704–719, 2025

    Xin Chen, Niao He, Yifan Hu, and Zikun Ye. Efficient algorithms for a class of stochastic hidden convex optimization and its applications in network revenue management.Operations Research, 73(2):704–719, 2025

  21. [21]

    Network revenue management with online inverse batch gradient descent method.Production and Operations Management, 32(7):2123–2137, 2023

    Yiwei Chen and Cong Shi. Network revenue management with online inverse batch gradient descent method.Production and Operations Management, 32(7):2123–2137, 2023

  22. [22]

    Multi-objective neural bandits with random scalarization

    Ji Cheng, Bo Xue, Chengyu Lu, Ziqiang Cui, and Qingfu Zhang. Multi-objective neural bandits with random scalarization. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 4914–4922, 2025

  23. [23]

    Safe RLHF: Safe Reinforcement Learning from Human Feedback

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback.arXiv preprint arXiv:2310.12773, 2023

  24. [24]

    Normal-boundary intersection: A new method for generating the pareto surface in nonlinear multicriteria optimization problems.SIAM journal on optimization, 8(3):631–657, 1998

    Indraneel Das and John E Dennis. Normal-boundary intersection: A new method for generating the pareto surface in nonlinear multicriteria optimization problems.SIAM journal on optimization, 8(3):631–657, 1998

  25. [25]

    Designing multi-objective multi-armed bandits algorithms: A study

    Madalina M Drugan and Ann Nowe. Designing multi-objective multi-armed bandits algorithms: A study. InThe 2013 international joint conference on neural networks (IJCNN), pages 1–8. IEEE, 2013

  26. [26]

    Scalarization based pareto optimal set of arms identification algorithms

    Madalina M Drugan and Ann Nowé. Scalarization based pareto optimal set of arms identification algorithms. In2014 International Joint Conference on Neural Networks (IJCNN), pages 2690–2697. IEEE, 2014

  27. [27]

    Efficiency of minimizing compositions of convex functions and smooth maps.Mathematical Programming, 178(1):503–558, 2019

    Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of minimizing compositions of convex functions and smooth maps.Mathematical Programming, 178(1):503–558, 2019

  28. [28]

    Springer, 2005

    Matthias Ehrgott.Multicriteria optimization. Springer, 2005

  29. [29]

    Trade-off analysis approach for interactive nonlinear multiobjective optimization.OR spectrum, 34(4):803–816, 2012

    Petri Eskelinen and Kaisa Miettinen. Trade-off analysis approach for interactive nonlinear multiobjective optimization.OR spectrum, 34(4):803–816, 2012

  30. [30]

    Cambridge university press Cambridge, UK, 2002

    Brian S Everitt and Anders Skrondal.The Cambridge dictionary of statistics, volume 140. Cambridge university press Cambridge, UK, 2002

  31. [31]

    Stochastic optimization under hidden convexity.SIAM Journal on Optimization, 35(4):2544–2571, 2025

    Ilyas Fatkhullin, Niao He, and Yifan Hu. Stochastic optimization under hidden convexity.SIAM Journal on Optimization, 35(4):2544–2571, 2025

  32. [32]

    Alegre, Ann Nowé, Ana L

    Florian Felten, Lucas N. Alegre, Ann Nowé, Ana L. C. Bazzan, El Ghazali Talbi, Grégoire Danoy, and Bruno C. da Silva. A toolkit for reliable benchmarking and research in multi-objective reinforcement learning. InProceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), 2023

  33. [33]

    Multi-objective reinforcement learning based on decomposition: A taxonomy and framework.Journal of Artificial Intelligence Research, 79:679–723, 2024

    Florian Felten, El-Ghazali Talbi, and Grégoire Danoy. Multi-objective reinforcement learning based on decomposition: A taxonomy and framework.Journal of Artificial Intelligence Research, 79:679–723, 2024

  34. [34]

    Mitigating gradient bias in multi-objective learning: A provably convergent stochastic approach

    Heshan Fernando, Han Shen, Miao Liu, Subhajit Chaudhury, Keerthiram Murugesan, and Tianyi Chen. Mitigating gradient bias in multi-objective learning: A provably convergent stochastic approach. InProc. Int. Conf. Learn. Representations, Kigali, Rwanda, May 2023

  35. [35]

    Sensitivity analysis for nonlinear programming using penalty methods.Mathematical programming, 10(1):287–311, 1976

    Anthony V Fiacco. Sensitivity analysis for nonlinear programming using penalty methods.Mathematical programming, 10(1):287–311, 1976

  36. [36]

    Monotone piecewise cubic interpolation.SIAM Journal on Numerical Analysis, 17(2):238–246, 1980

    Frederick N Fritsch and Ralph E Carlson. Monotone piecewise cubic interpolation.SIAM Journal on Numerical Analysis, 17(2):238–246, 1980. 11

  37. [37]

    Sequential learning of the pareto front for multi-objective bandits

    Aurélien Garivier and Wouter M Koolen. Sequential learning of the pareto front for multi-objective bandits. InInternational Conference on Artificial Intelligence and Statistics, pages 3583–3591. PMLR, 2024

  38. [38]

    A theory of regularized markov decision processes

    Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized markov decision processes. InInternational conference on machine learning, pages 2160–2169. PMLR, 2019

  39. [39]

    User’s guide for snopt version 7: software for large-scale nonlinear programming

    Philip E Gill, Walter Murray, and Michael A Saunders. User’s guide for snopt version 7: software for large-scale nonlinear programming. Technical report, Systems Optimization Laboratory, Stanford University, 2006

  40. [40]

    Pearson education india, 2009

    Rafael C Gonzalez.Digital image processing. Pearson education india, 2009

  41. [41]

    Controllable preference optimization: Toward controllable multi-objective alignment

    Yiju Guo, Ganqu Cui, Lifan Yuan, Ning Ding, Zexu Sun, Bowen Sun, Huimin Chen, Ruobing Xie, Jie Zhou, Yankai Lin, et al. Controllable preference optimization: Toward controllable multi-objective alignment. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1437–1454, 2024

  42. [42]

    Topology of pareto sets of strongly convex problems.SIAM Journal on Optimization, 30(3):2659–2686, 2020

    Naoki Hamada, Kenta Hayano, Shunsuke Ichiki, Yutaro Kabata, and Hiroshi Teramoto. Topology of pareto sets of strongly convex problems.SIAM Journal on Optimization, 30(3):2659–2686, 2020

  43. [43]

    Cambridge university press, 1952

    Godfrey Harold Hardy, John Edensor Littlewood, and George Pólya.Inequalities. Cambridge university press, 1952

  44. [44]

    Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

  45. [45]

    Springer Science & Business Media, 2010

    Weizhang Huang and Robert D Russell.Adaptive moving mesh methods, volume 174. Springer Science & Business Media, 2010

  46. [46]

    SIAM, 2008

    Kazufumi Ito and Karl Kunisch.Lagrange multiplier approach to variational problems and applications. SIAM, 2008

  47. [47]

    Scalarization in multi objective optimization

    Johannes Jahn. Scalarization in multi objective optimization. InMathematics of multi objective optimiza- tion, pages 45–88. Springer, 1985

  48. [48]

    On the complexity of best-arm identification in multi-armed bandit models.The Journal of Machine Learning Research, 17(1):1–42, 2016

    Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models.The Journal of Machine Learning Research, 17(1):1–42, 2016

  49. [49]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

  50. [50]

    Gradient-adaptive policy optimization: Towards multi-objective alignment of large language models

    Chengao Li, Hanyu Zhang, Yunkun Xu, Hongyan Xue, Xiang Ao, and Qing He. Gradient-adaptive policy optimization: Towards multi-objective alignment of large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11214–11232, 2025

  51. [51]

    Multi-objective large language model alignment with hierarchical experts.arXiv preprint arXiv:2505.20925, 2025

    Zhuo Li, Guodong Du, Weiyang Guo, Yigeng Zhou, Xiucheng Li, Wenya Wang, Fangming Liu, Yequan Wang, Deheng Ye, Min Zhang, et al. Multi-objective large language model alignment with hierarchical experts.arXiv preprint arXiv:2505.20925, 2025

  52. [52]

    Parm: Multi-objective test-time alignment via preference-aware autoregressive reward model.arXiv preprint arXiv:2505.06274, 2025

    Baijiong Lin, Weisen Jiang, Yuancheng Xu, Hao Chen, and Ying-Cong Chen. Parm: Multi-objective test-time alignment via preference-aware autoregressive reward model.arXiv preprint arXiv:2505.06274, 2025

  53. [53]

    Policy-regularized offline multi-objective reinforcement learning.arXiv preprint arXiv:2401.02244, 2024

    Qian Lin, Chao Yu, Zongkai Liu, and Zifan Wu. Policy-regularized offline multi-objective reinforcement learning.arXiv preprint arXiv:2401.02244, 2024

  54. [54]

    Smooth Tchebycheff scalarization for multi-objective optimization.arXiv preprint arXiv:2402.19078, 2024

    Xi Lin, Xiaoyuan Zhang, Zhiyuan Yang, Fei Liu, Zhenkun Wang, and Qingfu Zhang. Smooth Tchebycheff scalarization for multi-objective optimization.arXiv preprint arXiv:2402.19078, 2024

  55. [55]

    Pareto multi-task learning

    Xi Lin, Hui-Ling Zhen, Zhenhua Li, Qing-Fu Zhang, and Sam Kwong. Pareto multi-task learning. In Proc. Advances Neural Inf. Process. Syst., Vancouver, Canada, December 2019

  56. [56]

    Conflict-Averse Gradient Descent for Multi-task Learning

    Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-Averse Gradient Descent for Multi-task Learning. InProc. Advances Neural Inf. Process. Syst., virtual, December 2021

  57. [57]

    Dora: Weight-decomposed low-rank adaptation

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. InForty-first International Conference on Machine Learning, 2024. 12

  58. [58]

    The stochastic multi-gradient algorithm for multi-objective opti- mization and its application to supervised machine learning.Annals of Operations Research, pages 1–30, 2021

    Suyun Liu and Luis Nunes Vicente. The stochastic multi-gradient algorithm for multi-objective opti- mization and its application to supervised machine learning.Annals of Operations Research, pages 1–30, 2021

  59. [59]

    Singular continuation: generating piecewise linear approximations to pareto sets via global analysis.SIAM Journal on Optimization, 21(2):463–490, 2011

    Alberto Lovison. Singular continuation: generating piecewise linear approximations to pareto sets via global analysis.SIAM Journal on Optimization, 21(2):463–490, 2011

  60. [60]

    Multi-task learning with user preferences: Gradient descent with controlled ascent in Pareto optimization

    Debabrata Mahapatra and Vaibhav Rajan. Multi-task learning with user preferences: Gradient descent with controlled ascent in Pareto optimization. InProc. Int. Conf. Machine Learn., virtual, 2020

  61. [61]

    The weighted sum method for multi-objective optimization: new insights.Structural and multidisciplinary optimization, 41(6):853–862, 2010

    R Timothy Marler and Jasbir S Arora. The weighted sum method for multi-objective optimization: new insights.Structural and multidisciplinary optimization, 41(6):853–862, 2010

  62. [62]

    On continuation methods for non-linear bi-objective optimization: towards a certified interval-based approach.Journal of Global Optimization, 64(1):3–16, 2016

    Brice Martin, Alexandre Goldsztejn, Laurent Granvilliers, and Christophe Jermann. On continuation methods for non-linear bi-objective optimization: towards a certified interval-based approach.Journal of Global Optimization, 64(1):3–16, 2016

  63. [63]

    Springer US, Boston, MA, 1998

    Kaisa Miettinen.Nonlinear Multiobjective Optimization, volume 12. Springer US, Boston, MA, 1998

  64. [64]

    A multi-objective/multi-task learning framework induced by Pareto stationarity

    Michinari Momma, Chaosheng Dong, and Jia Liu. A multi-objective/multi-task learning framework induced by Pareto stationarity. InProc. Int. Conf. Machine Learn., Baltimore, MD, USA, 2022

  65. [65]

    Efficient memory-based learning for robot control

    Andrew William Moore. Efficient memory-based learning for robot control. Technical report, University of Cambridge, 1990

  66. [66]

    Optimal rates of convergence for entropy regularization in discounted markov decision processes.Information and Inference: A Journal of the IMA, 15(1), 2026

    Johannes Müller and Semih Cayci. Optimal rates of convergence for entropy regularization in discounted markov decision processes.Information and Inference: A Journal of the IMA, 15(1), 2026

  67. [67]

    Learning the pareto front with hypernet- works

    Aviv Navon, Aviv Shamsian, Ethan Fetaya, and Gal Chechik. Learning the pareto front with hypernet- works. InProc. Int. Conf. Learn. Representations, virtual, April 2020

  68. [68]

    Cubic regularization of newton method and its global performance

    Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical programming, 108(1):177–205, 2006

  69. [69]

    Asynchronous distributed solution of large scale nonlinear inversion problems.Applied numerical mathematics, 30(1):31–40, 1999

    Victor Pereyra. Asynchronous distributed solution of large scale nonlinear inversion problems.Applied numerical mathematics, 30(1):31–40, 1999

  70. [70]

    Fast computation of equispaced pareto manifolds and pareto fronts for multiobjective optimization problems.Mathematics and Computers in Simulation, 79(6):1935–1947, 2009

    Victor Pereyra. Fast computation of equispaced pareto manifolds and pareto fronts for multiobjective optimization problems.Mathematics and Computers in Simulation, 79(6):1935–1947, 2009

  71. [71]

    Equispaced pareto front construction for constrained bi-objective optimization.Mathematical and Computer Modelling, 57(9-10):2122–2131, 2013

    Victor Pereyra, Michael Saunders, and José Castillo. Equispaced pareto front construction for constrained bi-objective optimization.Mathematical and Computer Modelling, 57(9-10):2122–2131, 2013

  72. [72]

    Functional bilevel optimization for machine learning

    Ieva Petrulionyte, Julien Mairal, and Michael Arbel. Functional bilevel optimization for machine learning. Advances in Neural Information Processing Systems, 37:14016–14065, 2024

  73. [73]

    Traversing pareto optimal policies: Provably efficient multi-objective reinforcement learning.arXiv preprint arXiv:2407.17466, 2024

    Shuang Qiu, Dake Zhang, Rui Yang, Boxiang Lyu, and Tong Zhang. Traversing pareto optimal policies: Provably efficient multi-objective reinforcement learning.arXiv preprint arXiv:2407.17466, 2024

  74. [74]

    Qwen2.5 technical report, 2025

    Qwen Team, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao L...

  75. [75]

    Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

  76. [76]

    Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023

    Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023

  77. [77]

    Pareto conditioned networks.arXiv preprint arXiv:2204.05036, 2022

    Mathieu Reymond, Eugenio Bargiacchi, and Ann Nowé. Pareto conditioned networks.arXiv preprint arXiv:2204.05036, 2022. 13

  78. [78]

    Multi-objective reinforcement learning for the expected utility of the return

    Diederik M Roijers, Denis Steckelmacher, and Ann Nowé. Multi-objective reinforcement learning for the expected utility of the return. InProceedings of the Adaptive and Learning Agents workshop at FAIM, volume 2018, 2018

  79. [79]

    A survey of multi-objective sequential decision-making.Journal of Artificial Intelligence Research, 48:67–113, 2013

    Diederik M Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making.Journal of Artificial Intelligence Research, 48:67–113, 2013

  80. [80]

    Optimization on pareto sets: On a theory of multi-objective optimization.arXiv preprint arXiv:2308.02145, 2023

    Abhishek Roy, Geelon So, and Yi-An Ma. Optimization on pareto sets: On a theory of multi-objective optimization.arXiv preprint arXiv:2308.02145, 2023

Showing first 80 references.