
arxiv: 2605.09094 · v2 · submitted 2026-05-09 · 💻 cs.LG

A Tale of Two Problems: Multi-Task Bilevel Learning Meets Equality Constrained Multi-Objective Optimization

Pith reviewed 2026-05-15 04:53 UTC · model grok-4.3

classification 💻 cs.LG
keywords multi-task bilevel learning · equality constrained multi-objective optimization · weighted Chebyshev penalty · Pareto stationarity · finite-time convergence · stochastic optimization · machine learning

The pith

Reformulating multi-task bilevel learning under general convexity as an equality-constrained multi-objective problem lets a weighted Chebyshev penalty algorithm reach KKT Pareto stationarity at rate O(S T^{-1/2}).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that multi-task bilevel learning problems, common in modern machine learning, can be turned into equality-constrained multi-objective optimization problems when the lower level obeys only a general convexity condition rather than strong convexity. This change of perspective supplies a new convergence target: KKT-based Pareto stationarity for the multi-objective formulation. A weighted Chebyshev penalty algorithm is introduced that drives the iterates to such stationary points at a finite-time rate of O(S T^{-1/2}) in both deterministic and stochastic regimes. Sweeping the preference vector across the simplex traces different points on the Pareto front, and every ECMO solution maps back to a solution of the original bilevel problem.

Core claim

By recasting the multi-task bilevel learning problem with lower-level general convexity as an equality-constrained multi-objective optimization problem, the weighted Chebyshev penalty algorithm converges in finite time to KKT-based Pareto stationary points at rate O(S T^{-1/2}) in both deterministic and stochastic regimes; varying the preference vector systematically explores the Pareto front, and the ECMO solutions translate directly into solutions for the original bilevel problem.
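
For reference, the stationarity target named in the core claim can be written compactly. The block below is a cleaned-up rendering of the condition as it appears in the paper's text near its Figure 9; the Lagrangian definition in the comment is an assumption added here for readability, since this summary does not spell it out.

```latex
% Assumed per-objective Lagrangian (illustrative, not quoted from the paper):
%   L_s(z, v) = f_s(z) + v^\top h(z)
\mathrm{PS}(z, v, \alpha)
  = \begin{bmatrix} \nabla F(z)\,\alpha + \nabla h(z)\,v \\ h(z) \end{bmatrix}
  = 0,
\qquad \alpha \in \Delta_S^{+}, \quad v \in \mathbb{R}^{q}.
```

The first block row asks that some simplex-weighted combination of the objective gradients be cancelled by the constraint gradients; the second block row enforces feasibility, h(z) = 0, directly.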

What carries the argument

The weighted Chebyshev penalty algorithm, which scalarizes the multiple objectives with a weighted Chebyshev function and penalizes violations of the equality constraints to drive convergence to KKT-based Pareto stationarity.
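
As a point of reference, the weighted Chebyshev (WC) scalarization can be sketched as follows. The epigraph form labelled (WC) follows the paper's own statement; the min-max form above it is the standard textbook version and is included here as an assumption about the starting point, with preference vector λ, objectives f_s, and equality constraints h_i.

```latex
% Standard min-max form (assumed starting point):
\min_{z} \; \max_{s \in [S]} \; \lambda_s f_s(z)
\quad \text{s.t.} \quad h_i(z) = 0, \; i = 1, \dots, q.

% Epigraph reformulation that removes the non-smooth max (WC):
\min_{\rho,\, z} \; \rho
\quad \text{s.t.} \quad h_i(z) = 0, \; i = 1, \dots, q,
\qquad \lambda_s f_s(z) \le \rho, \; s \in [S].
```

The auxiliary variable ρ upper-bounds every weighted objective, so minimizing ρ is equivalent to minimizing the largest weighted objective.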

If this is right

  • Solutions of the reformulated equality-constrained multi-objective problem are guaranteed to solve the original multi-task bilevel learning problem.
  • The same O(S T^{-1/2}) convergence rate holds for both deterministic and stochastic versions of the problem.
  • Sweeping the preference vector over the simplex produces a systematic sampling of the Pareto front.
  • The KKT-based Pareto stationarity notion serves as a well-defined stopping criterion for algorithm design on this new problem class.
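
The preference sweep described above can be illustrated on a toy problem. The sketch below is not the paper's Algorithm 1: it is a generic subgradient method applied to a weighted Chebyshev objective plus a quadratic penalty on the constraint, and the two quadratic objectives, the constraint z[1] = 1, the penalty weight `mu`, and the step size are all hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical toy data: f1 and f2 are squared distances to A and B.
A = np.array([0.0, 0.0])
B = np.array([2.0, 0.0])
GRAD_H = np.array([0.0, 1.0])  # gradient of h(z) = z[1] - 1


def objectives(z):
    """Return the vector (f1(z), f2(z))."""
    return np.array([np.sum((z - A) ** 2), np.sum((z - B) ** 2)])


def objective_grads(z):
    """Return the gradients of f1 and f2 stacked row-wise."""
    return np.stack([2.0 * (z - A), 2.0 * (z - B)])


def h(z):
    """Equality constraint h(z) = 0, here the line z[1] = 1."""
    return z[1] - 1.0


def wc_penalty_descent(lam, mu=100.0, step=0.005, iters=3000):
    """Subgradient descent on max_s lam_s f_s(z) + (mu / 2) h(z)^2."""
    z = np.array([1.0, 0.0])
    for _ in range(iters):
        # A subgradient of the max is the gradient of any active term.
        s = int(np.argmax(lam * objectives(z)))
        g = lam[s] * objective_grads(z)[s] + mu * h(z) * GRAD_H
        z = z - step * g
    return z


if __name__ == "__main__":
    # Sweep the preference vector across the 2-simplex.
    for lam1 in (0.2, 0.35, 0.5, 0.65, 0.8):
        lam = np.array([lam1, 1.0 - lam1])
        z = wc_penalty_descent(lam)
        print(lam, z, objectives(z), abs(h(z)))
```

In this toy setting, increasing the weight on f1 moves the returned point toward A along the constraint line, which is the trade-off behaviour a preference sweep is meant to expose. The quadratic penalty leaves a small residual constraint violation that shrinks as `mu` grows.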

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reformulation may allow existing multi-objective solvers to be applied directly to other bilevel problems that satisfy the same convexity condition.
  • The finite-time rate suggests the method remains practical when the number of tasks S is moderate and the iteration budget T is large.
  • Connecting bilevel and equality-constrained multi-objective frameworks could motivate similar translations for other constrained learning problems.
  • Empirical tests on standard multi-task benchmarks would check whether the direct solution mapping holds in practice.
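
To make the remark about S and T concrete, a bound of the form C · S / sqrt(T) ≤ ε rearranges to T ≥ (C · S / ε)^2. The helper below is a back-of-the-envelope sketch only: the constant C hidden in the O(·) notation is unknown here, so `constant=1.0` is a placeholder assumption and the outputs are useful for scaling comparisons, not as real iteration counts.

```python
import math


def iterations_needed(num_objectives, eps, constant=1.0):
    """Smallest T with constant * S / sqrt(T) <= eps (constant is assumed)."""
    return math.ceil((constant * num_objectives / eps) ** 2)


if __name__ == "__main__":
    # Doubling the number of objectives quadruples the iteration budget.
    print(iterations_needed(5, 0.25))
    print(iterations_needed(10, 0.25))
```

The quadratic dependence on both S and 1/ε is the practical reading of the rate: halving the target tolerance or doubling the number of tasks each multiplies the budget by four.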

Load-bearing premise

The multi-task bilevel learning problem with lower-level general convexity can be rewritten exactly as an equality-constrained multi-objective optimization problem whose solutions are also solutions to the original bilevel problem.

What would settle it

A concrete multi-task bilevel instance with lower-level general convexity in which the algorithm either fails to reach KKT Pareto stationarity after T steps or produces a point that does not solve the original bilevel problem.

Figures

Figures reproduced from arXiv: 2605.09094 by Jia Liu, Jiaxiang Li, Myeung Suk Oh, Xin Zhang, Zhen Qin, Zhiyao Zhang.

Figure 1: Roadmap of our proposed approach for solving the MTBL problem under the LLGC assumption.
Figure 2: z̃ is Pareto stationary but violates Definition 3.
Figure 3: One-to-one mapping between ECMO and its WC-scalarized problem.
Figure 5: Steps to prove Theorem 3.
Figure 6: Data weighting for RLHF reward model training. (a) Pareto exploration. (b) Baseline comparison.
Figure 8: ϵ-metric results in LLM alignment.
Figure 9: Pareto stationary and nonstationary examples in ECMO problems.
Figure 10: A toy example.
Figure 11: Additional results for Pareto exploration.
Figure 12: Additional results for Convergence Performance. (a) ITD with LS. (b) SOBA with LS. (c) Comparison.
Figure 13: Additional results on bilevel algorithms.
Figure 14: Additional results for Pareto exploration. (a) SOBA with LS. (b) ITD with LS.
Figure 15: Additional results on bilevel algorithms.
Figure 16: Additional results on MTBL algorithms. (a) Pareto exploration (3B). (b) Baseline comparison (3B). (c) Pareto exploration (8B).
Figure 17: Data weighting task in larger-scale (3B & 8B) LLM alignment.
Figure 18: ϵ-metric.
Figure 20
Figure 21: Additional synthetic examples.
Original abstract

In recent years, bilevel optimization (BLO) has attracted significant attention for its broad applications in machine learning. However, most existing works on BLO remain confined to the single-task setting and rely on the lower-level strong convexity assumption, which significantly restricts their applicability to modern machine learning problems of growing complexity. In this paper, we make the first attempt to extend BLO to the multi-task setting under a relaxed lower-level general convexity (LLGC) assumption. To this end, we reformulate the multi-task bilevel learning (MTBL) problem with LLGC into an equality constrained multi-objective optimization (ECMO) problem. However, ECMO itself is a new problem that has not yet been studied in the literature. To address this gap, we first establish a new Karush-Kuhn-Tucker (KKT)-based Pareto stationarity as the convergence criterion for ECMO algorithm design. Based on this foundation, we propose a weighted Chebyshev (WC)-penalty algorithm that achieves a finite-time convergence rate of $O(ST^{-\frac{1}{2}})$ to KKT-based Pareto stationarity in both deterministic and stochastic settings, where $S$ denotes the number of objectives, and $T$ is the total iterations. Moreover, by varying the preference vector over the $S$-dimensional simplex, our WC-penalty method systematically explores the Pareto front. Finally, solutions to the ECMO problem translate directly into solutions for the original MTBL problem, thereby closing the loop between these two foundational optimization frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper extends single-task bilevel optimization to the multi-task setting under a relaxed lower-level general convexity (LLGC) assumption. It reformulates the multi-task bilevel learning (MTBL) problem as an equality-constrained multi-objective optimization (ECMO) problem, introduces a KKT-based Pareto stationarity notion for ECMO, and proposes a weighted Chebyshev penalty algorithm that converges at rate O(S T^{-1/2}) to this stationarity in both deterministic and stochastic regimes. The method explores the Pareto front by varying the preference vector, and the paper claims that ECMO solutions map directly back to MTBL solutions.

Significance. If the reformulation equivalence holds, the work is significant for relaxing strong-convexity assumptions that limit bilevel methods in modern multi-task ML, while providing the first finite-time rate for this new ECMO class and a practical Pareto-front exploration mechanism. The explicit bridging of MTBL and ECMO frameworks could enable new algorithmic designs if the KKT mapping is shown to be bijective.

major comments (2)
  1. [Reformulation section (likely §3)] The central claim that 'solutions to the ECMO problem translate directly into solutions for the original MTBL problem' (abstract and reformulation section) requires an explicit bijection proof. Under LLGC the lower-level argmin set need not be singleton; the equality constraints must therefore encode the entire solution set exactly. If the reformulation only enforces stationarity (rather than global optimality) of the lower level, KKT-based Pareto stationary points of ECMO can include points that are infeasible or suboptimal for MTBL. Provide the full mapping and verification that every ECMO KKT point corresponds to an MTBL solution and vice versa.
  2. [Convergence analysis (likely §4–5)] The O(S T^{-1/2}) finite-time rate (abstract and convergence analysis) is derived for the ECMO formulation. Because the rate is advertised for the motivating MTBL problem, any gap in the equivalence immediately weakens the claim; the analysis must either prove the mapping preserves stationarity exactly or state the precise conditions under which the rate carries over to MTBL.
minor comments (1)
  1. [Abstract] In the abstract the rate is written as O(ST^{-1/2}); confirm that the main text consistently defines S as the number of objectives and clarifies whether the bound is in terms of total iterations T or per-objective iterations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review and valuable feedback on our work. We have carefully considered the major comments regarding the reformulation equivalence and the transfer of convergence rates. Below, we provide point-by-point responses and outline the revisions we will make to address these concerns.

Point-by-point responses
  1. Referee: [Reformulation section (likely §3)] The central claim that 'solutions to the ECMO problem translate directly into solutions for the original MTBL problem' (abstract and reformulation section) requires an explicit bijection proof. Under LLGC the lower-level argmin set need not be singleton; the equality constraints must therefore encode the entire solution set exactly. If the reformulation only enforces stationarity (rather than global optimality) of the lower level, KKT-based Pareto stationary points of ECMO can include points that are infeasible or suboptimal for MTBL. Provide the full mapping and verification that every ECMO KKT point corresponds to an MTBL solution and vice versa.

    Authors: We thank the referee for highlighting this critical aspect of the reformulation. The manuscript constructs the ECMO equality constraints directly from the variational inequality characterization of the lower-level argmin set under the LLGC assumption (i.e., 0 ∈ ∂f(x,y) + N_Y(y) for each task), which is necessary and sufficient for global optimality when the lower level is convex. This encodes the full (possibly non-singleton) solution set without requiring strong convexity. To make the bijection fully explicit, we will add a dedicated theorem and proof in the revised reformulation section establishing that (i) every MTBL feasible point maps to a feasible ECMO point with identical objective values, and (ii) every KKT-based Pareto stationary point of the ECMO problem corresponds to a point satisfying the MTBL optimality conditions. This revision will eliminate any ambiguity about extraneous stationary points. revision: yes

  2. Referee: [Convergence analysis (likely §4–5)] The O(S T^{-1/2}) finite-time rate (abstract and convergence analysis) is derived for the ECMO formulation. Because the rate is advertised for the motivating MTBL problem, any gap in the equivalence immediately weakens the claim; the analysis must either prove the mapping preserves stationarity exactly or state the precise conditions under which the rate carries over to MTBL.

    Authors: We agree that the finite-time rate is formally derived for the ECMO problem. In the revised manuscript we will insert a corollary immediately following the main convergence theorem. The corollary will state that, under the LLGC assumption and the bijection established in the reformulation section, any sequence converging to KKT-based Pareto stationarity in ECMO at rate O(S T^{-1/2}) yields a sequence converging at the same rate to the corresponding stationarity notion for the original MTBL problem. We will also add a short remark clarifying the exact conditions (convexity of the lower level and the preference-vector parameterization) under which the mapping preserves stationarity exactly. This ensures the advertised rate for MTBL is rigorously justified. revision: yes
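
The point about non-singleton lower-level solution sets can be checked on a small example. The sketch below uses a hypothetical smooth, unconstrained lower-level objective g(x, y) = 0.5 (y1 + y2 - x)^2, which is convex but not strongly convex in y. In that special case the optimality condition quoted in the rebuttal reduces to a zero gradient in y; whether the paper's constraints take exactly this form is an assumption.

```python
import numpy as np

# Hessian of g in y is constant; a zero eigenvalue rules out strong convexity.
HESSIAN = np.ones((2, 2))


def lower_value(x, y):
    """Hypothetical lower-level objective g(x, y) = 0.5 * (y1 + y2 - x)^2."""
    return 0.5 * (y.sum() - x) ** 2


def lower_grad(x, y):
    """Gradient of g with respect to y."""
    return (y.sum() - x) * np.ones(2)


x = 1.0
# Several distinct minimizers: every y with y1 + y2 = x is optimal.
solutions = [np.array([t, x - t]) for t in (-1.0, 0.0, 0.5, 2.0)]

if __name__ == "__main__":
    print("Hessian eigenvalues:", np.linalg.eigvalsh(HESSIAN))
    for y in solutions:
        print(y, lower_value(x, y), lower_grad(x, y))
    # A point off the solution line has a nonzero gradient.
    print(lower_grad(x, np.array([1.0, 1.0])))
```

Here the single equality lower_grad(x, y) = 0 describes the whole line of minimizers, and because g is convex in y, every point satisfying it is globally optimal. This is the mechanism the referee asks the authors to prove in general.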

Circularity Check

0 steps flagged

No significant circularity; reformulation and convergence analysis are independent contributions

full rationale

The paper defines a reformulation of MTBL under LLGC into ECMO as a modeling step, then introduces a new WC-penalty algorithm and derives its O(ST^{-1/2}) convergence to KKT Pareto stationarity for the ECMO problem. Solutions are asserted to translate back to MTBL by the reformulation itself. No quoted equations show a fitted parameter renamed as prediction, no self-citation chain justifies the central claim, and the convergence rate is presented as a new finite-time result rather than reducing to prior inputs by construction. The derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the LLGC assumption permitting a valid reformulation to ECMO and on the new KKT-based Pareto stationarity serving as a suitable convergence criterion.

axioms (1)
  • [domain assumption] Lower-level general convexity (LLGC) assumption
    Relaxes the strong convexity typically required for the lower-level problem in bilevel optimization to enable the multi-task extension.

pith-pipeline@v0.9.0 · 5593 in / 1294 out tokens · 48433 ms · 2026-05-15T04:53:13.038199+00:00 · methodology

