A Tale of Two Problems: Multi-Task Bilevel Learning Meets Equality Constrained Multi-Objective Optimization

Jia Liu; Jiaxiang Li; Myeung Suk Oh; Xin Zhang; Zhen Qin; Zhiyao Zhang

arxiv: 2605.09094 · v2 · submitted 2026-05-09 · 💻 cs.LG

Zhiyao Zhang, Myeung Suk Oh, Zhen Qin, Jiaxiang Li, Xin Zhang, Jia Liu

Pith reviewed 2026-05-15 04:53 UTC · model grok-4.3

classification 💻 cs.LG

keywords multi-task bilevel learning · equality constrained multi-objective optimization · weighted Chebyshev penalty · Pareto stationarity · finite-time convergence · stochastic optimization · machine learning

The pith

Reformulating multi-task bilevel learning under general convexity as an equality-constrained multi-objective problem lets a weighted Chebyshev penalty algorithm reach KKT Pareto stationarity at rate O(S T^{-1/2}).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that multi-task bilevel learning problems, common in modern machine learning, can be turned into equality-constrained multi-objective optimization problems when the lower level obeys only a general convexity condition rather than strong convexity. This change of perspective supplies a new convergence target: KKT-based Pareto stationarity for the multi-objective formulation. A weighted Chebyshev penalty algorithm is introduced that drives the iterates to such stationary points at a finite-time rate of O(S T^{-1/2}) in both deterministic and stochastic regimes. Sweeping the preference vector across the simplex traces different points on the Pareto front, and every ECMO solution maps back to a solution of the original bilevel problem.

Core claim

By recasting the multi-task bilevel learning problem with lower-level general convexity as an equality-constrained multi-objective optimization problem, the weighted Chebyshev penalty algorithm converges in finite time to KKT-based Pareto stationary points at rate O(S T^{-1/2}) in both deterministic and stochastic regimes; varying the preference vector systematically explores the Pareto front, and the ECMO solutions translate directly into solutions for the original bilevel problem.

What carries the argument

The weighted Chebyshev penalty algorithm, which scalarizes the multiple objectives with a weighted Chebyshev function and penalizes violations of the equality constraints to drive convergence to KKT-based Pareto stationarity.
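The scalarize-then-penalize idea can be sketched on a toy problem. The snippet below is a minimal illustration and not the paper's Algorithm 1: it assumes a plain subgradient step on max_s λ_s f_s(z) + (μ/2)‖h(z)‖², with a fixed penalty weight `mu`, a diminishing step size, and two hypothetical quadratic objectives. Every function name, parameter, and constant here is an assumption made for illustration.

```python
import numpy as np

def wc_penalty_descent(fs, grads, h, jac_h, lam, z0, mu=100.0, eta=0.004, T=2000):
    """Subgradient descent on max_s lam[s]*f_s(z) + (mu/2)*||h(z)||^2.

    Illustrative stand-in only, not the paper's WC-penalty algorithm.
    """
    z = np.asarray(z0, dtype=float)
    for t in range(T):
        # Weighted Chebyshev scalarization: pick the currently worst weighted objective.
        weighted = np.array([l * f(z) for l, f in zip(lam, fs)])
        s = int(np.argmax(weighted))
        # Subgradient of the max term plus gradient of the quadratic penalty.
        g = lam[s] * grads[s](z) + mu * jac_h(z).T @ h(z)
        z = z - eta / np.sqrt(t + 1.0) * g
    return z

# Hypothetical toy instance: two quadratics and one linear equality constraint.
a, b = np.array([2.0, 0.0]), np.array([0.0, 2.0])
fs = [lambda z: np.sum((z - a) ** 2), lambda z: np.sum((z - b) ** 2)]
grads = [lambda z: 2.0 * (z - a), lambda z: 2.0 * (z - b)]
h = lambda z: np.array([z[0] + z[1] - 1.0])   # h(z) = 0 defines the feasible line
jac_h = lambda z: np.array([[1.0, 1.0]])      # Jacobian of h, shape (q, k)

z_hat = wc_penalty_descent(fs, grads, h, jac_h, lam=np.array([0.5, 0.5]), z0=[0.0, 0.0])
print(z_hat, h(z_hat))
```

With equal weights the iterate should settle close to (0.5, 0.5), where both weighted objectives are balanced. Because the penalty weight is finite, a small constraint violation remains; the step size `eta` is tuned to this toy instance and would need adjusting for other values of `mu`.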

If this is right

Solutions of the reformulated equality-constrained multi-objective problem are guaranteed to solve the original multi-task bilevel learning problem.
The same O(S T^{-1/2}) convergence rate holds for both deterministic and stochastic versions of the problem.
Sweeping the preference vector over the simplex produces a systematic sampling of the Pareto front.
The KKT-based Pareto stationarity notion serves as a well-defined stopping criterion for algorithm design on this new problem class.
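To make the stopping criterion concrete: the excerpt accompanying Figure 9 on this page describes a quantity PS(z, v, α) that stacks ∇F(z)α + ∇h(z)v on top of h(z) and requires it to vanish. The sketch below evaluates the norm of that stacked vector for given multipliers. The toy objectives, the constraint, and the function names are hypothetical, and a real check would also need to search over α and v rather than take them as given.

```python
import numpy as np

def ps_residual(grad_F, grad_h, h_val, alpha, v):
    """Norm of the stacked vector [grad_F @ alpha + grad_h @ v; h_val].

    grad_F: (k, S) matrix whose columns are the objective gradients at z.
    grad_h: (k, q) matrix whose columns are the constraint gradients at z.
    """
    stationarity = grad_F @ alpha + grad_h @ v
    return float(np.linalg.norm(np.concatenate([stationarity, h_val])))

# Hypothetical toy: f1 = ||z - a||^2, f2 = ||z - b||^2, h(z) = z0 + z1 - 1.
a, b = np.array([2.0, 0.0]), np.array([0.0, 2.0])

def toy_terms(z):
    grad_F = np.column_stack([2.0 * (z - a), 2.0 * (z - b)])
    grad_h = np.array([[1.0], [1.0]])
    h_val = np.array([z[0] + z[1] - 1.0])
    return grad_F, grad_h, h_val

alpha, v = np.array([0.5, 0.5]), np.array([1.0])
res_mid = ps_residual(*toy_terms(np.array([0.5, 0.5])), alpha, v)  # feasible and balanced
res_end = ps_residual(*toy_terms(np.array([1.0, 0.0])), alpha, v)  # feasible, nonzero residual
print(res_mid, res_end)
```

In this toy, the residual vanishes at (0.5, 0.5) for the chosen multipliers, while the feasible point (1, 0) leaves a nonzero residual for the same multipliers. A practical stopping rule would compare such a residual against a tolerance.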

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reformulation may allow existing multi-objective solvers to be applied directly to other bilevel problems that satisfy the same convexity condition.
The finite-time rate suggests the method remains practical when the number of tasks S is moderate and the iteration budget T is large.
Connecting bilevel and equality-constrained multi-objective frameworks could motivate similar translations for other constrained learning problems.
Empirical tests on standard multi-task benchmarks would check whether the direct solution mapping holds in practice.
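A back-of-the-envelope calculation shows what the rate implies for the iteration budget. It assumes the bound has the form residual ≤ C·S/√T, where the constant C is hypothetical (this page does not state it) and ε is a target accuracy.

```latex
\[
\frac{C\,S}{\sqrt{T}} \le \epsilon
\quad\Longleftrightarrow\quad
T \ge \left(\frac{C\,S}{\epsilon}\right)^{2}.
\]
% Example with C = 1, S = 5, \epsilon = 0.1:
\[
T \ge \left(\frac{1 \cdot 5}{0.1}\right)^{2} = 50^{2} = 2500 .
\]
```

Under this reading, doubling the number of objectives S quadruples the iterations needed for the same accuracy, which is why a moderate S matters.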

Load-bearing premise

The multi-task bilevel learning problem with lower-level general convexity can be rewritten exactly as an equality-constrained multi-objective optimization problem whose solutions are also solutions to the original bilevel problem.

What would settle it

A concrete multi-task bilevel instance with lower-level general convexity in which the algorithm either fails to reach KKT Pareto stationarity after T steps or produces a point that does not solve the original bilevel problem.

Figures

Figures reproduced from arXiv: 2605.09094 by Jia Liu, Jiaxiang Li, Myeung Suk Oh, Xin Zhang, Zhen Qin, Zhiyao Zhang.

**Figure 1.** Roadmap of our proposed approach for solving the MTBL problem under the LLGC assumption. … Momma et al., 2022; Fernando et al., 2023)), the majority of existing works only considered unconstrained MOO. Meanwhile, constrained MOO problems, including ECMO, are still in their infancy. To date, although several heuristic algorithms have been proposed for ECMO and empirically validated (Qu & Suganthan, 2011; Yan…

**Figure 2.** $\tilde{z}$ is Pareto stationary but violates Definition 3. … direction exists locally. Note that for unconstrained MOO problems, Pareto stationarity can be defined as follows: Definition 3 (Pareto Stationarity for Unconstrained MOO). For the unconstrained MOO problem $\min_z F(z)^\top = (f_1(z), \ldots, f_S(z))$, $\tilde{z}$ is a Pareto stationary point if and only if there does not exist a direction $d \in \mathbb{R}^k$ such that $\nabla f_s(\tilde{z})^\top d <$ …
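The unconstrained definition can be checked numerically. The excerpt's inequality is truncated, so the sketch below assumes the standard reading, namely that no direction d satisfies ∇f_s(z̃)ᵀd < 0 for every s. By Gordan's theorem this is equivalent to the zero vector lying in the convex hull of the gradients. For two objectives the smallest-norm convex combination has a closed form; the function names and the tolerance are illustrative.

```python
import numpy as np

def min_norm_combo(g1, g2):
    """Smallest-norm point alpha*g1 + (1-alpha)*g2 with alpha in [0, 1]."""
    g1, g2 = np.asarray(g1, float), np.asarray(g2, float)
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:  # identical gradients: every alpha gives g1
        return 1.0, g1
    # Unconstrained minimizer of ||g2 + alpha*(g1 - g2)||^2, clipped to [0, 1].
    alpha = float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))
    return alpha, alpha * g1 + (1.0 - alpha) * g2

def is_pareto_stationary(g1, g2, tol=1e-8):
    """True when zero lies (numerically) in the convex hull of the two gradients."""
    _, d = min_norm_combo(g1, g2)
    return bool(np.linalg.norm(d) <= tol)

opposed = is_pareto_stationary([1.0, 0.0], [-1.0, 0.0])    # gradients cancel
orthogonal = is_pareto_stationary([1.0, 0.0], [0.0, 1.0])  # common descent direction exists
print(opposed, orthogonal)
```

Opposed gradients leave no common descent direction, so the point is stationary; orthogonal gradients admit one (for instance d = (-1, -1)), so it is not.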

**Figure 3.** One-to-one mapping between ECMO and its WC-scalarized problem. … the non-smoothness of “min-max” operation introduced by the $\ell_\infty$-norm minimization, we can further reformulate the WC-scalarization for the ECMO problem as follows: $\min_{\rho, z} \rho$ s.t. $h_i(z) = 0$, $i = 1, \ldots, q$, $\lambda_s f_s(z) \le \rho$, $s \in [S]$. (WC) It is well known in the MOO literature that there exists a one-to-one mapping between the solutions of WC-scala…
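The step from the “min-max” form to (WC) is the standard epigraph reformulation. The sketch below assumes the scalarized objective is $\max_s \lambda_s f_s(z)$, for instance when the objectives are nonnegative or measured from a reference point so that the $\ell_\infty$ norm reduces to a maximum. The excerpt does not show this intermediate form, so it is an assumption.

```latex
\begin{align*}
&\min_{z}\ \max_{s \in [S]} \lambda_s f_s(z)
  \quad \text{s.t. } h_i(z) = 0,\ i = 1, \ldots, q \\
\Longleftrightarrow\quad
&\min_{\rho,\, z}\ \rho
  \quad \text{s.t. } h_i(z) = 0,\ i = 1, \ldots, q,
  \quad \lambda_s f_s(z) \le \rho,\ s \in [S].
\end{align*}
```

At an optimum of the second problem, $\rho$ equals $\max_s \lambda_s f_s(z)$, so the nonsmooth maximum is traded for $S$ smooth inequality constraints.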

**Figure 5.** Steps to prove Theorem 3. … Pareto stationarity for any given preference weight vector $\lambda$. Moreover, according to the previous discussions on WC-scalarization, by varying $\lambda$ over $\Delta_S^{++}$, Algorithm 1 can systematically explore the entire Pareto stationary front. 5. Returning to MTBL Problems through the Lens of ECMO. Finally, we can easily solve the MTBL problem as a special case of the ECMO problem: we first s…

**Figure 6.** Data weighting for RLHF reward model training. (a) Pareto exploration. (b) Baseline comparison.

**Figure 8.** $\epsilon$-metric results in LLM alignment. … $\lambda_{s'} = 0.01$, $\forall s' \neq s$, using 1/loss as our metric for each objective. Fig. 7a shows that Alg. 1 is able to achieve Pareto stationary points with better performance on specific objectives when larger weights are assigned to them, again verifying the Pareto exploration capability of our algorithm. Moreover, Fig. 7b indicates that our algorithm outperforms two bilevel basel…

**Figure 9.** Pareto stationary and nonstationary examples in ECMO problems. … consider Pareto stationarity: $\mathrm{PS}(z, v, \alpha) = \begin{bmatrix} \nabla L_1(z, v) & \cdots & \nabla L_S(z, v) \\ h(z) & \cdots & h(z) \end{bmatrix} \alpha = \begin{bmatrix} \nabla F(z)\alpha + \nabla h(z) v \\ h(z) \end{bmatrix} = 0$, where $\alpha \in \Delta_S^{+}$ and $v \in \mathbb{R}^q$. PS not only takes the feasible direction into account, but also enforces the feasibility directly. It precisely captures both the Pareto stationary and nonstationary scenarios depicted in…

**Figure 10.** A toy example. … Therefore, we obtain $\min_v \|K(z_T, v, \lambda)\|_2^2 \le \frac{L^2}{T} \|z_0 - z^*\|_2^2$. According to the argument about KKT system and Lemma 2, we know that Algorithm 2 converges to weakly Pareto optimal solutions at a rate of $O(T^{-1})$. In addition, we can also traverse $\lambda$ over $\Delta_S^{+}$ to let Algorithm 2 reconstruct the entire weak Pareto front. To give a more concrete example, we provide a concrete example to …

**Figure 11.** Additional results for Pareto exploration. … prioritize specific objectives, our algorithm yields a lower validation for those objectives compared to the case of using alternative preference vectors. 2) Additional Numerical Results. We now provide more numerical results on this data weighting for reward model training task, accompanied by discussions to emphasize the advantages of Algorithm 1 in this subsec…

**Figure 12.** Additional results for Convergence Performance. (a) ITD with LS. (b) SOBA with LS. (c) Comparison.

**Figure 13.** Additional results on bilevel algorithms. … we omit the use of implicit gradient methods (Ghadimi & Wang, 2018; Ji et al., 2021) to compute the Hessian inverse, significantly reducing computational costs. The best slope of our approach in Figure 12b further validates its convergence performance. Specifically, as illustrated in Theorem 3, our WC-Penalty algorithm achieves a convergence rate of $O(S/T^{1/2})$ fo…

**Figure 14.** Additional results for Pareto exploration. (a) SOBA with LS. (b) ITD with LS.

**Figure 15.** Additional results on bilevel algorithms. 1. Pareto Exploration.

**Figure 16.** Additional results on MTBL algorithms. (a) Pareto exploration (3B). (b) Baseline comparison (3B). (c) Pareto exploration (8B).

**Figure 17.** Data weighting task in larger-scale (3B & 8B) LLM alignment. … mentioned in the setup, we set the inner-loop iterations (if applicable) as 40 for every algorithm. Nevertheless, this leads to “CUDA out of memory” error when implementing the FORUM algorithm, since 1) its workflows are overly complicated, and 2) its maintained values are extremely memory-consuming. In fact, in our GPUs with 94GB of memory each…

**Figure 18.** $\epsilon$-metric. … Moreover, we also compare our Algorithm 1 with MTBL baselines (Ye et al., 2021; Fernando et al., 2023) with two important metrics, hypervolume and $\epsilon$-metric. Table 4 demonstrates that our algorithm dramatically outperforms the baselines even before completing full Pareto exploration (labeled as Helpfulness, etc.) in terms of hypervolume, and the Pareto exploration still leads to better perform…

**Figure 20.**

**Figure 21.** Additional synthetic examples. (i) Consider an LLGC-MTBL problem with $x, y \in \mathbb{R}$: $\min_{x,y}$ …

Original abstract

In recent years, bilevel optimization (BLO) has attracted significant attention for its broad applications in machine learning. However, most existing works on BLO remain confined to the single-task setting and rely on the lower-level strong convexity assumption, which significantly restricts their applicability to modern machine learning problems of growing complexity. In this paper, we make the first attempt to extend BLO to the multi-task setting under a relaxed lower-level general convexity (LLGC) assumption. To this end, we reformulate the multi-task bilevel learning (MTBL) problem with LLGC into an equality constrained multi-objective optimization (ECMO) problem. However, ECMO itself is a new problem that has not yet been studied in the literature. To address this gap, we first establish a new Karush-Kuhn-Tucker (KKT)-based Pareto stationarity as the convergence criterion for ECMO algorithm design. Based on this foundation, we propose a weighted Chebyshev (WC)-penalty algorithm that achieves a finite-time convergence rate of $O(ST^{-\frac{1}{2}})$ to KKT-based Pareto stationarity in both deterministic and stochastic settings, where $S$ denotes the number of objectives, and $T$ is the total number of iterations. Moreover, by varying the preference vector over the $S$-dimensional simplex, our WC-penalty method systematically explores the Pareto front. Finally, solutions to the ECMO problem translate directly into solutions for the original MTBL problem, thereby closing the loop between these two foundational optimization frameworks.
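Sweeping the preference vector over the simplex can be sketched as follows. This is a hypothetical helper, not code from the paper: it places a chosen weight on one objective and splits the remainder evenly over the others. The 0.01 off-weight matches the Figure 8 excerpt on this page; S = 5 and the remaining weights are assumptions for illustration.

```python
import numpy as np

def preference_vector(S, s, weight):
    """Put `weight` on objective s and split the rest evenly over the others."""
    if S < 2 or not (0.0 < weight < 1.0):
        raise ValueError("need S >= 2 and 0 < weight < 1")
    lam = np.full(S, (1.0 - weight) / (S - 1))
    lam[s] = weight
    return lam

def preference_sweep(S, weight):
    """One preference vector per objective, each favoring a different objective."""
    return [preference_vector(S, s, weight) for s in range(S)]

strong = preference_sweep(5, 0.96)  # 0.96 on one objective, 0.01 on each of the other four
equal = np.full(5, 0.2)             # uniform preference over five objectives
for lam in strong:
    print(lam, lam.sum())
```

Each vector has strictly positive entries that sum to one, so it lies in the interior of the simplex. Running a scalarized solver once per vector yields one candidate point per preference.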

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper reformulates multi-task bilevel learning under lower-level general convexity as an equality-constrained multi-objective problem and gives a weighted Chebyshev penalty algorithm with an O(S T^{-1/2}) rate, but the reformulation equivalence is the part that needs the closest check.

The main thing here is the reformulation step that turns the multi-task bilevel problem with relaxed lower-level convexity into an equality-constrained multi-objective problem, followed by a new KKT-based Pareto stationarity notion and a penalty algorithm that reaches the stated rate in both deterministic and stochastic cases. Varying the preference vector then traces out the Pareto front, and the claim is that solutions map straight back to the original bilevel problem. That is the concrete new piece: extending bilevel work beyond single-task strong convexity while importing tools from multi-objective optimization.

The rate is explicit and covers stochastic noise, which matters for machine-learning settings like hyperparameter tuning across tasks. The abstract also keeps the steps readable, which helps when the two frameworks are being connected for the first time.

The soft spot is the reformulation itself. Under general convexity the lower-level argmin set need not be a singleton, so the equality constraints have to capture the entire set exactly if the mapping is supposed to be faithful in both directions. If the constraints only enforce stationarity rather than global optimality, or if they implicitly assume uniqueness, then KKT points for the reformulated problem can include points that are not feasible or optimal for the motivating bilevel problem. The convergence guarantee is proved for the equality-constrained version, so any gap there directly affects the claim for multi-task bilevel learning. The abstract does not show numerical checks, so it is also unclear how the method behaves on actual tasks or how it compares with existing bilevel solvers.

This is for people working on bilevel optimization for meta-learning or multi-task hyperparameter tuning who want rates without strong convexity. A reader interested in stationarity concepts or penalty methods for multi-objective problems will find the technical development worth looking at. It deserves peer review because the connection is new and the algorithm is concrete enough that referees can verify the equivalence and the proofs.
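The non-singleton concern is easy to reproduce on a hypothetical lower-level objective that is convex but not strongly convex, g(x, y) = ½(y₁ + y₂ − x)². Its minimizers in y form the whole line y₁ + y₂ = x. For an unconstrained, differentiable, convex lower level, ∇_y g(x, y) = 0 is necessary and sufficient for global optimality, so an equality constraint of that form would capture the entire argmin set in this instance. Whether the paper's constraints take this form is not stated on this page; the example and all names are illustrative.

```python
import numpy as np

def lower_grad(x, y):
    """Gradient in y of g(x, y) = 0.5 * (y[0] + y[1] - x)**2."""
    r = y[0] + y[1] - x
    return np.array([r, r])

x = 1.0
y_a = np.array([1.0, 0.0])      # one lower-level minimizer
y_b = np.array([0.25, 0.75])    # a different minimizer for the same x

# Constant Hessian of g in y; a zero eigenvalue rules out strong convexity.
hessian = np.array([[1.0, 1.0], [1.0, 1.0]])
min_eig = float(np.linalg.eigvalsh(hessian)[0])

print(lower_grad(x, y_a), lower_grad(x, y_b), min_eig)
```

Both points satisfy the gradient condition, and the smallest Hessian eigenvalue is zero. An upper-level objective evaluated on this instance therefore has to choose among infinitely many lower-level solutions, which is the situation the equivalence proof must handle.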

Referee Report

2 major / 1 minor

Summary. The paper extends single-task bilevel optimization to the multi-task setting under a relaxed lower-level general convexity (LLGC) assumption. It reformulates the multi-task bilevel learning (MTBL) problem as an equality-constrained multi-objective optimization (ECMO) problem, introduces a KKT-based Pareto stationarity notion for ECMO, and proposes a weighted Chebyshev penalty algorithm that converges at rate O(S T^{-1/2}) to this stationarity in both deterministic and stochastic regimes. The method explores the Pareto front via preference vectors, and claims that ECMO solutions map directly back to MTBL solutions.

Significance. If the reformulation equivalence holds, the work is significant for relaxing strong-convexity assumptions that limit bilevel methods in modern multi-task ML, while providing the first finite-time rate for this new ECMO class and a practical Pareto-front exploration mechanism. The explicit bridging of MTBL and ECMO frameworks could enable new algorithmic designs if the KKT mapping is shown to be bijective.

major comments (2)

[Reformulation section (likely §3)] The central claim that 'solutions to the ECMO problem translate directly into solutions for the original MTBL problem' (abstract and reformulation section) requires an explicit bijection proof. Under LLGC the lower-level argmin set need not be singleton; the equality constraints must therefore encode the entire solution set exactly. If the reformulation only enforces stationarity (rather than global optimality) of the lower level, KKT-based Pareto stationary points of ECMO can include points that are infeasible or suboptimal for MTBL. Provide the full mapping and verification that every ECMO KKT point corresponds to an MTBL solution and vice versa.
[Convergence analysis (likely §4–5)] The O(S T^{-1/2}) finite-time rate (abstract and convergence analysis) is derived for the ECMO formulation. Because the rate is advertised for the motivating MTBL problem, any gap in the equivalence immediately weakens the claim; the analysis must either prove the mapping preserves stationarity exactly or state the precise conditions under which the rate carries over to MTBL.

minor comments (1)

[Abstract] In the abstract the rate is written as O(ST^{-1/2}); confirm that the main text consistently defines S as the number of objectives and clarifies whether the bound is in terms of total iterations T or per-objective iterations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review and valuable feedback on our work. We have carefully considered the major comments regarding the reformulation equivalence and the transfer of convergence rates. Below, we provide point-by-point responses and outline the revisions we will make to address these concerns.

Referee: [Reformulation section (likely §3)] The central claim that 'solutions to the ECMO problem translate directly into solutions for the original MTBL problem' (abstract and reformulation section) requires an explicit bijection proof. Under LLGC the lower-level argmin set need not be singleton; the equality constraints must therefore encode the entire solution set exactly. If the reformulation only enforces stationarity (rather than global optimality) of the lower level, KKT-based Pareto stationary points of ECMO can include points that are infeasible or suboptimal for MTBL. Provide the full mapping and verification that every ECMO KKT point corresponds to an MTBL solution and vice versa.

Authors: We thank the referee for highlighting this critical aspect of the reformulation. The manuscript constructs the ECMO equality constraints directly from the variational inequality characterization of the lower-level argmin set under the LLGC assumption (i.e., 0 ∈ ∂f(x,y) + N_Y(y) for each task), which is necessary and sufficient for global optimality when the lower level is convex. This encodes the full (possibly non-singleton) solution set without requiring strong convexity. To make the bijection fully explicit, we will add a dedicated theorem and proof in the revised reformulation section establishing that (i) every MTBL feasible point maps to a feasible ECMO point with identical objective values, and (ii) every KKT-based Pareto stationary point of the ECMO problem corresponds to a point satisfying the MTBL optimality conditions. This revision will eliminate any ambiguity about extraneous stationary points. revision: yes
Referee: [Convergence analysis (likely §4–5)] The O(S T^{-1/2}) finite-time rate (abstract and convergence analysis) is derived for the ECMO formulation. Because the rate is advertised for the motivating MTBL problem, any gap in the equivalence immediately weakens the claim; the analysis must either prove the mapping preserves stationarity exactly or state the precise conditions under which the rate carries over to MTBL.

Authors: We agree that the finite-time rate is formally derived for the ECMO problem. In the revised manuscript we will insert a corollary immediately following the main convergence theorem. The corollary will state that, under the LLGC assumption and the bijection established in the reformulation section, any sequence converging to KKT-based Pareto stationarity in ECMO at rate O(S T^{-1/2}) yields a sequence converging at the same rate to the corresponding stationarity notion for the original MTBL problem. We will also add a short remark clarifying the exact conditions (convexity of the lower level and the preference-vector parameterization) under which the mapping preserves stationarity exactly. This ensures the advertised rate for MTBL is rigorously justified. revision: yes

Circularity Check

0 steps flagged

No significant circularity; reformulation and convergence analysis are independent contributions

The paper defines a reformulation of MTBL under LLGC into ECMO as a modeling step, then introduces a new WC-penalty algorithm and derives its O(ST^{-1/2}) convergence to KKT Pareto stationarity for the ECMO problem. Solutions are asserted to translate back to MTBL by the reformulation itself. No quoted equations show a fitted parameter renamed as prediction, no self-citation chain justifies the central claim, and the convergence rate is presented as a new finite-time result rather than reducing to prior inputs by construction. The derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the LLGC assumption permitting a valid reformulation to ECMO and on the new KKT-based Pareto stationarity serving as a suitable convergence criterion.

axioms (1)

domain assumption: Lower-level general convexity (LLGC) assumption
Relaxes the strong convexity typically required for the lower-level problem in bilevel optimization to enable the multi-task extension.


[1] [1]

Ye, F., Lin, B., Yue, Z., Guo, P., Xiao, Q., and Zhang, Y

URL https://openreview.net/forum? id=xJ5N8qrEPl. Ye, F., Lin, B., Yue, Z., Guo, P., Xiao, Q., and Zhang, Y . Multi-objective meta learning.Advances in Neural Information Processing Systems, 34:21338–21351, 2021. Ye, F., Lin, B., Cao, X., Zhang, Y ., and Tsang, I. W. A first-order multi-gradient algorithm for multi-objective bi-level optimization. InECAI 2...

work page 2021

[2] [2]

Recently, (Zhang et al., 2026), for the first time in the literature, investigates the Pareto front exploration, yet their approach requires the restrictive LLSC condition

provide algorithms with a convergence rate of O(ST − 1 2 ) and O(ST − 1 4 ), respectively. Recently, (Zhang et al., 2026), for the first time in the literature, investigates the Pareto front exploration, yet their approach requires the restrictive LLSC condition. However, all of these works heavily depend on the LLSC condition: not only is the algorithmic...

work page 2026

[3] [3]

In addition, we can also traverse λ over ∆+ S to let Algorithm 2 reconstruct the entire weak Pareto front

According to the argument about KKT system and Lemma 2, we know that Algorithm 2 converges to weakly Pareto optimal solutions at a rate of O(T −1). In addition, we can also traverse λ over ∆+ S to let Algorithm 2 reconstruct the entire weak Pareto front. To give a more concrete example, we provide a concrete example to show the performance of our proposed...

work page

[4] [4]

X s∈It (|¯ct,s −c t,s|+|c t,s|) #2 ≤4E

Ifa t,s <0≤b t,s, thenr 2 t,s =a 2 t,s ≤b 2 t,s/v2 =c 2 t,s, and we haves∈ J t. We can follow Step B.2 to obtain: X s∈Jt c2 t,s ≤ X s∈Jt ct,s !2 ≤ 2 √ S∥dt∥+ 1 v !2 . Here, we note that for eachs∈[S], only one of the cases holds. Therefore, we combine these results to get: SX s=1 r2 t,s ≤ ∥d t,δ∥2 + 4S∥dt∥2 + 2 v2 , which implies: 1 T T−1X t=0 SX s=1 r2 t...

work page

[5] [5]

In other words, The convergence rate isO(S/T 1 2 ). Remark 6.By comparing Algorithm 1 and Algorithm 3, along with their respective analyses, we identify that the key challenge in the stochastic scenario arises from thestochastic gradients. Specifically, due to the gap between the full gradient and its stochastic estimator, the analysis for Algorithm 3 bec...


[6] [6]


use the Chebyshev inequality to accurately bound the dual feasibility and complementary slackness terms in the KKT system; and 3) carefully select the batch sizes B and T to ensure finite-time convergence. E. Setups and Additional Results of Numerical ...


[7] [7]

Overview. The reward model scores LLM-generated responses to prompts based on human-aligned criteria in Reinforcement Learning from Human Feedback (RLHF)

Detailed Setup. Overview. The reward model scores LLM-generated responses to prompts based on human-aligned criteria in Reinforcement Learning from Human Feedback (RLHF). The multi-objective data weighting task aims to determine optimal weights over training datasets for training a reward model that maximizes multiple validation metrics in the Pareto sense. As ...


[8] [8]

Additional Numerical Results. In this subsection, we provide more numerical results on the data weighting task for reward model training, accompanied by discussions that emphasize the advantages of Algorithm 1


[9] [9]

slightly prefer

Pareto Exploration. In addition to the results demonstrated in Section 5, we select 5 additional preference vectors by setting λ_s = 0.84 for some s ∈ [S] and λ_{s'} = 0.04 for all s' ≠ s, referring to this as “slightly prefer” some objective in Figure 11a. This further verifies the Pareto exploration capability of Algorithm 1. Furthermore, to provide a cle...
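The “slightly prefer” preference vectors described above can be built as follows. This is a minimal sketch: the helper name `focused_preference` is hypothetical, and S = 5 is inferred from the stated weights (0.84 + 4 × 0.04 = 1) rather than taken from any code accompanying the paper.

```python
import numpy as np


def focused_preference(num_objectives, focus, focus_weight):
    """Put `focus_weight` on one objective and split the remainder evenly."""
    if not 0.0 < focus_weight < 1.0:
        raise ValueError("focus_weight must lie strictly between 0 and 1")
    rest = (1.0 - focus_weight) / (num_objectives - 1)
    lam = np.full(num_objectives, rest)
    lam[focus] = focus_weight
    return lam


S = 5  # assumed number of objectives
# One "slightly prefer" vector per objective: 0.84 on s, 0.04 elsewhere.
slightly_prefer = [focused_preference(S, s, 0.84) for s in range(S)]
for lam in slightly_prefer:
    print(np.round(lam, 2))
```

Every vector built this way is strictly positive and sums to one, so it stays inside the positive simplex over which the preferences are swept.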


[10] [10]

Beyond its Pareto exploration capability, we also highlight the good convergence behavior in Figure 12

Convergence Performance. Beyond its Pareto exploration capability, we also highlight the good convergence behavior in Figure 12. Specifically, we compare the running time of our algorithm with that of all baselines over T = 3,000 steps in Figure 12a. We average the loss over 5 trials for each algorithm and include the standard error bars to ensure stat...
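Averaging a loss curve over trials and attaching standard error bars can be sketched as below. The loss values here are synthetic placeholders generated only to exercise the function; they are not results from the paper, and the helper name `mean_and_standard_error` is an assumption.

```python
import numpy as np


def mean_and_standard_error(losses):
    """Per-step mean and standard error across trials.

    losses has shape (n_trials, n_steps). The standard error is the sample
    standard deviation (ddof=1) divided by sqrt(n_trials).
    """
    losses = np.asarray(losses, dtype=float)
    n_trials = losses.shape[0]
    mean = losses.mean(axis=0)
    sem = losses.std(axis=0, ddof=1) / np.sqrt(n_trials)
    return mean, sem


# Synthetic stand-in for 5 trials of 3,000 steps each (placeholder data).
rng = np.random.default_rng(0)
steps = np.arange(1, 3001)
losses = 1.0 / np.sqrt(steps) + 0.01 * rng.standard_normal((5, 3000))

mean, sem = mean_and_standard_error(losses)
print(mean[-1], sem[-1])
```

The pair `(mean, sem)` can then be passed to any plotting routine that draws a curve with a shaded band or error bars.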


[11] [11]

irregular

More Discussions. Finally, we provide some additional discussion of this experiment, focusing on three main aspects. Dataset: The dataset we use, HelpSteer (Wang et al., 2023), is nearly ideal for validating our algorithm, as it contains 5 objectives, whereas most other existing datasets have no more than 3. This allows a more realistic sim...


[12] [12]

Overview. In the Large Language Model (LLM) Alignment task, our goal is to align a pretrained LLM with human preferences

Detailed Setup. Overview. In the Large Language Model (LLM) Alignment task, our goal is to align a pretrained LLM with human preferences. Instead of relying on a reward model to guide the LLM, we directly utilize the prompt-response data to finetune the language model. In this section, we introduce our data weighting task for multi-objective LLM alignment....


[13] [13]

Similarly, in this subsection we provide more numerical results on the data weighting task in LLM alignment, along with discussions

Additional Numerical Results. Similarly, in this subsection we provide more numerical results on the data weighting task in LLM alignment, along with discussions. (a) Exploration with more preferences. (b) Different objectives in Alg. 1. Figure 14....


[14] [14]

slightly prefer

Pareto Exploration. Figure 14 presents additional numerical results on Pareto exploration. In Figure 14a, “slightly prefer” refers to selecting λ_s = 0.84 for some s and λ_{s'} = 0.04 for s' ≠ s. While these preferences do not yield improved performance, they still exhibit regular Pareto exploration behavior, as the loss on the focused objective remains rela...


[15] [15]

CUDA out of memory

MTBL Baselines and Discussions. We also consider the aforementioned MTBL algorithms (Ye et al., 2021; Fernando et al., 2023; Ye et al., 2024) as our baselines in Figure 16. Specifically, our algorithm still outperforms the MOML and MoCo algorithms in Pareto exploration, since our approach covers a larger portion of the Pareto front, as d...


[16] [16]

Larger-Scale Numerical Experiments and Results. To further validate the capability of Algorithm 1 on large-scale problems, in this subsection we enlarge the pretrained LLM from Llama-3.2-1B-Instruct to Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct. In Figure 17, we set the preference vector λ as λ_s = 0.96 for some s ∈ [S] and λ_{s'} = 0.01, ∀s' ...


[17] [17]

Experimental Setup. Overview. We consider a multi-task meta-learning problem (Ye et al., 2021; Ji et al., 2021; Qin et al., 2025), where the goal is to train a single model capable of addressing multiple tasks within the MTBL framework. This task is particularly useful for handling heterogeneous datasets using a relatively small-scale model. Specifically...


[18] [18]

Equally Prefer

Numerical Results. Figure 19 demonstrates the effectiveness of our Algorithm 1 in Pareto exploration and its superior performance compared to baselines. Specifically, in Figure 19a, in addition to the preference vectors used in the previous subsections, we also include the “Equally Prefer” preference, where λ = [0.2, 0.2, 0.2, 0.2, 0.2]^⊤. The numerical result...
