pith. machine review for the scientific record.

arxiv: 2601.16399 · v6 · submitted 2026-01-23 · 💻 cs.LG · math.OC

Recognition: no theorem link

A Hessian-Free Actor-Critic Algorithm for Bi-Level Reinforcement Learning with Applications to LLM Fine-Tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:33 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords bi-level optimization · actor-critic algorithm · reinforcement learning · RLHF · LLM fine-tuning · entropy regularization · Polyak-Lojasiewicz condition · convergence analysis

The pith

A single-loop first-order actor-critic algorithm converges to stationary points of unregularized bi-level RL problems using attenuating entropy regularization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a new algorithm for bi-level reinforcement learning where an upper-level objective depends on the optimal policy of a lower-level MDP whose reward is parameterized by the upper level. Existing approaches often require second-order derivatives or nested optimization loops that are sample-inefficient. The proposed method uses a penalty reformulation and adds an attenuating entropy term to the lower-level objective, allowing a single-loop actor-critic update that yields asymptotically unbiased hypergradients. Finite-time convergence to a stationary point of the original problem is proven using a residual analysis under a Polyak-Lojasiewicz condition on the lower level. This matters for practical applications like fine-tuning large language models with human feedback, as it avoids expensive computations while maintaining theoretical guarantees.
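
The attenuating-regularization mechanism can be illustrated on a toy bandit (a sketch under our own assumptions, not the paper's algorithm or its MDP setting): the entropy-regularized optimum is a softmax of the rewards at temperature τ, so a schedule τ_k → 0 removes the regularization bias asymptotically while keeping every iterate's problem smooth.

```python
import numpy as np

# Toy illustration (our sketch, not the paper's algorithm): for a bandit
# with reward vector r, the maximizer of E_pi[r] + tau * H(pi) is the
# softmax pi_tau(a) proportional to exp(r(a)/tau). An attenuating
# schedule tau_k -> 0 therefore drives the regularized optimum to the
# greedy (unregularized) optimum, which is the mechanism behind
# asymptotically unbiased hyper-gradient estimates.

def regularized_argmax(r, tau):
    """Closed-form maximizer of <pi, r> + tau * entropy(pi)."""
    z = np.exp((r - r.max()) / tau)  # max-subtraction for stability
    return z / z.sum()

r = np.array([1.0, 0.5, 0.0])
greedy = np.array([1.0, 0.0, 0.0])  # unregularized optimal policy

gaps = []
for k in [1, 10, 100, 1000]:
    tau_k = 1.0 / np.sqrt(k)  # illustrative schedule tau_k = k^(-1/2)
    pi_k = regularized_argmax(r, tau_k)
    gaps.append(np.abs(pi_k - greedy).sum())

# The regularization bias shrinks monotonically as tau_k attenuates.
```

The schedule exponent (1/2 here) is purely illustrative; in the paper it would have to be coordinated with the actor and critic step sizes to preserve the finite-time guarantee.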

Core claim

The central claim is that the single-loop Hessian-free actor-critic algorithm, equipped with attenuating entropy regularization, converges in finite time and finite samples to a stationary point of the original unregularized bi-level optimization problem. This is established through a novel lower-level residual analysis under a special Polyak-Lojasiewicz condition, without needing to solve the lower-level problem exactly.
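
For orientation, the generic PL inequality for the lower-level maximization would read as follows; the paper assumes a specialized variant whose exact form is not reproduced here:

```latex
% Generic PL inequality for the regularized lower-level objective
% (orientation only; the paper's "special" variant may differ):
\left\| \nabla_\theta J_\tau(x, \pi_\theta) \right\|^2
  \;\ge\; 2\mu \left( \max_{\theta'} J_\tau(x, \pi_{\theta'})
                      - J_\tau(x, \pi_\theta) \right),
\qquad \mu > 0.
```

A bound of this shape is what lets gradient-norm control translate into suboptimality control in the residual analysis, without requiring convexity in θ.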

What carries the argument

The attenuating entropy regularization that produces asymptotically unbiased upper-level hyper-gradient estimates, combined with the lower-level residual analysis under the special Polyak-Lojasiewicz condition.

If this is right

  • The algorithm provides a practical single-loop method for bi-level RL that avoids nested loops and Hessian computations.
  • It applies directly to RLHF for LLM fine-tuning as shown in the GridWorld and happy tweet generation experiments.
  • Finite-time and finite-sample convergence guarantees hold for the unregularized bi-level objective.
  • Sample efficiency improves over methods that require exact lower-level solutions or second-order information.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the special PL condition holds in a wider range of MDPs, the method could extend to other bi-level problems in optimization and control.
  • Applying the algorithm to additional RLHF benchmarks could test how the attenuating regularization affects final policy quality.
  • Similar regularization schedules might improve efficiency in other actor-critic methods for bi-level settings.
  • A testable extension is to measure empirical convergence rates on tasks where the PL condition is approximately satisfied.

Load-bearing premise

The lower-level MDP satisfies a special type of Polyak-Lojasiewicz condition that enables the residual analysis.

What would settle it

Observing divergence or persistent bias in upper-level gradients when the lower-level MDP violates the Polyak-Lojasiewicz condition, for example in environments with multiple local optima or high stochasticity.
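
A hypothetical version of that diagnostic (suggested by the report above, not taken from the paper) is to sample parameters, compute the ratio ‖∇J(θ)‖² / (J* − J(θ)), and track the smallest observed value; the toy objective below stands in for the lower-level regularized return, and on a real MDP the pieces would be policy-gradient estimates plus a suboptimality estimate from a near-optimal policy.

```python
import numpy as np

# Hypothetical PL diagnostic: empirically probe the inequality
#   ||grad J(theta)||^2 >= 2 * mu * (J* - J(theta))
# by sampling parameters and recording the smallest observed ratio.
# J below is a toy concave objective with a known optimum (J* = 0);
# for it the ratio is exactly 4, so the data support mu = 2.

def J(theta):
    return -np.sum((theta - 1.0) ** 2)  # maximized at theta = 1

def grad_J(theta):
    return -2.0 * (theta - 1.0)

def empirical_pl_mu(samples, dim=3, scale=5.0, seed=0):
    """Largest mu consistent with the sampled gradient/suboptimality ratios."""
    rng = np.random.default_rng(seed)
    j_star = 0.0
    ratios = []
    for _ in range(samples):
        theta = rng.uniform(-scale, scale, size=dim)
        subopt = j_star - J(theta)
        if subopt > 1e-8:  # skip numerically optimal points
            ratios.append(np.sum(grad_J(theta) ** 2) / subopt)
    return 0.5 * min(ratios)

mu_hat = empirical_pl_mu(1000)
```

A vanishing `mu_hat` on suboptimal iterates of an actual lower-level MDP would be the kind of evidence that the assumption, and with it the residual bounds, fails there.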

Figures

Figures reproduced from arXiv: 2601.16399 by Alec Koppel, Sihan Zeng, Sujay Bhatt, Sumitra Ganesh.

Figure 1. Algorithm Performance on GridWorld Goal Place.
Figure 2. GridWorld Illustration. The red flag is the goal in the lower-level MDP set by the upper-level decision variable. A state further away from the goal incurs a negative reward with higher magnitude. The green circle indicates the center of the grid, which defines a component of the upper-level objective. The lower-level MDP is defined on a 10×10 grid, where each state corresponds to a position on the grid.…
read the original abstract

We study a structured bi-level optimization problem where the upper-level objective is a smooth function and the lower-level problem is policy optimization in a Markov decision process (MDP). The upper-level decision variable parameterizes the reward of the lower-level MDP, and the upper-level objective depends on the optimal induced policy. Existing methods for bi-level optimization and RL often require second-order information, impose strong regularization at the lower level, or inefficiently use samples through nested-loop procedures. In this work, we propose a single-loop, first-order actor-critic algorithm that optimizes the bi-level objective via a penalty-based reformulation. We introduce into the lower-level RL objective an attenuating entropy regularization, which enables asymptotically unbiased upper-level hyper-gradient estimation without solving the unregularized RL problem exactly. We establish the finite-time and finite-sample convergence of the proposed algorithm to a stationary point of the original, unregularized bi-level optimization problem through a novel lower-level residual analysis under a special type of Polyak-Lojasiewicz condition. We validate the performance of our method through experiments on a GridWorld goal position problem and on happy tweet generation through reinforcement learning from human feedback (RLHF).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a single-loop first-order actor-critic algorithm for a structured bi-level optimization problem in which the upper-level objective is a smooth function of the optimal policy induced by a lower-level MDP whose reward is parameterized by the upper-level variable. The method employs a penalty reformulation together with an attenuating entropy regularization schedule at the lower level to produce asymptotically unbiased hyper-gradient estimates without solving the unregularized lower-level problem exactly. The central theoretical claim is finite-time and finite-sample convergence to a stationary point of the original unregularized bi-level objective, obtained via a novel lower-level residual analysis that invokes a specialized Polyak-Lojasiewicz condition on the MDP. The algorithm is evaluated on a GridWorld goal-position task and on RLHF for happy-tweet generation.

Significance. If the convergence result holds under verifiable conditions, the work would supply a practical Hessian-free single-loop method for bi-level RL problems that arise in LLM fine-tuning, avoiding both nested-loop sampling inefficiency and the bias introduced by permanent strong regularization. The residual-analysis technique and the attenuating-regularization schedule constitute a concrete technical contribution that could be reused in other bi-level RL settings.

major comments (1)
  1. [Main convergence theorem (Section 4)] The finite-time and finite-sample convergence guarantee (abstract and the main convergence theorem) rests on the lower-level MDP satisfying a specialized Polyak-Lojasiewicz condition that is stronger than the standard PL inequality. No derivation is supplied showing that this condition holds for general reward-parameterized MDPs, and the GridWorld and RLHF experiments contain no diagnostic checks (e.g., gradient-norm versus suboptimality plots or Hessian-eigenvalue spectra) that would empirically support the assumption. If the condition fails, the residual bounds used to establish asymptotic unbiasedness of the upper-level hyper-gradient estimates collapse.
minor comments (2)
  1. [Section 3] Notation for the entropy-regularization schedule and the penalty parameter is introduced without a consolidated table of symbols; readers must hunt through the text to recover the precise dependence on iteration index.
  2. [Section 5.2] Figure captions for the RLHF experiment do not report the number of independent runs or the precise hyper-parameter values used for the attenuating schedule, making reproduction difficult.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address the major comment point by point below.

read point-by-point responses
  1. Referee: The finite-time and finite-sample convergence guarantee (abstract and the main convergence theorem) rests on the lower-level MDP satisfying a specialized Polyak-Lojasiewicz condition that is stronger than the standard PL inequality. No derivation is supplied showing that this condition holds for general reward-parameterized MDPs, and the GridWorld and RLHF experiments contain no diagnostic checks (e.g., gradient-norm versus suboptimality plots or Hessian-eigenvalue spectra) that would empirically support the assumption. If the condition fails, the residual bounds used to establish asymptotic unbiasedness of the upper-level hyper-gradient estimates collapse.

    Authors: We thank the referee for this observation. The specialized Polyak-Lojasiewicz condition is explicitly stated as an assumption in Theorem 4.1 (and the abstract) to enable the novel lower-level residual analysis that yields asymptotic unbiasedness of the hyper-gradient estimates under attenuating entropy regularization. We do not claim or derive that the condition holds for arbitrary reward-parameterized MDPs, as this would require further structural assumptions on the MDP (e.g., strong convexity of the regularized objective or linear function approximation). In the revised manuscript we will add a dedicated remark in Section 4 clarifying the role of the assumption, providing sufficient conditions under which it holds (such as MDPs with quadratic reward parameterization), and noting that the finite-time result is conditional on it. We will also include new diagnostic plots (gradient norm vs. suboptimality) for the GridWorld experiment and a brief check for the RLHF task. These changes make the assumption's scope transparent without altering the theorem statement. revision: partial

standing simulated objections not resolved
  • A general derivation proving the specialized PL condition for arbitrary reward-parameterized MDPs (without additional MDP structure) cannot be supplied, as the condition is not universally true and serves only as a sufficient assumption for the residual analysis.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from penalty reformulation and regularization schedule

full rationale

The paper introduces a penalty-based reformulation of the bi-level objective and an attenuating entropy regularization schedule in the lower-level MDP to produce asymptotically unbiased hyper-gradients. Finite-time convergence to a stationary point of the unregularized problem is shown via a novel residual analysis that relies on an explicit assumption of a specialized Polyak-Lojasiewicz condition on the lower-level objective. No equation reduces a claimed prediction or convergence bound to a fitted parameter by construction, and no load-bearing step invokes a self-citation whose content is itself unverified or defined in terms of the present result. The special PL condition is stated as an enabling assumption rather than derived from the algorithm outputs, and the GridWorld/RLHF experiments serve as validation rather than circular fitting. The derivation chain therefore remains independent of its own fitted quantities.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the lower-level MDP satisfying a special Polyak-Lojasiewicz condition and on the attenuating entropy schedule producing asymptotically unbiased hypergradients.

free parameters (1)
  • entropy attenuation schedule
    The rate at which the entropy regularization is reduced is introduced to achieve asymptotic unbiasedness and must be chosen to balance bias and convergence.
axioms (1)
  • domain assumption: The lower-level MDP satisfies a special type of Polyak-Lojasiewicz condition
    Invoked to obtain the novel lower-level residual analysis that yields finite-time convergence.

pith-pipeline@v0.9.0 · 5520 in / 1325 out tokens · 49705 ms · 2026-05-16T11:33:27.656199+00:00 · methodology

discussion (0)

