Recognition: no theorem link
A Hessian-Free Actor-Critic Algorithm for Bi-Level Reinforcement Learning with Applications to LLM Fine-Tuning
Pith reviewed 2026-05-16 11:33 UTC · model grok-4.3
The pith
A single-loop, first-order actor-critic algorithm converges to stationary points of unregularized bi-level RL problems using attenuating entropy regularization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the single-loop Hessian-free actor-critic algorithm, equipped with attenuating entropy regularization, converges in finite time and finite samples to a stationary point of the original unregularized bi-level optimization problem. This is established through a novel lower-level residual analysis under a special Polyak-Lojasiewicz condition, without needing to solve the lower-level problem exactly.
What carries the argument
The attenuating entropy regularization that produces asymptotically unbiased upper-level hyper-gradient estimates, combined with the lower-level residual analysis under the special Polyak-Lojasiewicz condition.
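A minimal sketch of the structure this describes, under assumed notation: $f$ is the upper-level objective, $J_\tau(x,\pi_\theta)$ the $\tau$-entropy-regularized lower-level RL objective, and $1/w$ the penalty weight. A penalty reformulation of the kind the abstract outlines (the paper's exact definition may differ) is

```latex
% Hedged sketch of a penalty reformulation; notation assumed, not quoted.
% The value-gap penalty vanishes exactly at lower-level optimality, and
% \nabla_\theta \mathcal{L}_{w,\tau} = \nabla_\theta f - (1/w)\nabla_\theta J_\tau
% since the inner max is constant in \theta.
\min_{x,\,\theta}\;
  \mathcal{L}_{w,\tau}(x,\pi_\theta)
  \;=\; f(x,\pi_\theta)
  \;+\; \frac{1}{w}\Big(\max_{\theta'} J_{\tau}(x,\pi_{\theta'})
        \;-\; J_{\tau}(x,\pi_\theta)\Big),
\qquad \tau_k \downarrow 0 .
```

Attenuating $\tau_k$ toward zero is what removes the regularization bias: stationary points of the penalized, regularized problem track those of the original unregularized bi-level problem, so the hyper-gradient estimates become asymptotically unbiased without the lower-level problem ever being solved exactly.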
If this is right
- The algorithm provides a practical single-loop method for bi-level RL that avoids nested loops and Hessian computations.
- It applies directly to RLHF for LLM fine-tuning, as demonstrated in the happy tweet generation experiment, with the GridWorld goal-position task validating the method in a simpler reward-design setting.
- Finite-time and finite-sample convergence guarantees hold for the unregularized bi-level objective.
- Sample efficiency improves over methods that require exact lower-level solutions or second-order information.
Where Pith is reading between the lines
- If the special PL condition holds in a wider range of MDPs, the method could extend to other bi-level problems in optimization and control.
- Applying the algorithm to additional RLHF benchmarks could test how the attenuating regularization affects final policy quality.
- Similar regularization schedules might improve efficiency in other actor-critic methods for bi-level settings.
- A testable extension is to measure empirical convergence rates on tasks where the PL condition is approximately satisfied.
Load-bearing premise
The lower-level MDP satisfies a special type of Polyak-Lojasiewicz condition that enables the residual analysis.
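For orientation, a standard PL inequality on the lower level looks as follows (a minimal sketch with PL constant $\mu > 0$; the paper's "special" variant strengthens this, and its exact form is not reproduced in this review):

```latex
% Standard PL inequality for the lower-level objective; the paper assumes
% a specialized, stronger variant.
\max_{\theta'} J_\tau(x, \pi_{\theta'}) \;-\; J_\tau(x, \pi_\theta)
  \;\le\; \frac{1}{2\mu}\,\big\|\nabla_\theta J_\tau(x, \pi_\theta)\big\|^2
  \qquad \text{for all } \theta .
```

Inequalities of this type are known to hold for softmax policies under entropy regularization (Mei et al. 2020, cited by the paper), with a constant that degrades with the smallest action probability, which is why the assumption is plausible but not automatic.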
What would settle it
Observing divergence or persistent bias in upper-level gradients when the lower-level MDP violates the Polyak-Lojasiewicz condition, for example in environments with multiple local optima or high stochasticity.
Original abstract
We study a structured bi-level optimization problem where the upper-level objective is a smooth function and the lower-level problem is policy optimization in a Markov decision process (MDP). The upper-level decision variable parameterizes the reward of the lower-level MDP, and the upper-level objective depends on the optimal induced policy. Existing methods for bi-level optimization and RL often require second-order information, impose strong regularization at the lower level, or inefficiently use samples through nested-loop procedures. In this work, we propose a single-loop, first-order actor-critic algorithm that optimizes the bi-level objective via a penalty-based reformulation. We introduce into the lower-level RL objective an attenuating entropy regularization, which enables asymptotically unbiased upper-level hyper-gradient estimation without solving the unregularized RL problem exactly. We establish the finite-time and finite-sample convergence of the proposed algorithm to a stationary point of the original, unregularized bi-level optimization problem through a novel lower-level residual analysis under a special type of Polyak-Lojasiewicz condition. We validate the performance of our method through experiments on a GridWorld goal position problem and on happy tweet generation through reinforcement learning from human feedback (RLHF).
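For concreteness, a plausible reading of the single-loop structure the abstract describes, in assumed notation: $\alpha_k, \beta_k$ are upper- and lower-level step sizes, $\Pi_{B_V}$ projects the critic onto a ball of radius $B_V$, and $G_{\tau_k}$ is a stochastic TD-style operator evaluated on one transition $(s_k, a_k, s'_k)$. This is a schematic sketch, not the paper's exact updates:

```latex
% Schematic single-loop iteration (assumed, not quoted from the paper):
% one critic step, one actor step, one upper-level step per sample, so no
% nested loops and no second-order terms.
\hat V_{k+1} = \Pi_{B_V}\!\Big(\hat V_k
    + \beta_k\, G_{\tau_k}\big(x_k, \theta_k, \hat V_k, s_k, a_k, s'_k\big)\Big),
\quad
\theta_{k+1} = \theta_k
    - \beta_k\, \widehat{\nabla_\theta \mathcal{L}}_{w,\tau_k}(x_k, \theta_k, \hat V_k),
\quad
x_{k+1} = x_k
    - \alpha_k\, \widehat{\nabla_x \mathcal{L}}_{w,\tau_k}(x_k, \theta_k, \hat V_k).
```

All three recursions advance once per sampled transition, which is what "single-loop" and "Hessian-free" amount to operationally.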
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a single-loop first-order actor-critic algorithm for a structured bi-level optimization problem in which the upper-level objective is a smooth function of the optimal policy induced by a lower-level MDP whose reward is parameterized by the upper-level variable. The method employs a penalty reformulation together with an attenuating entropy regularization schedule at the lower level to produce asymptotically unbiased hyper-gradient estimates without solving the unregularized lower-level problem exactly. The central theoretical claim is finite-time and finite-sample convergence to a stationary point of the original unregularized bi-level objective, obtained via a novel lower-level residual analysis that invokes a specialized Polyak-Lojasiewicz condition on the MDP. The algorithm is evaluated on a GridWorld goal-position task and on RLHF for happy-tweet generation.
Significance. If the convergence result holds under verifiable conditions, the work would supply a practical Hessian-free single-loop method for bi-level RL problems that arise in LLM fine-tuning, avoiding both nested-loop sampling inefficiency and the bias introduced by permanent strong regularization. The residual-analysis technique and the attenuating-regularization schedule constitute a concrete technical contribution that could be reused in other bi-level RL settings.
major comments (1)
- [Main convergence theorem (Section 4)] The finite-time and finite-sample convergence guarantee (abstract and the main convergence theorem) rests on the lower-level MDP satisfying a specialized Polyak-Lojasiewicz condition that is stronger than the standard PL inequality. No derivation is supplied showing that this condition holds for general reward-parameterized MDPs, and the GridWorld and RLHF experiments contain no diagnostic checks (e.g., gradient-norm versus suboptimality plots or Hessian-eigenvalue spectra) that would empirically support the assumption. If the condition fails, the residual bounds used to establish asymptotic unbiasedness of the upper-level hyper-gradient estimates collapse.
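A minimal, self-contained sketch of the kind of diagnostic this comment asks for: exact gradient ascent on an entropy-regularized tabular MDP with a softmax policy, tracking squared gradient norm against suboptimality. The soft policy-gradient formula is the one from Mei et al. (2020), which the paper cites; the random MDP, the step size, and the use of a near-converged run as a proxy for the optimum are illustrative assumptions, and this probes the standard PL form rather than the paper's specialized condition.

```python
# PL-condition diagnostic on a small random tabular MDP with a softmax
# policy and entropy regularization. Everything is computed exactly (no
# sampling), so the ratio below directly probes a PL-type inequality
#   ||grad J||^2 >= 2 * mu * (J* - J)
# along the optimization path.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, tau = 5, 3, 0.9, 0.2

P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a, :] = transition probs
r = rng.uniform(0.0, 1.0, size=(S, A))        # reward table (stand-in for r_x)
rho = np.full(S, 1.0 / S)                     # initial-state distribution

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def value_and_grad(theta):
    """Exact tau-regularized return J(theta) and its policy gradient,
    via the soft policy-gradient formula of Mei et al. (2020)."""
    pi = softmax(theta)                                   # (S, A)
    r_pi = (pi * (r - tau * np.log(pi))).sum(axis=1)      # regularized reward
    P_pi = np.einsum('sa,sat->st', pi, P)                 # state transition matrix
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # regularized values
    Q = r + gamma * P @ V                                  # (S, A)
    adv = Q - tau * np.log(pi) - V[:, None]               # soft advantage
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
    grad = d[:, None] * pi * adv / (1 - gamma)
    return rho @ V, grad

theta = np.zeros((S, A))
Js, g2s = [], []
for _ in range(20000):               # exact gradient ascent on J_tau
    J, g = value_and_grad(theta)
    Js.append(J)
    g2s.append((g ** 2).sum())
    theta += 0.1 * g

Js, g2s = np.array(Js), np.array(g2s)
J_star = Js.max()                    # proxy for the optimum (near-converged run)
subopt = np.maximum(J_star - Js, 1e-12)
mu_hat = g2s / (2.0 * subopt)        # pointwise PL-constant estimates
mu_hat = mu_hat[: int(0.9 * len(mu_hat))]   # drop tail, where the proxy dominates
print(f"min estimated PL constant along the path: {mu_hat.min():.3e}")
# A minimum bounded away from zero suggests PL-like behavior on this
# instance; a constant decaying toward zero as suboptimality shrinks
# would be the failure signal the comment asks the authors to check.
```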
minor comments (2)
- [Section 3] Notation for the entropy-regularization schedule and the penalty parameter is introduced without a consolidated table of symbols; readers must hunt through the text to recover the precise dependence on iteration index.
- [Section 5.2] Figure captions for the RLHF experiment do not report the number of independent runs or the precise hyper-parameter values used for the attenuating schedule, making reproduction difficult.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback on our manuscript. We address the major comment point by point below.
Point-by-point responses
- Referee: The finite-time and finite-sample convergence guarantee (abstract and the main convergence theorem) rests on the lower-level MDP satisfying a specialized Polyak-Lojasiewicz condition that is stronger than the standard PL inequality. No derivation is supplied showing that this condition holds for general reward-parameterized MDPs, and the GridWorld and RLHF experiments contain no diagnostic checks (e.g., gradient-norm versus suboptimality plots or Hessian-eigenvalue spectra) that would empirically support the assumption. If the condition fails, the residual bounds used to establish asymptotic unbiasedness of the upper-level hyper-gradient estimates collapse.
Authors: We thank the referee for this observation. The specialized Polyak-Lojasiewicz condition is explicitly stated as an assumption in Theorem 4.1 (and the abstract) to enable the novel lower-level residual analysis that yields asymptotic unbiasedness of the hyper-gradient estimates under attenuating entropy regularization. We do not claim or derive that the condition holds for arbitrary reward-parameterized MDPs, as this would require further structural assumptions on the MDP (e.g., strong convexity of the regularized objective or linear function approximation). In the revised manuscript we will add a dedicated remark in Section 4 clarifying the role of the assumption, providing sufficient conditions under which it holds (such as MDPs with quadratic reward parameterization), and noting that the finite-time result is conditional on it. We will also include new diagnostic plots (gradient norm vs. suboptimality) for the GridWorld experiment and a brief check for the RLHF task. These changes make the assumption's scope transparent without altering the theorem statement.
revision: partial
- A general derivation proving the specialized PL condition for arbitrary reward-parameterized MDPs (without additional MDP structure) cannot be supplied, as the condition is not universally true and serves only as a sufficient assumption for the residual analysis.
Circularity Check
No significant circularity; the derivation is self-contained, proceeding from the penalty reformulation and the attenuating regularization schedule.
Full rationale
The paper introduces a penalty-based reformulation of the bi-level objective and an attenuating entropy regularization schedule in the lower-level MDP to produce asymptotically unbiased hyper-gradients. Finite-time convergence to a stationary point of the unregularized problem is shown via a novel residual analysis that relies on an explicit assumption of a specialized Polyak-Lojasiewicz condition on the lower-level objective. No equation reduces a claimed prediction or convergence bound to a fitted parameter by construction, and no load-bearing step invokes a self-citation whose content is itself unverified or defined in terms of the present result. The special PL condition is stated as an enabling assumption rather than derived from the algorithm outputs, and the GridWorld/RLHF experiments serve as validation rather than circular fitting. The derivation chain therefore remains independent of its own fitted quantities.
Axiom & Free-Parameter Ledger
free parameters (1)
- entropy attenuation schedule
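The review does not record the paper's actual schedule; a typical attenuating choice, purely for illustration, is

```latex
% Hypothetical attenuation schedule; the paper's exact exponents and the
% coupling to the step sizes \alpha_k, \beta_k are not recorded here.
\tau_k \;=\; \tau_0\,(k+1)^{-a}, \qquad a \in (0, 1] .
```

Any such schedule trades bias against variance: slower attenuation keeps the lower-level problem well conditioned for longer, while faster attenuation removes the regularization bias in the hyper-gradient sooner.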
axioms (1)
- domain assumption: The lower-level MDP satisfies a special type of Polyak-Lojasiewicz condition.
Reference graph
Works this paper leans on
- [1] Mudit Gaur, Amrit Singh Bedi, Raghu Pasupathu, and Vaneet Aggarwal. On the sample complexity bounds in bilevel reinforcement learning. arXiv preprint arXiv:2503.17644, 2025.
- [2] Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming. arXiv preprint arXiv:1802.02246, 2018.
- [3]
- [4] Quan Xiao and Tianyi Chen. Unlocking global optimality in bilevel optimization: A pilot study. arXiv preprint arXiv:2408.16087, 2024.
- [5] Quan Xiao, Hui Yuan, AFM Saif, Gaowen Liu, Ramana Kompella, Mengdi Wang, and Tianyi Chen. A first-order generative bilevel optimization framework for diffusion models. arXiv preprint arXiv:2502.08808, 2025.
- [6] Zou et al. (2019) and Wu et al. (2020): finite-time analyses of stochastic approximation and RL under Markovian sampling, cited where the i.i.d.-sampling variant of the algorithm is related to Markovian sampling.
- [7] Zeng et al. (2022): Lemma 3 supplies the first inequality in the bound on the reweighted objective.
- [8]
- [9]
- [10]
- [11] Zeng et al. (2022): Lemma 5 supplies Lipschitz bounds on the value function and its policy gradient.
- [12] Agarwal et al. (2021): closed-form expression for the policy gradient of the value function.
- [13] Shen et al. (2019): result adapted to express the policy Hessian of the lower-level objective.
- [14]
- [15] Mei et al. (2020): Lemma 10 gives the closed-form gradient of the entropy-regularized RL objective.
- [16] Zeng et al. (2022): used in bounding the sensitivity of the regularized optimal policy parameter to the temperature.
- [17]
- [18] Kwon et al. (2023): Lemma A.2 remains valid without lower-level convexity, with the strong-convexity coefficient replaced by the lower bound on the Hessian norm.
- [19]