Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training

Andi Han; Atsushi Nitanda; Dake Bu; Hau-San Wong; Qingfu Zhang; Taiji Suzuki; Wei Huang

arxiv: 2511.07372 · v3 · submitted 2025-11-10 · 💻 cs.LG

Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training

Dake Bu , Wei Huang , Andi Han , Atsushi Nitanda , Hau-San Wong , Qingfu Zhang , Taiji Suzuki This is my paper

Pith reviewed 2026-05-17 23:14 UTC · model grok-4.3

classification 💻 cs.LG

keywords curriculum learningpost-trainingreinforcement learningchain-of-thought reasoningsample complexitytransformer modelsreasoning treestheoretical analysis

0 comments

The pith

Curriculum post-training achieves polynomial sample complexity for high-accuracy transformer tree-reasoning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a theoretical framework to explain why curriculum techniques outperform non-curriculum methods in the post-training stage of transformers for reasoning tasks. It models the base model's Chain-of-Thought generation as a state-conditioned autoregressive reasoning tree and defines two curriculum types: depth-increasing curricula that extend reasoning horizons step by step, and hint-decreasing curricula that gradually remove partial hints. Under identified sufficient conditions, reinforcement learning finetuning with either curriculum reaches high accuracy using only a polynomial number of samples. The same finetuning without curriculum hits an exponential sample complexity bottleneck, and parallel guarantees are shown for test-time scaling, with simulations providing supporting evidence.

Core claim

By modeling Chain-of-Thought generation as a state-conditioned autoregressive reasoning tree, the authors establish that reinforcement learning finetuning with depth-increasing or hint-decreasing curriculum strategies achieves high accuracy with polynomial sample complexity in transformer tree-reasoning post-training, whereas non-curriculum finetuning encounters an exponential complexity bottleneck, with analogous polynomial guarantees holding for test-time scaling.

What carries the argument

The state-conditioned autoregressive reasoning tree, which represents the autoregressive generation of reasoning steps and enables formalization of curriculum subtasks as progressive depth extension or hint removal to analyze sample complexity.

Load-bearing premise

The base model's Chain-of-Thought generation is accurately captured by a state-conditioned autoregressive reasoning tree and the identified sufficient conditions for curriculum improvements hold in the modeled setting.

What would settle it

Direct measurement of sample complexity in RL finetuning simulations on tree-reasoning tasks, checking whether curriculum versions scale polynomially while non-curriculum versions scale exponentially under the stated conditions.

Figures

Figures reproduced from arXiv: 2511.07372 by Andi Han, Atsushi Nitanda, Dake Bu, Hau-San Wong, Qingfu Zhang, Taiji Suzuki, Wei Huang.

**Figure 1.** Figure 1: An illustration in Liu et al. (2025b) of the Chain-of-Thought for Countdown game, where the goal is to obtain 24 by applying basic arithmetic operations (+, −, ×, ÷) (Φl(·, ·) in our Def. 2) between the current step’s number (e.g., 13, 65 or 72) and some unused number (e.g., 5, 7 or 3) in {3, 5, 7, 13}, targeting 24 as the final outcome. Per Parashar et al. (2025), the difficulty measure of Countdown is th… view at source ↗

**Figure 2.** Figure 2: Reasoning tree for parity problems with d = K = 3 and input x1, x2, x3,EOS. The nodes on the 2nd–4th levels denote hypotheses about which index is the current secret index, corresponding to xi1 , xi2 , and xi3 , respectively. In our parity CoT class f ∈ Fparity 2S-ART, each step actually consists of two actions: (i) choose the next secret index it; (ii) after the choice, apply an XOR over zt−1 and xit as z… view at source ↗

read the original abstract

Recent curriculum techniques in the post-training stage of LLMs have been empirically observed to outperform non-curriculum approaches in improving reasoning performance, yet a principled understanding of their effectiveness and limitations remains incomplete. To bridge this gap, we develop an abstract theoretical framework and identify sufficient conditions under which curriculum post-training yields exponential improvements in sample complexity. To substantiate this framework, we model the base model's Chain-of-Thought generation as a state-conditioned autoregressive reasoning tree, and formalize curriculum subtasks as either depth-increasing curricula that progressively extend reasoning horizons or hint-decreasing curricula that gradually remove partial hints. Our analysis shows that reinforcement learning finetuning with both curriculum strategies achieves high accuracy with polynomial sample complexity, whereas non-curriculum counterpart encounters an exponential complexity bottleneck. We further establish analogous guarantees for test-time scaling. Empirical simulations support our theoretical findings. Code is available at https://github.com/DakeBU/Curriculum-Post-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows that under a state-conditioned autoregressive reasoning tree model, depth-increasing or hint-decreasing curricula turn exponential sample complexity into polynomial for RL post-training on tree reasoning, with matching simulations and released code.

read the letter

The core result is a clean separation: curriculum strategies reach high accuracy in polynomial samples while the non-curriculum case hits an exponential bottleneck, all under sufficient conditions on their tree model. That distinction is the main new piece relative to prior empirical curriculum work in LLM reasoning. They formalize CoT generation as this autoregressive tree, treat curricula as progressive depth or hint removal, and derive the complexity bounds for both training and test-time scaling. The simulations line up with the theory and the code is public, which lets others check the implementation directly. That combination of abstract framework plus reproducible experiments is the part worth taking seriously. The modeling assumptions carry most of the weight. The tree has to capture how base models actually generate chains, subtasks need to be separable enough for the curriculum to help exploration, and the initial model has to satisfy the conditions that let curriculum avoid the exponential trap. If those do not map closely to real transformer post-training, the polynomial-exponential gap stays theoretical. The paper states these are sufficient conditions rather than claiming they always hold, which keeps the claim honest but also limits how far the result travels without further validation. This is for readers who want a theoretical handle on why curriculum helps reasoning post-training and who are comfortable with stylized models. It gives a starting point for thinking about sample efficiency in tree-like reasoning tasks. The work is coherent on its own terms and engages the relevant literature, so it deserves a serious referee even if the assumptions need tightening in revision.

Referee Report

2 major / 2 minor

Summary. The paper develops an abstract theoretical framework for curriculum post-training of transformers on tree-reasoning tasks. It models base-model Chain-of-Thought generation as a state-conditioned autoregressive reasoning tree and identifies sufficient conditions under which depth-increasing and hint-decreasing curricula enable reinforcement-learning finetuning to reach high accuracy with polynomial sample complexity, while the non-curriculum baseline encounters an exponential bottleneck. Analogous polynomial guarantees are claimed for test-time scaling. The claims are supported by simulations and released code.

Significance. If the modeling assumptions and sufficient conditions are realistic and non-vacuous, the work supplies a concrete theoretical account of why curriculum strategies can produce exponential sample-complexity gains in LLM reasoning post-training. The explicit tree model, the polynomial-vs-exponential separation, and the provision of reproducible simulation code are positive features that would strengthen the literature on curriculum learning for transformers.

major comments (2)

[§3] §3 (State-conditioned autoregressive reasoning tree): The central separation between polynomial and exponential sample complexity rests on this modeling choice together with the listed sufficient conditions. The manuscript should supply a precise statement of how the tree captures typical transformer failure modes (e.g., attention dilution over long chains, compounding token-level errors) and should include a short robustness argument showing that modest relaxations of the perfect-subtask-separability assumption do not collapse the claimed separation.
[§4] §4 (Sufficient conditions for depth-increasing and hint-decreasing curricula): The conditions appear to encode strong assumptions on initial model capabilities, exploration structure, and subtask separability. Because these conditions are load-bearing for the polynomial-complexity claim, the paper must demonstrate that they are satisfiable by standard RL finetuning objectives (e.g., PPO or GRPO) on realistic transformer architectures rather than only on the abstract tree model.

minor comments (2)

Define the quantitative threshold for “high accuracy” used in the complexity statements and report it consistently in both theorems and simulation figures.
Add error bars or multiple random seeds to the simulation plots that compare curriculum and non-curriculum sample complexity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. We address each major comment below and describe the revisions we will incorporate.

read point-by-point responses

Referee: [§3] §3 (State-conditioned autoregressive reasoning tree): The central separation between polynomial and exponential sample complexity rests on this modeling choice together with the listed sufficient conditions. The manuscript should supply a precise statement of how the tree captures typical transformer failure modes (e.g., attention dilution over long chains, compounding token-level errors) and should include a short robustness argument showing that modest relaxations of the perfect-subtask-separability assumption do not collapse the claimed separation.

Authors: We agree that an explicit link between the abstract model and concrete transformer failure modes will improve clarity. In the revised manuscript we will add a paragraph to §3 that explains how the state-conditioned autoregressive reasoning tree encodes attention dilution (through progressive degradation of the conditioning state across depth) and compounding token-level errors (through the multiplicative effect of branching factors). We will also insert a short robustness subsection showing that, under modest relaxations of perfect subtask separability (e.g., a constant-probability subtask misidentification error), the polynomial sample-complexity bound continues to hold up to an additional logarithmic factor, thereby preserving the exponential separation from the non-curriculum baseline. revision: yes
Referee: [§4] §4 (Sufficient conditions for depth-increasing and hint-decreasing curricula): The conditions appear to encode strong assumptions on initial model capabilities, exploration structure, and subtask separability. Because these conditions are load-bearing for the polynomial-complexity claim, the paper must demonstrate that they are satisfiable by standard RL finetuning objectives (e.g., PPO or GRPO) on realistic transformer architectures rather than only on the abstract tree model.

Authors: The sufficient conditions are stated at a level of generality intended to be compatible with standard RL objectives. We acknowledge that the current analysis and simulations operate on the abstract tree model rather than full-scale realistic transformers. In the revision we will expand the discussion in §4 to show how the conditions align with the empirical behavior observed under PPO-style updates in our transformer-based simulations and to argue that the assumptions can be approximately realized in practice. A comprehensive empirical validation on large language models lies beyond the scope of this work and is noted as a direction for future research. revision: partial

Circularity Check

0 steps flagged

No circularity: conditional theoretical analysis under explicit modeling assumptions

full rationale

The paper constructs an abstract framework by positing a state-conditioned autoregressive reasoning tree model for CoT generation and then derives sufficient conditions under which depth-increasing and hint-decreasing curricula yield polynomial sample complexity for RL finetuning while the non-curriculum case requires exponential samples. This is a standard conditional proof structure: the claimed separation holds inside the model once the sufficient conditions are granted. No step reduces a prediction to a fitted parameter by construction, no self-citation chain is load-bearing for the central claim, and no ansatz is smuggled via prior work. The derivation remains self-contained as a mathematical argument on the stated tree model and conditions; empirical simulations are presented only as supporting evidence, not as the source of the complexity bounds.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the abstraction of CoT generation into a state-conditioned autoregressive reasoning tree and on the sufficient conditions for curriculum benefits; these constitute the primary modeling assumptions rather than derived quantities.

axioms (1)

domain assumption The base model's Chain-of-Thought generation can be modeled as a state-conditioned autoregressive reasoning tree.
This modeling choice enables formalization of depth-increasing and hint-decreasing curricula and the subsequent complexity analysis.

invented entities (1)

State-conditioned autoregressive reasoning tree no independent evidence
purpose: Abstract model of CoT generation to analyze curriculum effects on sample complexity.
Introduced as the core abstraction for the theoretical framework; no independent empirical validation outside the paper's simulations is described.

pith-pipeline@v0.9.0 · 5476 in / 1217 out tokens · 32717 ms · 2026-05-17T23:14:47.807883+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

model the base model's Chain-of-Thought generation as a state-conditioned autoregressive reasoning tree... curriculum subtasks as either depth-increasing curricula... or hint-decreasing curricula
IndisputableMonolith/Foundation/DimensionForcing.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

exponential decay in success probability... Θ(d^{-(l+1)})

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages

[1]

arXiv:2501.12997

URLhttps://arxiv.org/abs/2501.12997. arXiv:2501.12997. Yoshua Bengio, J´ erˆ ome Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. InProceedings of the 26th ICML, pages 41–48, 2009. 10 Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training Adam Block and Yury Polyanskiy. The sample complexity of approximate rejection s...

work page arXiv 2009
[2]

arXiv:2303.07971

URLhttps://arXiv.org/abs/2303.07971. arXiv:2303.07971. Wassily Hoeffding. Probability inequalities for sums of bounded random variables.Journal of the American Statistical Association, 58(301):13–30, 1963. Kurt Hornik. Approximation capabilities of multilayer feedforward networks.Neural networks, 4(2):251–257, 1991. Jianhao Huang, Zixuan Wang, and Jason D...

work page arXiv 1963
[3]

return a uniformly random iterate

Suppose the sequence of optimal policies{π ⋆ k}K k=1 satisfies assumptions in Eq.(1), (2), namely Ccov(π⋆ k|π⋆ k−1) = Θ(C ⋆),∀k∈[K], Ccov(π⋆ k2 |π⋆ k1) = Θ((C ⋆)k2−k1). For anyϵ >0andδ∈(0,1), by applying SpannerSampling in Foster et al. (2025)sequentiallythrough the curriculum with appropriate choices ofT prompt,T k span, andT k exp for each task, the cur...

work page 2025
[4]

= Θ((C ⋆)K)(by 2), and Theorem 3.1 gives T direct comp (ϵ, δ) = ˜O (C ⋆)K ·Γ(ϵ, δ) . Step 4 (Comparing the two strategies).Combining the above displays yields T direct comp (ϵ, δ) T curriculumcomp (ϵ, δ) = Ω (C ⋆)K−1 K , i.e., an exponential-to-polynomial improvement as a function ofK, as claimed. This recovers the abstract form of Theorem. 1 with the con...

work page 2025
[5]

curricu- lum efficiency factor

= Θ (C ⋆) bK−c ,1≤b≪K, c≪K. Then, under Spanner Sampling, the curriculum complexity satisfies T curriculum comp = ˜O K·max k∈[K] (C ⋆)pk ·Γ(ϵ, δ) ≪ ˜Θ (C ⋆)bK−c ·Γ(ϵ, δ) =T direct comp . Thus, even under relaxed conditions, the curriculum strategy with Spanner Sampling retains an exponential advantage in computational complexity compared to direct estimat...

work page 2025
[6]

(exact logit update for affine logits) For anyj∈ I l, we have sl j;W (t+1) =s l j;W (t) +ηE (t) τ Rk⋆ α(t) l (j)η (t) l (j) .(33)

work page
[7]

Then ∆l W(t+1) ≥∆ l W(t) +η E(t) τ [Rk⋆ α(t) l (il)η (t) l (il)]−sup j̸=i l E(t) τ [Rk⋆ α(t) l (j)η (t) l (j)] +O(η 2)

(subgradient inequality for the max) Let the active competitor set beA max l := arg maxj̸=il sl j;W (t−1) . Then ∆l W(t+1) ≥∆ l W(t) +η E(t) τ [Rk⋆ α(t) l (il)η (t) l (il)]−sup j̸=i l E(t) τ [Rk⋆ α(t) l (j)η (t) l (j)] +O(η 2). (34) If, in a neighborhood ofW (t−1), the active maximizer is unique and remains unchanged (there existsj max l ∈ Amax l that sta...

work page
[8]

Let˜αl(j) := exp(sl(j))P q̸=il exp(sl(q)) be the softmax normalized only over competitors, evaluated atW (t−1)

(update of a smooth lower bound) Define the competitors’ log-sum-exp lower bound ϕl(W) :=s l(il;W)−log X j̸=il exp sl(j;W) ≤∆ l(W). Let˜αl(j) := exp(sl(j))P q̸=il exp(sl(q)) be the softmax normalized only over competitors, evaluated atW (t−1). Then ϕl W(t+1) =ϕ l W(t) +η E(t) τ [Rk⋆ α(t) l (il)η (t) l (il)]− P j̸=il ˜α(t) l (j)E (t) τ [Rk⋆ α(t) l (j)η (t)...

work page
[9]

Writing the one-step termination probability atr+1under a conditionE∈ {A r, Br, Dr}as θ(t) r+1(E) :=P (t) ir+1 = EOS E , the probability of the eventU r decomposes as P(t)(Ur) =P (t)(Ar)θ (t) r+1(Ar)| {z } true-path contribution + 1 2 P(t)(Br)θ (t) r+1(Br) + 1 2 P(t)(Dr)θ (t) r+1(Dr)| {z } spurious (reward-hacking) contribution .(50) Moreover, writingp (t...

work page
[10]

(Objective) Unbiasedness and variance: E bJ k⋆ n =J k⋆ REINFORCE,Var bJ k⋆ n = 1 n psucc(1−p succ)≤ 1 4n .(54) In particular, under near-uniform initialization and using equation 31, we have psucc ≤ρ spur + (1−ρ spur) k⋆+1Y l=1 1 dl (ifd l = Θ(d), p succ ≤ρ spur + (1−ρ spur) Θ(d−(k⋆+1))),(55) so that Var bJ k⋆ n ≤ 1 n ρspur + (1−ρ spur)d −(k⋆+1) 1−ρ spur ...

work page
[11]

Letµ:=∇ Wk⋆ J k⋆ REINFORCE be the population gradient

(Gradient) Unbiasedness and covariance. Letµ:=∇ Wk⋆ J k⋆ REINFORCE be the population gradient. Then E bgk⋆ n =µ,Cov bgk⋆ n = 1 n Ex,τ h Rk⋆ k⋆+1X l,t=1 ∇logπ l ⊗ ∇logπ t i | {z } =:Σ pop −µ⊗µ ,(57) whereπ l abbreviatesπ Wk⋆ (il |x, ˆpz1:l)and⊗is the outer product in parameter space. Furthermore, using Lemma 3, write a block-orthogonal expansion ∇logπ l = ...

work page 2024

[1] [1]

arXiv:2501.12997

URLhttps://arxiv.org/abs/2501.12997. arXiv:2501.12997. Yoshua Bengio, J´ erˆ ome Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. InProceedings of the 26th ICML, pages 41–48, 2009. 10 Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training Adam Block and Yury Polyanskiy. The sample complexity of approximate rejection s...

work page arXiv 2009

[2] [2]

arXiv:2303.07971

URLhttps://arXiv.org/abs/2303.07971. arXiv:2303.07971. Wassily Hoeffding. Probability inequalities for sums of bounded random variables.Journal of the American Statistical Association, 58(301):13–30, 1963. Kurt Hornik. Approximation capabilities of multilayer feedforward networks.Neural networks, 4(2):251–257, 1991. Jianhao Huang, Zixuan Wang, and Jason D...

work page arXiv 1963

[3] [3]

return a uniformly random iterate

Suppose the sequence of optimal policies{π ⋆ k}K k=1 satisfies assumptions in Eq.(1), (2), namely Ccov(π⋆ k|π⋆ k−1) = Θ(C ⋆),∀k∈[K], Ccov(π⋆ k2 |π⋆ k1) = Θ((C ⋆)k2−k1). For anyϵ >0andδ∈(0,1), by applying SpannerSampling in Foster et al. (2025)sequentiallythrough the curriculum with appropriate choices ofT prompt,T k span, andT k exp for each task, the cur...

work page 2025

[4] [4]

= Θ((C ⋆)K)(by 2), and Theorem 3.1 gives T direct comp (ϵ, δ) = ˜O (C ⋆)K ·Γ(ϵ, δ) . Step 4 (Comparing the two strategies).Combining the above displays yields T direct comp (ϵ, δ) T curriculumcomp (ϵ, δ) = Ω (C ⋆)K−1 K , i.e., an exponential-to-polynomial improvement as a function ofK, as claimed. This recovers the abstract form of Theorem. 1 with the con...

work page 2025

[5] [5]

curricu- lum efficiency factor

= Θ (C ⋆) bK−c ,1≤b≪K, c≪K. Then, under Spanner Sampling, the curriculum complexity satisfies T curriculum comp = ˜O K·max k∈[K] (C ⋆)pk ·Γ(ϵ, δ) ≪ ˜Θ (C ⋆)bK−c ·Γ(ϵ, δ) =T direct comp . Thus, even under relaxed conditions, the curriculum strategy with Spanner Sampling retains an exponential advantage in computational complexity compared to direct estimat...

work page 2025

[6] [6]

(exact logit update for affine logits) For anyj∈ I l, we have sl j;W (t+1) =s l j;W (t) +ηE (t) τ Rk⋆ α(t) l (j)η (t) l (j) .(33)

work page

[7] [7]

Then ∆l W(t+1) ≥∆ l W(t) +η E(t) τ [Rk⋆ α(t) l (il)η (t) l (il)]−sup j̸=i l E(t) τ [Rk⋆ α(t) l (j)η (t) l (j)] +O(η 2)

(subgradient inequality for the max) Let the active competitor set beA max l := arg maxj̸=il sl j;W (t−1) . Then ∆l W(t+1) ≥∆ l W(t) +η E(t) τ [Rk⋆ α(t) l (il)η (t) l (il)]−sup j̸=i l E(t) τ [Rk⋆ α(t) l (j)η (t) l (j)] +O(η 2). (34) If, in a neighborhood ofW (t−1), the active maximizer is unique and remains unchanged (there existsj max l ∈ Amax l that sta...

work page

[8] [8]

Let˜αl(j) := exp(sl(j))P q̸=il exp(sl(q)) be the softmax normalized only over competitors, evaluated atW (t−1)

(update of a smooth lower bound) Define the competitors’ log-sum-exp lower bound ϕl(W) :=s l(il;W)−log X j̸=il exp sl(j;W) ≤∆ l(W). Let˜αl(j) := exp(sl(j))P q̸=il exp(sl(q)) be the softmax normalized only over competitors, evaluated atW (t−1). Then ϕl W(t+1) =ϕ l W(t) +η E(t) τ [Rk⋆ α(t) l (il)η (t) l (il)]− P j̸=il ˜α(t) l (j)E (t) τ [Rk⋆ α(t) l (j)η (t)...

work page

[9] [9]

Writing the one-step termination probability atr+1under a conditionE∈ {A r, Br, Dr}as θ(t) r+1(E) :=P (t) ir+1 = EOS E , the probability of the eventU r decomposes as P(t)(Ur) =P (t)(Ar)θ (t) r+1(Ar)| {z } true-path contribution + 1 2 P(t)(Br)θ (t) r+1(Br) + 1 2 P(t)(Dr)θ (t) r+1(Dr)| {z } spurious (reward-hacking) contribution .(50) Moreover, writingp (t...

work page

[10] [10]

(Objective) Unbiasedness and variance: E bJ k⋆ n =J k⋆ REINFORCE,Var bJ k⋆ n = 1 n psucc(1−p succ)≤ 1 4n .(54) In particular, under near-uniform initialization and using equation 31, we have psucc ≤ρ spur + (1−ρ spur) k⋆+1Y l=1 1 dl (ifd l = Θ(d), p succ ≤ρ spur + (1−ρ spur) Θ(d−(k⋆+1))),(55) so that Var bJ k⋆ n ≤ 1 n ρspur + (1−ρ spur)d −(k⋆+1) 1−ρ spur ...

work page

[11] [11]

Letµ:=∇ Wk⋆ J k⋆ REINFORCE be the population gradient

(Gradient) Unbiasedness and covariance. Letµ:=∇ Wk⋆ J k⋆ REINFORCE be the population gradient. Then E bgk⋆ n =µ,Cov bgk⋆ n = 1 n Ex,τ h Rk⋆ k⋆+1X l,t=1 ∇logπ l ⊗ ∇logπ t i | {z } =:Σ pop −µ⊗µ ,(57) whereπ l abbreviatesπ Wk⋆ (il |x, ˆpz1:l)and⊗is the outer product in parameter space. Furthermore, using Lemma 3, write a block-orthogonal expansion ∇logπ l = ...

work page 2024