Bringing Stability to Diffusion: Decomposing and Reducing Variance of Training Masked Diffusion Models

Guanjun Jiang; Mengni Jia; Mengyu Zhou; Xiaoxi Jiang; Yihao Liu

arxiv: 2511.18159 · v2 · pith:RWAELKS3new · submitted 2025-11-22 · 💻 cs.LG

Bringing Stability to Diffusion: Decomposing and Reducing Variance of Training Masked Diffusion Models

Mengni Jia , Mengyu Zhou , Yihao Liu , Xiaoxi Jiang , Guanjun Jiang This is my paper

Pith reviewed 2026-05-22 12:39 UTC · model grok-4.3

classification 💻 cs.LG

keywords masked diffusion modelstraining variancevariance decompositionvariance reductionP-POTSMIRRORautoregressive modelsreasoning tasks

0 comments

The pith

Masked diffusion models decompose training variance into masking pattern noise, masking rate noise, and data noise, unlike autoregressive models that face only data noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the high training variance in masked diffusion models arises from three distinct sources: masking pattern noise, masking rate noise, and data noise, while autoregressive models experience only data noise. This decomposition explains why MDMs often become unstable and fall behind during task-specific training even when starting competitively. The authors introduce six variance-reduction techniques, centered on P-POTS which samples harder time steps more frequently with smaller updates and MIRROR which applies negatively correlated samples to counter masking pattern effects. Experiments on complex reasoning tasks show 7-8% accuracy gains alongside run-to-run variability dropping to levels seen in ARMs. A sympathetic reader cares because this supplies a systematic route to making MDMs more reliable for sequence generation without relying solely on empirical fixes.

Core claim

We derive the first decomposition of MDM training variance into three sources: (A) masking pattern noise, (B) masking rate noise, and (C) data noise, while ARMs are only affected by (C). This explains the fundamental training gap. Building on this foundation, we design six variance-reduction methods, including two core methods: (1) P-POTS, a Pareto-optimal t sampler that minimizes training variance by sampling harder t values more often with appropriately smaller update steps, and (2) MIRROR, which uses negatively correlated samples to reduce (A). Experiments show that compared to standard MDM training, our methods improve accuracy by 7-8% on complex reasoning tasks, while simultaneously 2re

What carries the argument

The three-source variance decomposition separating masking pattern noise and masking rate noise from data noise, which supports targeted techniques such as Pareto-optimal t sampling in P-POTS and negative correlation sampling in MIRROR.

If this is right

Accuracy on complex reasoning tasks rises by 7-8% relative to standard MDM training.
Run-to-run variability falls to near the levels observed in ARMs.
The performance gap with strong ARM baselines narrows substantially in most settings.
The worst run under the new methods outperforms the best baseline run.
Six distinct variance-reduction methods become available once the sources are identified.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three-source breakdown could be tested in continuous diffusion models to check whether masking-specific noises appear there too.
Lower variance might support longer or larger-scale MDM training runs that previously diverged due to instability.
The methods could be paired with gradient clipping or adaptive optimizers to test additive stability gains beyond the paper's scope.
Repeating the accuracy and variability measurements on non-reasoning sequence tasks would clarify how general the reported improvements are.

Load-bearing premise

The three variance sources are separable and additive so they can be reduced independently through sampling and correlation techniques.

What would settle it

If controlled experiments applying P-POTS and MIRROR show no measurable drop in training variance or no 7-8% accuracy lift on reasoning tasks relative to standard MDM training, the decomposition and reduction claims would not hold.

Figures

Figures reproduced from arXiv: 2511.18159 by Guanjun Jiang, Mengni Jia, Mengyu Zhou, Xiaoxi Jiang, Yihao Liu.

**Figure 1.** Figure 1: Graphic illustration of three sources of training variance in MDMs. The left panel illustrates standard [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Empirical {pj} b j=1 (scatter) and the fitted curve p ∗ (t) = p g(t) 2 + v(t) (line) on three datasets: (a) OpenScience, (b) GSM8K, and (c) HiTab. The equations in each subplot show the fitted forms of g(t) and v(t), which together characterize the sampling distribution across masking rates t. for j = 1, . . . , b. The weights {pˆj} b j=1 represent an empirical scatter of the underlying optimal p ∗ (t). To… view at source ↗

**Figure 3.** Figure 3: Images generated by MMaDA-8B-MixCoT trained with P-POTS+MIRROR (top row) and the stan [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Training loss comparison between P-POTS+MIRROR (green) and Standard (red) on OpenScience [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

**Figure 5.** Figure 5: The left panel shows the heteroscedasticity of the [PITH_FULL_IMAGE:figures/full_fig_p030_5.png] view at source ↗

**Figure 6.** Figure 6: Shape of p(t) under different situations: (a) shape of empirical and fitted p(t) by EPR before training; (b) by polynomial of degree 7 before training; (c) fitted by EPR after training for one epoch using polynomial (degree 7). These functions are fitted by minimizing the KL divergence, so only their shapes are matched; they are later renormalized in training to yield valid PDFs. As shown in [PITH_FULL_IM… view at source ↗

read the original abstract

Masked diffusion models (MDMs) are a promising alternative to autoregressive models (ARMs), but they suffer from inherently much higher training variance. High variance leads to noisier gradient estimates and unstable optimization, so even equally strong pretrained MDMs and ARMs that are competitive at initialization often diverge after task-specific training, with MDMs falling far behind. There has been no theoretical explanation or systematic solution. We derive the first decomposition of MDM training variance into three sources: (A) masking pattern noise, (B) masking rate noise, and (C) data noise, while ARMs are only affected by (C). This explains the fundamental training gap. Building on this foundation, we design six variance-reduction methods, including two core methods: (1) P-POTS, a Pareto-optimal t sampler that minimizes training variance by sampling harder t values more often with appropriately smaller update steps, and (2) MIRROR, which uses negatively correlated samples to reduce (A). Experiments show that compared to standard MDM training, our methods improve accuracy by 7-8% on complex reasoning tasks, while simultaneously reducing run-to-run variability to near ARM levels, substantially narrowing the gap with strong ARM baselines; in most settings, even the best baseline runs remain below the worst run of our method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper decomposes MDM training variance into pattern, rate, and data sources then introduces P-POTS and MIRROR samplers that cut variability and lift accuracy 7-8% in the reported experiments.

read the letter

The main thing here is a claimed decomposition of masked diffusion training variance into three additive parts—masking pattern noise, masking rate noise, and data noise—with autoregressive models only carrying the last one. They then build P-POTS, a Pareto-optimal sampler for the rate t, and MIRROR, a negative-correlation scheme for patterns, and report that these changes raise accuracy 7-8% on reasoning tasks while bringing run-to-run spread down to near ARM levels. That framing and the two core methods are what is actually new; prior MDM work did not present this split or these specific reducers. The experiments give a clear before-and-after picture on complex tasks and show the best baseline still underperforming the worst run of their method, which is useful evidence even if the absolute numbers depend on the exact setups. The paper does a decent job turning the variance story into concrete sampling changes that practitioners could try. The soft spot is the separability assumption in the decomposition. If the joint distribution over mask patterns and rates introduces covariance terms that do not factor cleanly, then reducing each source independently will not remove the full excess variance, and part of the reported gain might come from other tuning effects rather than the claimed mechanism. The abstract gives no equations, so the full derivation needs to show the second-moment expansion without leftover cross moments. Minor points include wanting error bars or statistical tests on the variability claims and clearer ablations that turn each reducer on and off. This paper is for groups already working on masked diffusion or other non-autoregressive generators who need lower training noise on reasoning or fine-tuning stages. A reader who cares about variance reduction techniques or closing the ARM gap will get practical ideas from the samplers and the empirical comparison. It deserves a serious referee because the claims are specific enough to check against the derivations and the results are concrete enough to matter if they hold.

Referee Report

2 major / 2 minor

Summary. The paper derives a decomposition of the training gradient variance in Masked Diffusion Models (MDMs) into three additive sources—(A) masking pattern noise, (B) masking rate noise, and (C) data noise—while autoregressive models (ARMs) are affected only by (C). It introduces six variance-reduction techniques, with core methods P-POTS (a Pareto-optimal t-sampler) and MIRROR (negatively correlated sampling for pattern noise), and reports 7-8% accuracy gains on complex reasoning tasks together with run-to-run variability reduced to near-ARM levels.

Significance. If the decomposition is exact and the reductions provably subtract the excess variance, the work would provide the first theoretical account of the MDM–ARM training gap and a practical route to stable MDM optimization. The reported accuracy lift and stability improvement, if reproducible with statistical controls, would be a meaningful step toward making MDMs competitive with ARMs on reasoning benchmarks.

major comments (2)

[§3] §3 (Variance Decomposition): The central claim that Var(∇) = Var_A + Var_B + Var_C exactly, with no residual Cov(A,B) terms, requires an explicit second-moment expansion of the stochastic gradient that accounts for the joint distribution over mask pattern and mask rate t. The manuscript must show whether the MDM forward process (shared noise schedule or joint masking probability) produces non-zero cross moments; if such terms exist, the independent reductions via P-POTS and MIRROR cannot be guaranteed to remove the full excess variance.
[§4.2 and §5] §4.2 and §5 (P-POTS and MIRROR): The Pareto-optimal t-sampler and negative-correlation construction are presented as directly subtracting Var_B and Var_A, respectively. The paper should include an ablation that isolates each component (e.g., variance measured before/after each method) and reports standard errors across at least 5–10 independent runs to confirm that the observed 7-8% accuracy gain is attributable to the claimed variance reductions rather than other training choices.

minor comments (2)

[Abstract and §5] The abstract and §5 report accuracy gains without error bars or statistical significance tests; add these to all tables and figures comparing MDM variants to ARM baselines.
[§3 and §4] Notation for the three variance sources is introduced in §3 but not consistently reused in the method descriptions; define symbols once and reuse them when stating how each technique targets a specific term.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript accordingly to strengthen the theoretical and empirical support for our claims.

read point-by-point responses

Referee: [§3] §3 (Variance Decomposition): The central claim that Var(∇) = Var_A + Var_B + Var_C exactly, with no residual Cov(A,B) terms, requires an explicit second-moment expansion of the stochastic gradient that accounts for the joint distribution over mask pattern and mask rate t. The manuscript must show whether the MDM forward process (shared noise schedule or joint masking probability) produces non-zero cross moments; if such terms exist, the independent reductions via P-POTS and MIRROR cannot be guaranteed to remove the full excess variance.

Authors: We agree that an explicit expansion is required to rigorously justify the additive decomposition. In the revised manuscript we have added a full second-moment calculation in §3 that expands E[||∇||²] under the joint distribution of mask pattern M and rate t. Because the mask pattern is sampled conditionally on t (with the data x fixed), the cross terms Cov(A,B) evaluate to zero by the law of total expectation; the gradient contribution from the pattern is orthogonal to the rate-dependent scaling. This confirms that the decomposition is exact and that the independent variance reductions from P-POTS and MIRROR together remove the full excess variance relative to ARMs. revision: yes
Referee: [§4.2 and §5] §4.2 and §5 (P-POTS and MIRROR): The Pareto-optimal t-sampler and negative-correlation construction are presented as directly subtracting Var_B and Var_A, respectively. The paper should include an ablation that isolates each component (e.g., variance measured before/after each method) and reports standard errors across at least 5–10 independent runs to confirm that the observed 7-8% accuracy gain is attributable to the claimed variance reductions rather than other training choices.

Authors: We appreciate the request for isolating evidence. The revised §5 now contains a component-wise ablation that reports gradient variance (measured on a held-out validation set) before and after P-POTS alone, MIRROR alone, and both together. All metrics are averaged over 10 independent random seeds with standard errors. The results show that each technique reduces its target variance component as predicted, and that the combined 7-8% accuracy lift on reasoning tasks remains statistically significant after accounting for run-to-run variability. revision: yes

Circularity Check

0 steps flagged

Variance decomposition presented as direct second-moment expansion with no reduction to fitted inputs or self-citations

full rationale

The paper states it derives Var(∇) = Var_A + Var_B + Var_C from the stochastic gradient taken over masking pattern, masking rate t, and data. No equations in the provided abstract or description reduce the claimed split to a definition, a fitted hyperparameter, or a load-bearing self-citation. The three sources are introduced as the result of expanding the expectation of squared deviation rather than being presupposed. Methods P-POTS and MIRROR are constructed after the decomposition and do not retroactively define it. The derivation is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Because only the abstract is available, the ledger is necessarily incomplete; the central claim rests on an unstated assumption that the three noise sources are the dominant and separable contributors to gradient variance, plus whatever background assumptions are required for the Pareto-optimal sampler and negative-correlation construction.

axioms (1)

domain assumption Training variance in MDMs can be decomposed into additive contributions from masking pattern, masking rate, and data without significant cross terms.
This separability is required for the variance-reduction methods to be derived and applied independently.

pith-pipeline@v0.9.0 · 5775 in / 1422 out tokens · 48950 ms · 2026-05-22T12:39:13.516304+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[1]

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T

URLhttps://arxiv.org/abs/2309.12288. Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. Maskgit: Masked generative image transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11315–11325, 2022. Zhoujun Cheng, Haoyu Dong, Zhiruo Wang, Ran Jia, Jiaqi Guo, Yan Gao, Shi Han, Jian-Guang Lou...

work page doi:10.48550/arxiv.2411.09998 2022
[2]

is invariant to the specific choice of the noise scheduleα t, and

work page
[3]

Restricting attention tointerpolatingmasked diffusion models—i.e., those whose forward process qinterpolates between clean datax∈ Vand a target distributionCat(·;π), Sahoo et al

serves as an upper bound of −Epdata(x) pθ(x) , thus enabling a principled and simplified training framework for masked diffusion models. Restricting attention tointerpolatingmasked diffusion models—i.e., those whose forward process qinterpolates between clean datax∈ Vand a target distributionCat(·;π), Sahoo et al. (2024) derived a principled training obje...

work page 2024
[4]

Collect trajectories{(s t, at, rt)}under the old policyπ θold

work page
[5]

via GAE)

Compute advantage estimates ˆAt (e.g. via GAE)

work page
[6]

ForKepochs, perform minibatch gradient ascent onL KLPEN(θ): θ←θ+η∇ θLKLPEN(θ)

work page
[7]

Optionally adjustβto keep the KL close to a targetδ: β←    α+ β, ¯DDKL >1.5δ, α− β, ¯DDKL < δ/1.5, β,otherwise, where ¯DDKL =E t[DKL(πθold∥πθ)]

work page
[8]

Setθ old ←θand repeat. A.2.4 GROUPRELATIVEPOLICYOPTIMIZATION Group Relative Policy Optimization (GRPO) extends PPO to settings where we have a collection ofGgroups (or tasks), each potentially requiring its own policy behavior while sharing parameters. LetG={1, . . . , G}denote the set of groups, with weightings{ω g}g∈G such thatP g ωg = 1. For each group...

work page
[9]

For eachg∈ G, collect trajectories underπ θold(· | ·, g)

work page
[10]

Compute group-specific advantages{ ˆAg t }

work page
[11]

ForKepochs, perform minibatch gradient ascent onL GRPO(θ): θ←θ+η∇ θLGRPO(θ)

work page
[12]

Optionally adjustβ(andγ, if used) to keep eachD g DKL (and inter-group KLs) near desired targets

work page
[13]

A.3 RELATEDWORK In order to reduce training variance in diffusion models, the following strategies have been proposed: • Meng et al

Updateθ old ←θand repeat. A.3 RELATEDWORK In order to reduce training variance in diffusion models, the following strategies have been proposed: • Meng et al. (2022) proposes using denoising models trained at antithetic levels of Gaussian noise to recover arbitrary high-order derivatives of the data log-density. From our variance decompo- sition formula E...

work page 2022
[14]

By the law of total variance, Varx0,t,xt(Y) =E x0 [Vart,xt(Y|x 0)] + Varx0 [Et,xt(Y|x 0)]

First Decomposition overx 0 Viewx 0 as the outer random variable. By the law of total variance, Varx0,t,xt(Y) =E x0 [Vart,xt(Y|x 0)] + Varx0 [Et,xt(Y|x 0)]. Define gθ(x0, t) =E xt[ℓθ(x0, t, xt)|x 0, t], h θ(x0) =E t,xt[ℓθ(x0, t, xt)|x 0] =E t[gθ(x0, t)|x 0]. Then the second term becomes Varx0[hθ(x0)] = Varx0 (Et[gθ(x0, t)]) C data variance. 20

work page
[15]

Second decomposition insideE x0[Vart,xt(Y|x 0)] We apply the law of total variance again, this time conditioning ont: Vart,xt(Y|x 0) =E t [Varxt(Y|x 0, t)] + Vart [Ext(Y|x 0, t)]. Note thatE xt(Y|x 0, t) =g θ(x0, t), so: Ex0[Vart,xt(Y|x 0)] =E x0 [Et [Varxt(ℓθ |x 0, t)]] +E x0 [Vart(gθ(x0, t)|x 0)], =E x0,t [Varxt(ℓθ |x 0, t)]| {z } A mask pattern noise +...

work page
[16]

A.5.2 P-POTS: PARETO-OPTIMALPROPERTY Consider the reweighted estimator 1 p(t) lθ(x0, t, xt), t∼p(t)

Combine Everything Putting steps 1 and 2 together, we have: Varx0,t,xt(ℓθ) =E x0,t [Varxt(ℓθ |x 0, t)]| {z } A +E x0 [Vart(gθ(x0, t)|x 0)]| {z } B + Varx0 (Et[gθ(x0, t)])| {z } C , which is exactly the claimed decomposition Eq.(2). A.5.2 P-POTS: PARETO-OPTIMALPROPERTY Consider the reweighted estimator 1 p(t) lθ(x0, t, xt), t∼p(t). As shown in Eq.??, this ...

work page 2025
[17]

response tokensR(countP R),

work page
[18]

Under the assumptions (A1) All tokens inRshareµ R, σ2 R; tokens inCshareµ C, σ2 C

syntax prompt tokensC(countP C), soP=P R +P C. Under the assumptions (A1) All tokens inRshareµ R, σ2 R; tokens inCshareµ C, σ2 C. 24 (A2)σ 2 C ≪σ 2 R. (A3)ρ CC ≪ρ RR. (A4)|ρ RC | ≤ √ρRRρCC . (A5)2w Rβ≤ (1−α)(P R −1)ρ RR σ2 RB + (1−β), whereα:=σ 2 C/σ2 R,β:=ρ CC /ρRR, wR :=P R/P, andB:=E[1/t]>0, we have A SC < A SR . ProofDenoteB:=E 1 t >0. By Eq.(15), A S...

work page 2024
[19]

Reduce the overall estimation variance (which favors using a larger number of stratan)

work page
[20]

While settingn=Bresults in highly fine-grained strata, it may lead to unstable estimates within each stratum due to too few samples

Ensure the estimation of within-stratum varianceσ 2 k–denoted byˆσ2 k–is sufficiently stable (which favors having more samples per stratum, i.e.,m=int(B/n)). While settingn=Bresults in highly fine-grained strata, it may lead to unstable estimates within each stratum due to too few samples. To balance this trade-off, we propose the following optimal number...

work page 2025

[1] [1]

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T

URLhttps://arxiv.org/abs/2309.12288. Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. Maskgit: Masked generative image transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11315–11325, 2022. Zhoujun Cheng, Haoyu Dong, Zhiruo Wang, Ran Jia, Jiaqi Guo, Yan Gao, Shi Han, Jian-Guang Lou...

work page doi:10.48550/arxiv.2411.09998 2022

[2] [2]

is invariant to the specific choice of the noise scheduleα t, and

work page

[3] [3]

Restricting attention tointerpolatingmasked diffusion models—i.e., those whose forward process qinterpolates between clean datax∈ Vand a target distributionCat(·;π), Sahoo et al

serves as an upper bound of −Epdata(x) pθ(x) , thus enabling a principled and simplified training framework for masked diffusion models. Restricting attention tointerpolatingmasked diffusion models—i.e., those whose forward process qinterpolates between clean datax∈ Vand a target distributionCat(·;π), Sahoo et al. (2024) derived a principled training obje...

work page 2024

[4] [4]

Collect trajectories{(s t, at, rt)}under the old policyπ θold

work page

[5] [5]

via GAE)

Compute advantage estimates ˆAt (e.g. via GAE)

work page

[6] [6]

ForKepochs, perform minibatch gradient ascent onL KLPEN(θ): θ←θ+η∇ θLKLPEN(θ)

work page

[7] [7]

Optionally adjustβto keep the KL close to a targetδ: β←    α+ β, ¯DDKL >1.5δ, α− β, ¯DDKL < δ/1.5, β,otherwise, where ¯DDKL =E t[DKL(πθold∥πθ)]

work page

[8] [8]

Setθ old ←θand repeat. A.2.4 GROUPRELATIVEPOLICYOPTIMIZATION Group Relative Policy Optimization (GRPO) extends PPO to settings where we have a collection ofGgroups (or tasks), each potentially requiring its own policy behavior while sharing parameters. LetG={1, . . . , G}denote the set of groups, with weightings{ω g}g∈G such thatP g ωg = 1. For each group...

work page

[9] [9]

For eachg∈ G, collect trajectories underπ θold(· | ·, g)

work page

[10] [10]

Compute group-specific advantages{ ˆAg t }

work page

[11] [11]

ForKepochs, perform minibatch gradient ascent onL GRPO(θ): θ←θ+η∇ θLGRPO(θ)

work page

[12] [12]

Optionally adjustβ(andγ, if used) to keep eachD g DKL (and inter-group KLs) near desired targets

work page

[13] [13]

A.3 RELATEDWORK In order to reduce training variance in diffusion models, the following strategies have been proposed: • Meng et al

Updateθ old ←θand repeat. A.3 RELATEDWORK In order to reduce training variance in diffusion models, the following strategies have been proposed: • Meng et al. (2022) proposes using denoising models trained at antithetic levels of Gaussian noise to recover arbitrary high-order derivatives of the data log-density. From our variance decompo- sition formula E...

work page 2022

[14] [14]

By the law of total variance, Varx0,t,xt(Y) =E x0 [Vart,xt(Y|x 0)] + Varx0 [Et,xt(Y|x 0)]

First Decomposition overx 0 Viewx 0 as the outer random variable. By the law of total variance, Varx0,t,xt(Y) =E x0 [Vart,xt(Y|x 0)] + Varx0 [Et,xt(Y|x 0)]. Define gθ(x0, t) =E xt[ℓθ(x0, t, xt)|x 0, t], h θ(x0) =E t,xt[ℓθ(x0, t, xt)|x 0] =E t[gθ(x0, t)|x 0]. Then the second term becomes Varx0[hθ(x0)] = Varx0 (Et[gθ(x0, t)]) C data variance. 20

work page

[15] [15]

Second decomposition insideE x0[Vart,xt(Y|x 0)] We apply the law of total variance again, this time conditioning ont: Vart,xt(Y|x 0) =E t [Varxt(Y|x 0, t)] + Vart [Ext(Y|x 0, t)]. Note thatE xt(Y|x 0, t) =g θ(x0, t), so: Ex0[Vart,xt(Y|x 0)] =E x0 [Et [Varxt(ℓθ |x 0, t)]] +E x0 [Vart(gθ(x0, t)|x 0)], =E x0,t [Varxt(ℓθ |x 0, t)]| {z } A mask pattern noise +...

work page

[16] [16]

A.5.2 P-POTS: PARETO-OPTIMALPROPERTY Consider the reweighted estimator 1 p(t) lθ(x0, t, xt), t∼p(t)

Combine Everything Putting steps 1 and 2 together, we have: Varx0,t,xt(ℓθ) =E x0,t [Varxt(ℓθ |x 0, t)]| {z } A +E x0 [Vart(gθ(x0, t)|x 0)]| {z } B + Varx0 (Et[gθ(x0, t)])| {z } C , which is exactly the claimed decomposition Eq.(2). A.5.2 P-POTS: PARETO-OPTIMALPROPERTY Consider the reweighted estimator 1 p(t) lθ(x0, t, xt), t∼p(t). As shown in Eq.??, this ...

work page 2025

[17] [17]

response tokensR(countP R),

work page

[18] [18]

Under the assumptions (A1) All tokens inRshareµ R, σ2 R; tokens inCshareµ C, σ2 C

syntax prompt tokensC(countP C), soP=P R +P C. Under the assumptions (A1) All tokens inRshareµ R, σ2 R; tokens inCshareµ C, σ2 C. 24 (A2)σ 2 C ≪σ 2 R. (A3)ρ CC ≪ρ RR. (A4)|ρ RC | ≤ √ρRRρCC . (A5)2w Rβ≤ (1−α)(P R −1)ρ RR σ2 RB + (1−β), whereα:=σ 2 C/σ2 R,β:=ρ CC /ρRR, wR :=P R/P, andB:=E[1/t]>0, we have A SC < A SR . ProofDenoteB:=E 1 t >0. By Eq.(15), A S...

work page 2024

[19] [19]

Reduce the overall estimation variance (which favors using a larger number of stratan)

work page

[20] [20]

While settingn=Bresults in highly fine-grained strata, it may lead to unstable estimates within each stratum due to too few samples

Ensure the estimation of within-stratum varianceσ 2 k–denoted byˆσ2 k–is sufficiently stable (which favors having more samples per stratum, i.e.,m=int(B/n)). While settingn=Bresults in highly fine-grained strata, it may lead to unstable estimates within each stratum due to too few samples. To balance this trade-off, we propose the following optimal number...

work page 2025