$f$-Trajectory Balance: A Loss Family for Tuning GFlowNets, Generative Models, and LLMs with Off- and On-Policy Data

Jake Fawkes; Jason Hartford

arxiv: 2605.15417 · v1 · pith:CZ346JSLnew · submitted 2026-05-14 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-19 15:55 UTC · model grok-4.3


keywords: f-divergences · GFlowNets · surrogate losses · off-policy training · trajectory balance · generative models · LLM fine-tuning


The pith

A family of losses lets generative models match any f-divergence on-policy while keeping the same global minimizer off-policy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper starts from the known fact that the mean-squared error between target and model log probabilities is a surrogate loss whose gradients match those of the KL divergence when samples come from the model itself, yet it remains a valid loss with the same global minimizer when samples come from another distribution. It shows that this construction generalizes to f-divergences by replacing the squared error with a translation-invariant function of the pair of log probabilities. The resulting family therefore inherits the mode-seeking or mode-covering behavior of its parent divergence during on-policy updates, while remaining usable with off-policy data such as replay buffers or asynchronous LLM rollouts. Experiments on synthetic distributions, molecule generators, and language-model fine-tuning are reported to show that the new losses produce the expected divergence properties without sacrificing the off-policy guarantee.

Core claim

By restricting attention to translation-invariant loss functions on target and model log probabilities, the authors establish a one-to-one correspondence with f-divergences; each such loss, when evaluated on-policy, yields gradients identical to those of its corresponding f-divergence, yet the same loss remains a valid surrogate with unchanged global minimizer when evaluated off-policy.

What carries the argument

Translation-invariant surrogate losses on log-probability pairs that realize the on-policy gradient of any chosen f-divergence.
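
As a worked instance, the loss formula quoted elsewhere on this page, $L_f(\Delta) = \int_0^{\Delta} (f'(\exp(t)) - f'(1))\,dt$, can be evaluated in closed form for a few generators. The sketch below assumes $\Delta = \log p_\theta(y) - \log p^\star(y)$ and the convention $D_f(p_\theta \| p^\star) = \mathbb{E}_{p^\star}[f(p_\theta / p^\star)]$; the specific normalizations of $f$ are illustrative choices and may differ from the paper's by constant factors.

```latex
\[
L_f(\Delta) = \int_0^{\Delta} \bigl(f'(e^{t}) - f'(1)\bigr)\,dt,
\qquad \Delta = \log p_\theta(y) - \log p^\star(y)
\]
\begin{align*}
f(u) &= u\log u: & L_f(\Delta) &= \int_0^{\Delta} t\,dt = \tfrac{1}{2}\Delta^{2}
  && \text{(reverse KL, squared error)}\\
f(u) &= -\log u: & L_f(\Delta) &= \int_0^{\Delta} \bigl(1 - e^{-t}\bigr)\,dt = \Delta + e^{-\Delta} - 1
  && \text{(forward KL)}\\
f(u) &= \tfrac{1}{2}(u-1)^{2}: & L_f(\Delta) &= \int_0^{\Delta} \bigl(e^{t} - 1\bigr)\,dt = e^{\Delta} - \Delta - 1
  && \text{(Pearson } \chi^{2}\text{)}
\end{align*}
```

In each case $L_f(0) = 0$, $L_f'(0) = 0$, and $L_f'' > 0$, so the pointwise loss depends only on the difference of the two log probabilities and has its unique minimum at $\Delta = 0$.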

If this is right

Training GFlowNets or LLMs with the reverse-KL member of the family should produce more mode-seeking behavior than the original trajectory-balance loss.
The same losses can be applied directly to off-policy replay data without changing the location of the global minimum.
SynFlowNets trained with different members of the family should exhibit the mode-covering or mode-seeking traits predicted by the parent f-divergence.
Asynchronous LLM tuning can use the new losses on trajectories collected by older policy versions while still converging to the intended divergence optimum.
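
To make the mode-seeking versus mode-covering distinction concrete, the following sketch compares two hand-picked candidate models against a two-mode target under three members of the α-family. The parametrization of the generator, the function name `alpha_divergence`, and the three toy distributions are assumptions made for illustration; they are chosen to agree with the page's convention that α = 0 is forward KL and α = 1 is reverse KL, but they are not taken from the paper.

```python
import math

def alpha_divergence(p, target, alpha):
    """D_f(p || target) = sum_y target(y) * f(p(y) / target(y)).

    Assumed generator: f(u) = (u**alpha - alpha*u + alpha - 1) / (alpha*(alpha - 1)),
    with the usual limits at alpha = 0 (forward KL) and alpha = 1 (reverse KL).
    """
    total = 0.0
    for p_y, t_y in zip(p, target):
        u = p_y / t_y
        if abs(alpha) < 1e-8:            # alpha = 0: KL(target || p)
            f_u = -math.log(u) + u - 1.0
        elif abs(alpha - 1.0) < 1e-8:    # alpha = 1: KL(p || target)
            f_u = u * math.log(u) - u + 1.0
        else:
            f_u = (u ** alpha - alpha * u + alpha - 1.0) / (alpha * (alpha - 1.0))
        total += t_y * f_u
    return total

# Toy target with two modes separated by a low-probability valley.
target = [0.49, 0.02, 0.49]
seek = [0.96, 0.02, 0.02]        # commits to a single mode
cover = [1 / 3, 1 / 3, 1 / 3]    # spreads mass over both modes and the valley

for alpha in (0.0, 1.0, 2.0):
    d_seek = alpha_divergence(seek, target, alpha)
    d_cover = alpha_divergence(cover, target, alpha)
    preferred = "seek" if d_seek < d_cover else "cover"
    print(f"alpha={alpha}: seek={d_seek:.3f} cover={d_cover:.3f} -> prefers {preferred}")
```

On this toy example the α = 0 member scores the spread-out candidate lower, while α = 1 and α = 2 score the single-mode candidate lower, which is the qualitative ordering the list above predicts.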

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The construction may let practitioners select the divergence that best matches the desired exploration-exploitation trade-off without having to redesign the training pipeline for off-policy data.
The same translation-invariance trick could be tested on other classes of divergences, such as integral probability metrics, to see whether similar off-policy surrogates appear.
If the correspondence is bijective, every translation-invariant loss already in use for generative models can be reinterpreted as the on-policy gradient of some f-divergence.

Load-bearing premise

The surrogate losses must remain valid and share the same global minimizer when the data distribution differs from the model, which requires the loss on log probabilities to be invariant under additive translations.
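
The premise can be checked numerically on a small example. The sketch below evaluates the pointwise loss for several members of the α-family and confirms that it is zero at a log-probability gap of zero and positive elsewhere, so the expected loss under any full-support sampling distribution is minimized at the same point. The closed form used in `alpha_loss`, the function names, and the toy sampling distributions are illustrative assumptions, and the example ignores the separate question of estimating a normalizing constant.

```python
import math

def alpha_loss(delta, alpha):
    """Pointwise loss L_f(delta) for the alpha-family (assumed parametrization).

    Assumes f'(u) - f'(1) = (u**(alpha - 1) - 1) / (alpha - 1), so that
    L_f(delta) = integral_0^delta (exp((alpha - 1) * t) - 1) / (alpha - 1) dt.
    """
    k = alpha - 1.0
    if abs(k) < 1e-8:
        return 0.5 * delta * delta  # alpha = 1: squared error (reverse KL)
    return (math.expm1(k * delta) - k * delta) / (k * k)

def expected_loss(deltas, mu, alpha):
    """Expected loss when samples y are drawn from an arbitrary distribution mu."""
    return sum(m * alpha_loss(d, alpha) for d, m in zip(deltas, mu))

alphas = [0.0, 0.5, 1.0, 2.0]
grid = [x / 2.0 for x in range(-6, 7) if x != 0]  # nonzero gaps in [-3, 3]

# The pointwise loss is zero at delta = 0 and strictly positive on the grid.
pointwise_ok = all(
    alpha_loss(0.0, a) == 0.0 and all(alpha_loss(d, a) > 0.0 for d in grid)
    for a in alphas
)

# Two different full-support sampling distributions over three outcomes.
mu_a = [0.7, 0.2, 0.1]
mu_b = [0.1, 0.1, 0.8]
matched = [0.0, 0.0, 0.0]     # model log probabilities equal the target's
mismatch = [0.4, -0.3, 0.2]   # some nonzero log-probability gaps

for a in alphas:
    for mu in (mu_a, mu_b):
        print(a, mu, expected_loss(matched, mu, a), expected_loss(mismatch, mu, a))
```

Changing the sampling distribution changes the value of the expected loss at a mismatched model, but not the fact that the matched model attains zero.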

What would settle it

For a simple two-mode target distribution, compute the on-policy gradient of the proposed loss for the reverse KL and check whether it exactly equals the analytic gradient of the reverse KL itself.
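
A minimal version of this check can be sketched in plain Python. It assumes a four-outcome categorical model parametrized by softmax logits and a normalized two-mode target, uses the squared-error member $L(\Delta) = \Delta^2/2$ as the reverse-KL surrogate, and compares its on-policy expected gradient (with the sampling distribution held fixed) against a finite-difference gradient of the reverse KL. All names and the toy numbers are hypothetical, and agreement is tested up to numerical tolerance rather than exactly.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(logits, target):
    """KL(p_theta || target), the f-divergence with f(u) = u log u."""
    p = softmax(logits)
    return sum(p_y * math.log(p_y / t_y) for p_y, t_y in zip(p, target))

def surrogate_grad(logits, target):
    """On-policy expected gradient of L(delta) = delta**2 / 2.

    The sampler is treated as fixed (no gradient flows through it), so the
    gradient is sum_y p(y) * delta(y) * d log p(y) / d logits.
    """
    p = softmax(logits)
    delta = [math.log(p_y / t_y) for p_y, t_y in zip(p, target)]
    n = len(p)
    grad = [0.0] * n
    for y in range(n):
        for k in range(n):
            dlogp = (1.0 if y == k else 0.0) - p[k]  # softmax log-prob derivative
            grad[k] += p[y] * delta[y] * dlogp
    return grad

def finite_diff_grad(fn, logits, eps=1e-6):
    """Central finite-difference gradient of fn at logits."""
    grad = []
    for k in range(len(logits)):
        up = list(logits)
        dn = list(logits)
        up[k] += eps
        dn[k] -= eps
        grad.append((fn(up) - fn(dn)) / (2 * eps))
    return grad

target = [0.45, 0.05, 0.05, 0.45]   # two modes at the ends
logits = [0.3, -0.2, 0.1, -1.0]     # arbitrary model parameters

g_sur = surrogate_grad(logits, target)
g_kl = finite_diff_grad(lambda th: reverse_kl(th, target), logits)
max_gap = max(abs(a - b) for a, b in zip(g_sur, g_kl))
print("max |surrogate grad - reverse KL grad| =", max_gap)
```

The two gradients agree because the extra term in the reverse-KL gradient, $\sum_y \nabla_\theta p_\theta(y)$, is zero for a normalized model.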

Figures

Figures reproduced from arXiv: 2605.15417 by Jake Fawkes, Jason Hartford.

**Figure 1.** f-divergences include the α-divergence family, which contains both Forward KL (α = 0) and Reverse KL / standard Trajectory Balance (α = 1) as special cases. Lower α yields a more mode-covering loss, higher α a more mode-seeking one.

**Figure 4.** An on-policy recreation of the synthetic grid experiment in the main text.

**Figure 5.** Swept reward distributions over different α and β values in SynFlowNet training.

**Figure 6.** Comparison of scaffold CDFs across different targets (columns) and β values (rows) during SynFlowNet training. In 5 of the 6 plots, the SynFlowNet with annealed α has its reward CDF to the right of that trained via trajectory balance, indicating that it generates a higher proportion of high-reward molecules. On the other hand, α = 1.2 leads to clear mode collapse, choosing a few high-value molecules.

**Figure 7.** Tanimoto diversity of unique samples across varying α and β values for GSK3, sEH, and DRD2 targets during SynFlowNet training. This demonstrates that lower α leads to more diverse molecules and that annealing can also lead to more diverse molecules across a range of settings.

**Figure 8.** Training curves for reward and entropy across all four models. The large asynchronous delay causes instability in PPO training, whereas all f-trajectory balance losses lead to stable training.

**Figure 9.** Comparison of f-divergence losses, with more mode-covering losses on the top and more mode-seeking losses on the bottom.

**Figure 10.** Training curves for reward and entropy across all four models. The large asynchronous delay causes instability in PPO training, whereas all f-trajectory balance losses lead to stable training.

Original abstract

In GFlowNets and variational inference, it has been shown that the mean square error between target and model log probabilities is an effective, low variance, surrogate loss for training generative models. This loss has the property that when evaluated \emph{on-policy} its gradients correspond to those of the KL divergence, while \emph{off-policy} it remains a valid loss with the same global minimizer. In this work, we demonstrate that this construction can be extended to the whole family of $f$-divergences, leading to a family of losses whose on-policy gradients are that of the corresponding $f$-divergence, but retain the same global minimizer off-policy. Specifically, we show that the on-policy gradients lead to a one to one correspondence between translation invariant loss functions on the target and model log probabilities, and $f$-divergences. This equivalence allows us to design new surrogate loss functions for tuning a wide class of generative models that inherit the properties of the corresponding $f$-divergence, such as being more mode covering, whilst being applicable to off-policy data. We apply our losses on a range of tasks, including classic synthetic examples, SynFlowNets for molecule discovery, and asynchronous large language model (LLM) tuning, demonstrating that our models retain their predicted properties on- and off-policy in a wide class of generative models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This extends the MSE-KL trick to a full family of f-divergence losses that work off-policy for GFlowNets and generative models.


The main point: this paper gives a general recipe for building f-divergence surrogate losses that can be used off-policy to train GFlowNets, molecule generators, and LLMs, while preserving the divergence's mode-seeking or mode-covering behavior.

The authors build on the observation that the mean squared error between log probabilities acts as a KL surrogate on-policy but stays valid off-policy. By characterizing translation-invariant losses on log probabilities, they establish a one-to-one link to f-divergences. On-policy, the gradients match those of the chosen f-divergence; off-policy, the minimizer is unchanged thanks to the invariance. The result is a family of f-trajectory balance losses. The extension is new, and it organizes the space cleanly rather than treating each divergence separately.

The applications to SynFlowNets and asynchronous LLM tuning are practical and appear to confirm the theory, with models showing the expected behaviors. The math and definitions look direct, without circularity, and the experiments cover a good range of tasks.

One soft spot is whether the construction works for the full family without hidden restrictions on f, such as convexity or growth conditions that might limit which divergences admit well-behaved losses. The abstract claims it does, so the paper probably has the details. For the LLM part, asynchronous tuning may have variance issues that are not fully explored, but that is minor.

The paper is aimed at people developing training methods for generative models with off-policy data. A reader interested in GFlowNets or variational methods would get value from the new losses and the correspondence result. It deserves a serious referee because the core idea is clean and the empirical checks are there. I recommend sending it for peer review.

Referee Report

1 major / 2 minor

Summary. The paper introduces f-Trajectory Balance, a family of surrogate losses for GFlowNets and generative models that generalizes the mean-squared error on log-probabilities. It establishes a one-to-one correspondence between translation-invariant loss functions on target and model log-probabilities and f-divergences, such that on-policy gradients recover the f-divergence while off-policy evaluation preserves the same global minimizer due to translation invariance. The resulting losses inherit mode-covering or mode-seeking behavior and are demonstrated on synthetic tasks, SynFlowNets for molecule discovery, and asynchronous LLM tuning.

Significance. If the claimed correspondence holds rigorously, the work supplies a principled and flexible toolkit for designing low-variance surrogate losses that can be trained off-policy yet reproduce the statistical properties of any f-divergence. This is particularly valuable for GFlowNet training, variational inference, and LLM alignment where off-policy data is abundant and mode-seeking versus mode-covering trade-offs matter. The empirical retention of predicted behaviors across domains supports practical utility.

major comments (1)

[§3.2, Theorem 1 and surrounding derivation] The central claim that the on-policy gradient map is surjective onto the entire family of f-divergences is load-bearing for the title and abstract statements. The provided construction demonstrates injectivity via translation invariance but establishes surjectivity only for f that are twice differentiable and strictly convex with suitable growth conditions at 0 and ∞; explicit constructions or counter-examples for boundary cases (e.g., total variation or non-strictly convex members) are needed to confirm coverage of the whole family.

minor comments (2)

[Notation and §3] The notation distinguishing the surrogate loss L from the underlying f-divergence could be made more explicit in the statement of the equivalence to avoid reader confusion between on-policy gradient equivalence and off-policy minimizer equivalence.
[§5.2] In the molecule discovery experiments, the reported metrics would benefit from an additional ablation that isolates the effect of the chosen f from the GFlowNet architecture itself.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for identifying this important point about the scope of Theorem 1. We address the comment below.


Referee: [§3.2, Theorem 1 and surrounding derivation] The central claim that the on-policy gradient map is surjective onto the entire family of f-divergences is load-bearing for the title and abstract statements. The provided construction demonstrates injectivity via translation invariance but establishes surjectivity only for f that are twice differentiable and strictly convex with suitable growth conditions at 0 and ∞; explicit constructions or counter-examples for boundary cases (e.g., total variation or non-strictly convex members) are needed to confirm coverage of the whole family.

Authors: We agree that the surjectivity direction of the claimed one-to-one correspondence in Theorem 1 is established under the standard regularity conditions on f (twice continuous differentiability, strict convexity, and suitable growth at the boundaries). These are the conditions under which most f-divergences of practical interest in the literature are defined and for which the associated variational problems are well-posed. The injectivity part follows directly from translation invariance and holds more generally. For the commonly used members of the family (KL, reverse KL, Jensen-Shannon, Pearson χ², etc.) that satisfy the stated assumptions, the on-policy gradient map is indeed surjective onto the corresponding f-divergence, which is what underpins the mode-covering / mode-seeking behaviors shown in the experiments. We will revise §3.2 and the surrounding text to state the precise regularity assumptions on f explicitly. In addition, we will add an explicit construction for the total-variation case (treated as the limit of a sequence of smoothed, strictly convex f) together with a short discussion of non-strictly convex members, including a counter-example where surjectivity fails. These additions will be included in the revised manuscript.

Revision: yes

Circularity Check

0 steps flagged

Derivation of f-trajectory balance losses is self-contained with no reductions to inputs or self-citations

full rationale

The paper constructs surrogate losses from standard f-divergences by leveraging the translation-invariance property of losses on log probabilities. The claimed one-to-one correspondence between such losses and f-divergences is presented as a direct mathematical result shown via on-policy gradients, without any fitted parameters being renamed as predictions or load-bearing self-citations. The off-policy validity follows immediately from the invariance definition itself. No equation or step in the provided derivation chain reduces the target result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim appears to rest on standard properties of f-divergences and log-probability losses whose details are not supplied here.

pith-pipeline@v0.9.0 · 5794 in / 1267 out tokens · 45082 ms · 2026-05-19T15:55:56.227045+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "the on-policy gradients lead to a one to one correspondence between translation invariant loss functions on the target and model log probabilities, and f-divergences"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_fourth_deriv_at_zero · unclear
Paper passage: $L_f(\Delta_\theta(y)) = \int_0^{\Delta_\theta(y)} (f'(\exp(t)) - f'(1))\,dt$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

PMLR, 2020. D. Go, T. Korbak, G. Kruszewski, J. Rozen, N. Ryu, and M. Dymetman. Aligning language models with preferences through f-divergence minimization. arXiv preprint arXiv:2302.08215, 2023. J. Han, M. Jiang, Y. Song, S. Ermon, and M. Xu. f-po: Generalizing preference optimization with f-divergence minimization. arXiv preprint arXiv:2410.21662, 2024...

work page doi:10.64434/tml 2020
[2]

If $\mu = p_\theta$, the expected auto-differentiated gradients match the f-divergence gradient: $\nabla_\theta D_f(p_\theta \| p^\star) = \mathbb{E}_{p_\theta}[\nabla_\theta L_f(\Delta_\theta(y))]$. A.1.1. Proof of Part 1: Convexity and Global Minimizer. Let the scalar loss function with respect to the log-probability difference be denoted by $L(\Delta) = \int_0^{\Delta} (f'(\exp(t)) - f'(1))\,dt$. We determine the properties of $L(\Delta)$ by analyzing its derivative...

work page
[3]

Gradient of the Surrogate Loss: The gradient of the loss $L_f$ with respect to the backward parameters $\phi$, estimated on-policy, is: $\nabla_\phi J_{\mathrm{on}} = \mathbb{E}_{\tau \sim \pi_F}[\nabla_\phi L_f(\log u)] = \mathbb{E}_{\tau \sim \pi_F}[(f'(u) - f'(1))\,\nabla_\phi(-\log \pi_B)] = \mathbb{E}_{\tau \sim \pi_F}[-f'(u)\,\nabla_\phi \log \pi_B]$ (The constant term $f'(1)$ vanishes because $\mathbb{E}_{\pi_F}[\nabla$...

work page
[4]

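Gradient of the Candidate Divergence: Consider the generic divergence $D_g(\pi_F \| \pi_B) = \int \pi_B(\tau)\, g\!\left(\frac{\pi_F(\tau)}{\pi_B(\tau)}\right) d\tau$. Differentiating with respect to $\phi$: $\nabla_\phi D_g = \int \left(\nabla_\phi \pi_B \cdot g(u) + \pi_B\, g'(u)\, \nabla_\phi u\right) d\tau$. Using the identity $\nabla_\phi u = \pi_F \nabla_\phi(\pi_B^{-1}) = -u\, \nabla_\phi \log \pi_B$ and $\nabla_\phi \pi_B = \pi_B \nabla_\phi \log \pi_B$: $\nabla_\phi D_g = \int \pi_B\, (g(u) - u\, g'(u))\, \nabla_\phi \log \pi_B\, d\tau = \int \pi_F\, \frac{1}{u}\, (g(u) - u\, g'(u))\, \nabla_\phi \log \pi_B\, d\tau = \mathbb{E}_{\tau \sim \pi_F}\!\left[\left(\frac{g(u)}{u} - g'(u)\right) \cdots\right]$...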

work page
[5]

Reverse KL Divergence (standard VarGrad). Normalization: $\log Z = -\mathbb{E}_B[\Delta(y)]$. Gradient weight: $w_i^{DG} = \Delta_i - \mathbb{E}_B[\Delta(y)]$.

work page
[6]

Forward KL Divergence. Normalization: $\log Z$ satisfies $\mathbb{E}_B[e^{-(\Delta + \log Z)}] = 1 \implies e^{\log Z} = \mathbb{E}_B[e^{-\Delta}]$. Gradient weight: $w_i^{DG} = 1 - \frac{e^{-\Delta_i}}{\mathbb{E}_B[e^{-\Delta(y)}]}$.

work page
[7]

Pearson $\chi^2$ Divergence. Normalization: $\log Z$ satisfies $\mathbb{E}_B[e^{\Delta + \log Z}] = 1 \implies e^{-\log Z} = \mathbb{E}_B[e^{\Delta}]$. Gradient weight: $w_i^{DG} = \frac{e^{\Delta_i}}{\mathbb{E}_B[e^{\Delta(y)}]} - 1$.

work page
[8]

Neyman $\chi^2$ Divergence. Normalization: $e^{2 \log Z} = \mathbb{E}_B[e^{-2\Delta}]$. Gradient weight: $w_i^{DG} = \frac{1}{2}\left(1 - \frac{e^{-2\Delta_i}}{\mathbb{E}_B[e^{-2\Delta(y)}]}\right)$.

work page
[9]

Squared Hellinger Distance. Normalization: $e^{\frac{1}{2} \log Z} = \mathbb{E}_B[e^{-\Delta/2}]$. Gradient weight: $w_i^{DG} = 2\left(1 - \frac{e^{-\Delta_i/2}}{\mathbb{E}_B[e^{-\Delta(y)/2}]}\right)$.

work page
[10]

Total Variation. Normalization: $\log Z = -\mathrm{Median}(\{\Delta_j\})$. Gradient weight: $w_i^{DG} = \mathrm{sgn}(\Delta_i - \mathrm{Median}(\{\Delta(y)\}))$.

work page
[11]

General α-Divergence. Normalization: $e^{(\alpha-1)\log Z} = (\mathbb{E}_B[e^{(\alpha-1)\Delta}])^{-1}$. Gradient weight: $w_i^{DG} = \frac{1}{\alpha-1}\left(\frac{e^{(\alpha-1)\Delta_i}}{\mathbb{E}_B[e^{(\alpha-1)\Delta(y)}]} - 1\right)$.

E. Minimal Implementation. In this section we provide minimal formulations of our loss for both standard cases

    import torch, math

    def log_z_estimate(delta, name='ReverseKL', alpha=1):
        B = delta.size(0)
        logmeanexp = lambda k: ...

work page
[12]

Increment a coordinate $d$ by 1: $s \to s + e_d$ (allowed only if $s_d < H - 1$)

work page
[13]

Terminate: $s \to s^{\top}$ (transition to the corresponding terminating state in $X$). We also provide the following plots for on-policy training with the same losses, showing a similar trend: (a) JSD vs. Trajectories (b) Modes Found vs. Trajectories. Figure 4. An on-policy recreation of the synthetic grid experiment in the main text...

work page 2025
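
The batch normalization and gradient-weight formulas listed in entries [5] to [11] above can be sketched for the α-family in dependency-free Python. This is not the authors' implementation (their snippet uses `torch` and is truncated on this page); the function names `alpha_log_z` and `alpha_weights`, the list-based interface, and the example batch are hypothetical, and the code simply transcribes the general α normalization and weight as reconstructed from the extracted text.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the batch mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def alpha_log_z(delta, alpha=1.0):
    """Batch estimate of log Z for the alpha-family.

    Solves exp((alpha - 1) * log Z) = 1 / mean(exp((alpha - 1) * delta));
    at alpha = 1 this reduces to log Z = -mean(delta).
    """
    k = alpha - 1.0
    if abs(k) < 1e-8:
        return -sum(delta) / len(delta)
    return -log_mean_exp([k * d for d in delta]) / k

def alpha_weights(delta, alpha=1.0):
    """Per-sample gradient weights, i.e. L_f'(delta_i + log Z)."""
    k = alpha - 1.0
    log_z = alpha_log_z(delta, alpha)
    if abs(k) < 1e-8:
        return [d + log_z for d in delta]  # delta_i - mean(delta)
    return [math.expm1(k * (d + log_z)) / k for d in delta]

# Hypothetical batch of log-probability differences.
batch_delta = [0.5, -1.0, 2.0, 0.1]
for alpha in (0.0, 0.5, 1.0, 2.0):
    weights = alpha_weights(batch_delta, alpha)
    print(alpha, [round(w, 4) for w in weights], "sum:", round(sum(weights), 12))
```

With this choice of $\log Z$ the weights sum to zero over the batch for every α, which matches the centered form of the listed reverse-KL weight $\Delta_i - \mathbb{E}_B[\Delta(y)]$.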
