f-Trajectory Balance: A Loss Family for Tuning GFlowNets, Generative Models, and LLMs with Off- and On-Policy Data
Pith reviewed 2026-05-19 15:55 UTC · model grok-4.3
The pith
A family of losses lets generative models match any f-divergence on-policy while keeping the same global minimizer off-policy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors restrict attention to translation-invariant loss functions on target and model log probabilities and establish a one-to-one correspondence between these losses and f-divergences. Evaluated on-policy, each such loss yields gradients identical to those of its corresponding f-divergence; evaluated off-policy, the same loss remains a valid surrogate with an unchanged global minimizer.
What carries the argument
Translation-invariant surrogate losses on log-probability pairs that realize the on-policy gradient of any chosen f-divergence.
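A short derivation sketch may help show why a surrogate of this kind can realize the on-policy gradient. It assumes the conventions $\Delta_\theta(y) = \log p_\theta(y) - \log p^\star(y)$ and $D_f(p_\theta \| p^\star) = \mathbb{E}_{p^\star}[f(p_\theta / p^\star)]$, a differentiable $f$, and samples treated as fixed when differentiating. These conventions are inferred from the formulas quoted elsewhere on this page and are not confirmed against the paper's own definitions.

```latex
% Assumed conventions (not confirmed against the paper):
%   u = p_\theta(y) / p^\star(y) = e^{\Delta_\theta(y)},
%   L_f(\Delta) = \int_0^{\Delta} ( f'(e^t) - f'(1) ) dt,  so  L_f'(\Delta) = f'(e^{\Delta}) - f'(1).
\begin{align*}
\nabla_\theta D_f(p_\theta \,\|\, p^\star)
  &= \nabla_\theta \sum_y p^\star(y)\, f\!\left(\tfrac{p_\theta(y)}{p^\star(y)}\right)
   = \sum_y f'(u)\, \nabla_\theta p_\theta(y)
   = \mathbb{E}_{p_\theta}\!\left[ f'(u)\, \nabla_\theta \log p_\theta(y) \right], \\
\mathbb{E}_{p_\theta}\!\left[ \nabla_\theta L_f(\Delta_\theta(y)) \right]
  &= \mathbb{E}_{p_\theta}\!\left[ \big(f'(u) - f'(1)\big)\, \nabla_\theta \log p_\theta(y) \right]
   = \mathbb{E}_{p_\theta}\!\left[ f'(u)\, \nabla_\theta \log p_\theta(y) \right],
\end{align*}
% using \mathbb{E}_{p_\theta}[\nabla_\theta \log p_\theta(y)] = 0 in the last step.
```

Under these assumptions the two right-hand sides coincide, which is the on-policy identity. The constant $f'(1)$ only drops out in expectation under $p_\theta$, so the identity is specific to on-policy sampling.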
If this is right
- Training GFlowNets or LLMs with a mode-covering member of the family (for example, the forward-KL member) should produce more mode-covering behavior than the original trajectory-balance loss, whose on-policy gradients match those of the reverse KL.
- The same losses can be applied directly to off-policy replay data without changing the location of the global minimum.
- SynFlowNets trained with different members of the family should exhibit the mode-covering or mode-seeking traits predicted by the parent f-divergence.
- Asynchronous LLM tuning can use the new losses on trajectories collected by older policy versions while still converging to the intended divergence optimum.
Where Pith is reading between the lines
- The construction may let practitioners select the divergence that best matches the desired exploration-exploitation trade-off without having to redesign the training pipeline for off-policy data.
- The same translation-invariance trick could be tested on other classes of divergences, such as integral probability metrics, to see whether similar off-policy surrogates appear.
- If the correspondence is bijective, every translation-invariant loss already in use for generative models can be reinterpreted as the on-policy gradient of some f-divergence.
Load-bearing premise
The surrogate losses must remain valid and share the same global minimizer when the data distribution differs from the model, which requires the loss on log probabilities to be invariant under additive translations.
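The premise can be probed numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's experiment: the model is a softmax over five states, the target and the two sampling distributions are made-up values, and the two losses are the closed forms obtained by assuming f(u) = u log u (reverse KL) and f(u) = -log u (forward KL). Each loss depends only on the difference delta = log p_model - log p_target, so adding the same constant to both log probabilities leaves it unchanged.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two members of the family, written as functions of delta only.
# Closed forms assume f(u) = u*log(u) (reverse KL) and f(u) = -log(u) (forward KL).
def loss_reverse_kl(delta):
    return 0.5 * delta * delta

def loss_forward_kl(delta):
    return delta + math.expm1(-delta)  # equals delta + exp(-delta) - 1

def off_policy_objective(logits, target, mu, loss):
    """Expected loss under a sampling distribution mu that need not equal the model."""
    model = softmax(logits)
    return sum(
        m * loss(math.log(p) - math.log(t))
        for m, p, t in zip(mu, model, target)
    )

TARGET = [0.4, 0.05, 0.1, 0.05, 0.4]        # hypothetical two-mode target
SAMPLERS = {
    "uniform": [0.2] * 5,
    "skewed": [0.7, 0.1, 0.1, 0.05, 0.05],  # differs from both model and target
}
OPTIMUM = [math.log(t) for t in TARGET]      # logits whose softmax equals TARGET

rng = random.Random(0)
results = []
for loss in (loss_reverse_kl, loss_forward_kl):
    for mu in SAMPLERS.values():
        at_optimum = off_policy_objective(OPTIMUM, TARGET, mu, loss)
        perturbed = [
            off_policy_objective(
                [x + rng.gauss(0.0, 1.0) for x in OPTIMUM], TARGET, mu, loss
            )
            for _ in range(200)
        ]
        results.append((at_optimum, min(perturbed)))

for at_optimum, best_perturbed in results:
    print(f"optimum={at_optimum:.3e}  best perturbed={best_perturbed:.3e}")
```

Because both losses are non-negative and vanish only at delta = 0, the objective is zero at the target for any full-support sampler and positive elsewhere. This is a sanity check on a toy case rather than a proof.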
What would settle it
For a simple two-mode target distribution, compute the on-policy gradient of the proposed loss for the reverse KL and check whether it exactly equals the analytic gradient of the reverse KL itself.
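A minimal version of this check can be sketched as follows. It assumes a softmax model over five states, a made-up two-mode target, and the reverse-KL surrogate taken to be half the squared log-probability difference; none of these values come from the paper. The reference gradient is obtained by central finite differences on the reverse KL, and the surrogate gradient is the exact expectation under the model with the samples held fixed.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(logits, target):
    """KL(model || target) for a softmax model."""
    model = softmax(logits)
    return sum(p * (math.log(p) - math.log(t)) for p, t in zip(model, target))

def reverse_kl_grad_fd(logits, target, eps=1e-6):
    """Central finite-difference gradient of the reverse KL w.r.t. the logits."""
    grad = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += eps
        down[k] -= eps
        grad.append((reverse_kl(up, target) - reverse_kl(down, target)) / (2 * eps))
    return grad

def surrogate_grad_on_policy(logits, target):
    """Exact E_{y ~ model}[ d/dlogits of 0.5 * delta(y)^2 ], sampling held fixed."""
    model = softmax(logits)
    delta = [math.log(p) - math.log(t) for p, t in zip(model, target)]
    grad = [0.0] * len(logits)
    for y, (p_y, d_y) in enumerate(zip(model, delta)):
        for k in range(len(logits)):
            # d delta(y) / d logit_k = d log p(y) / d logit_k = 1[k == y] - p_k
            score = (1.0 if k == y else 0.0) - model[k]
            grad[k] += p_y * d_y * score
    return grad

TARGET = [0.4, 0.05, 0.1, 0.05, 0.4]   # hypothetical two-mode target
LOGITS = [0.3, -0.2, 0.5, 0.0, -1.0]   # arbitrary model parameters

fd = reverse_kl_grad_fd(LOGITS, TARGET)
sg = surrogate_grad_on_policy(LOGITS, TARGET)
max_gap = max(abs(a - b) for a, b in zip(fd, sg))
print("largest componentwise gap:", max_gap)
```

Agreement here only covers the reverse-KL member on one toy problem; other members would need their own surrogate and divergence plugged into the same comparison.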
Original abstract
In GFlowNets and variational inference, it has been shown that the mean square error between target and model log probabilities is an effective, low-variance surrogate loss for training generative models. This loss has the property that when evaluated on-policy its gradients correspond to those of the KL divergence, while off-policy it remains a valid loss with the same global minimizer. In this work, we demonstrate that this construction can be extended to the whole family of $f$-divergences, leading to a family of losses whose on-policy gradients are those of the corresponding $f$-divergence, but retain the same global minimizer off-policy. Specifically, we show that the on-policy gradients lead to a one-to-one correspondence between translation-invariant loss functions on the target and model log probabilities, and $f$-divergences. This equivalence allows us to design new surrogate loss functions for tuning a wide class of generative models that inherit the properties of the corresponding $f$-divergence, such as being more mode covering, whilst being applicable to off-policy data. We apply our losses on a range of tasks, including classic synthetic examples, SynFlowNets for molecule discovery, and asynchronous large language model (LLM) tuning, demonstrating that our models retain their predicted properties on- and off-policy in a wide class of generative models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces f-Trajectory Balance, a family of surrogate losses for GFlowNets and generative models that generalizes the mean-squared error on log-probabilities. It establishes a one-to-one correspondence between translation-invariant loss functions on target and model log-probabilities and f-divergences, such that on-policy gradients recover the f-divergence while off-policy evaluation preserves the same global minimizer due to translation invariance. The resulting losses inherit mode-covering or mode-seeking behavior and are demonstrated on synthetic tasks, SynFlowNets for molecule discovery, and asynchronous LLM tuning.
Significance. If the claimed correspondence holds rigorously, the work supplies a principled and flexible toolkit for designing low-variance surrogate losses that can be trained off-policy yet reproduce the statistical properties of any f-divergence. This is particularly valuable for GFlowNet training, variational inference, and LLM alignment where off-policy data is abundant and mode-seeking versus mode-covering trade-offs matter. The empirical retention of predicted behaviors across domains supports practical utility.
major comments (1)
- [§3.2, Theorem 1 and surrounding derivation] The central claim that the on-policy gradient map is surjective onto the entire family of f-divergences is load-bearing for the title and abstract statements. The provided construction demonstrates injectivity via translation invariance but establishes surjectivity only for f that are twice differentiable and strictly convex with suitable growth conditions at 0 and ∞; explicit constructions or counter-examples for boundary cases (e.g., total variation or non-strictly convex members) are needed to confirm coverage of the whole family.
minor comments (2)
- [Notation and §3] The notation distinguishing the surrogate loss L from the underlying f-divergence could be made more explicit in the statement of the equivalence to avoid reader confusion between on-policy gradient equivalence and off-policy minimizer equivalence.
- [§5.2] In the molecule discovery experiments, the reported metrics would benefit from an additional ablation that isolates the effect of the chosen f from the GFlowNet architecture itself.
Simulated Author's Rebuttal
We thank the referee for their careful reading of the manuscript and for identifying this important point about the scope of Theorem 1. We address the comment below.
Point-by-point responses
- Referee: [§3.2, Theorem 1 and surrounding derivation] The central claim that the on-policy gradient map is surjective onto the entire family of f-divergences is load-bearing for the title and abstract statements. The provided construction demonstrates injectivity via translation invariance but establishes surjectivity only for f that are twice differentiable and strictly convex with suitable growth conditions at 0 and ∞; explicit constructions or counter-examples for boundary cases (e.g., total variation or non-strictly convex members) are needed to confirm coverage of the whole family.
  Authors: We agree that the surjectivity direction of the claimed one-to-one correspondence in Theorem 1 is established under the standard regularity conditions on f (twice continuous differentiability, strict convexity, and suitable growth at the boundaries). These are the conditions under which most f-divergences of practical interest in the literature are defined and for which the associated variational problems are well-posed. The injectivity part follows directly from translation invariance and holds more generally. For the commonly used members of the family (KL, reverse KL, Jensen-Shannon, Pearson χ², etc.) that satisfy the stated assumptions, the on-policy gradient map is indeed surjective onto the corresponding f-divergence, which is what underpins the mode-covering / mode-seeking behaviors shown in the experiments. We will revise §3.2 and the surrounding text to state the precise regularity assumptions on f explicitly. In addition, we will add an explicit construction for the total-variation case (treated as the limit of a sequence of smoothed, strictly convex f) together with a short discussion of non-strictly convex members, including a counter-example where surjectivity fails. These additions will be included in the revised manuscript.
  Revision: yes
Circularity Check
Derivation of f-trajectory balance losses is self-contained with no reductions to inputs or self-citations
The paper constructs surrogate losses from standard f-divergences by leveraging the translation-invariance property of losses on log probabilities. The claimed one-to-one correspondence between such losses and f-divergences is presented as a direct mathematical result shown via on-policy gradients, without any fitted parameters being renamed as predictions or load-bearing self-citations. The off-policy validity follows immediately from the invariance definition itself. No equation or step in the provided derivation chain reduces the target result to its own inputs by construction.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "the on-policy gradients lead to a one to one correspondence between translation invariant loss functions on the target and model log probabilities, and f-divergences"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, theorem costAlphaLog_fourth_deriv_at_zero
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: $L_f(\Delta_\theta(y)) = \int_0^{\Delta_\theta(y)} \big(f'(\exp(t)) - f'(1)\big)\,dt$
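To make the integral formula in the quoted passage concrete, the sketch below evaluates it numerically and compares it with closed forms for three choices of f. The generators are assumptions chosen for illustration: f(u) = u log u for reverse KL, f(u) = -log u for forward KL, and f(u) = (u - 1)^2 / 2 for Pearson χ². The paper may scale or name them differently. The helper name `surrogate_loss_numeric` is hypothetical.

```python
import math

def surrogate_loss_numeric(f_prime, delta, steps=2000):
    """Composite Simpson approximation of the integral of f'(exp(t)) - f'(1) from 0 to delta."""
    if steps % 2:
        raise ValueError("steps must be even for Simpson's rule")
    if delta == 0.0:
        return 0.0
    h = delta / steps  # negative when delta < 0, which yields the signed integral
    integrand = lambda t: f_prime(math.exp(t)) - f_prime(1.0)
    total = integrand(0.0) + integrand(delta)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * integrand(i * h)
    return total * h / 3.0

# name -> (derivative of the assumed generator f, closed form of the integral)
FAMILY = {
    "reverse_kl": (lambda u: math.log(u) + 1.0, lambda d: 0.5 * d * d),
    "forward_kl": (lambda u: -1.0 / u, lambda d: d + math.exp(-d) - 1.0),
    "pearson_chi2": (lambda u: u - 1.0, lambda d: math.exp(d) - d - 1.0),
}

max_error = 0.0
for name, (f_prime, closed_form) in FAMILY.items():
    for delta in (-1.5, -0.3, 0.0, 0.7, 2.0):
        gap = abs(surrogate_loss_numeric(f_prime, delta) - closed_form(delta))
        max_error = max(max_error, gap)
print("largest gap between quadrature and closed form:", max_error)
```

With the reverse-KL generator the integral reduces to half the squared log-probability difference, which matches the mean-square-error loss the abstract describes as the starting point.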
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] PMLR, 2020. D. Go, T. Korbak, G. Kruszewski, J. Rozen, N. Ryu, and M. Dymetman. Aligning language models with preferences through f-divergence minimization. arXiv preprint arXiv:2302.08215, 2023. J. Han, M. Jiang, Y. Song, S. Ermon, and M. Xu. f-PO: Generalizing preference optimization with f-divergence minimization. arXiv preprint arXiv:2410.21662, 2024...
- [2] If $\mu = p_\theta$, the expected auto-differentiated gradients match the $f$-divergence gradient: $\nabla_\theta D_f(p_\theta \| p^\star) = \mathbb{E}_{p_\theta}[\nabla_\theta L_f(\Delta_\theta(y))]$. A.1.1. Proof of Part 1: Convexity and Global Minimizer. Let the scalar loss function with respect to the log-probability difference be denoted by $L(\Delta) = \int_0^{\Delta} (f'(\exp(t)) - f'(1))\,dt$. We determine the properties of $L(\Delta)$ by analyzing its derivative...
- [3] Gradient of the Surrogate Loss: The gradient of the loss $L_f$ with respect to the backward parameters $\phi$, estimated on-policy, is: $\nabla_\phi J_{\mathrm{on}} = \mathbb{E}_{\tau \sim \pi_F}[\nabla_\phi L_f(\log u)] = \mathbb{E}_{\tau \sim \pi_F}[(f'(u) - f'(1))\,\nabla_\phi(-\log \pi_B)] = \mathbb{E}_{\tau \sim \pi_F}[-f'(u)\,\nabla_\phi \log \pi_B]$. (The constant term $f'(1)$ vanishes because $\mathbb{E}_{\pi_F}[\nabla$...
- [4] Gradient of the Candidate Divergence: Consider the generic divergence $D_g(\pi_F \| \pi_B) = \int \pi_B(\tau)\, g\!\left(\frac{\pi_F(\tau)}{\pi_B(\tau)}\right) d\tau$. Differentiating with respect to $\phi$: $\nabla_\phi D_g = \int \left(\nabla_\phi \pi_B \cdot g(u) + \pi_B\, g'(u)\, \nabla_\phi u\right) d\tau$. Using the identities $\nabla_\phi u = \pi_F \nabla_\phi(\pi_B^{-1}) = -u\, \nabla_\phi \log \pi_B$ and $\nabla_\phi \pi_B = \pi_B \nabla_\phi \log \pi_B$: $\nabla_\phi D_g = \int \pi_B\, (g(u) - u g'(u))\, \nabla_\phi \log \pi_B\, d\tau = \int \pi_F\, \frac{1}{u}\, (g(u) - u g'(u))\, \nabla_\phi \log \pi_B\, d\tau = \mathbb{E}_{\tau \sim \pi_F}\big[\frac{g(u)}{u} - g'(u)$...
- [5] Reverse KL Divergence (Standard Vargrad). Normalization: $\log Z = -\mathbb{E}_B[\Delta(y)]$. Gradient weight: $w^{DG}_i = \Delta_i - \mathbb{E}_B[\Delta(y)]$.
- [6] Forward KL Divergence. Normalization: $\log Z$ satisfies $\mathbb{E}_B[e^{-(\Delta + \log Z)}] = 1 \implies e^{\log Z} = \mathbb{E}_B[e^{-\Delta}]$. Gradient weight: $w^{DG}_i = 1 - \frac{e^{-\Delta_i}}{\mathbb{E}_B[e^{-\Delta(y)}]}$.
- [7] Pearson $\chi^2$ Divergence. Normalization: $\log Z$ satisfies $\mathbb{E}_B[e^{\Delta + \log Z}] = 1 \implies e^{-\log Z} = \mathbb{E}_B[e^{\Delta}]$. Gradient weight: $w^{DG}_i = \frac{e^{\Delta_i}}{\mathbb{E}_B[e^{\Delta(y)}]} - 1$.
- [8] Neyman $\chi^2$ Divergence. Normalization: $e^{2 \log Z} = \mathbb{E}_B[e^{-2\Delta}]$. Gradient weight: $w^{DG}_i = \frac{1}{2}\left(1 - \frac{e^{-2\Delta_i}}{\mathbb{E}_B[e^{-2\Delta(y)}]}\right)$.
- [9] Squared Hellinger Distance. Normalization: $e^{\frac{1}{2} \log Z} = \mathbb{E}_B[e^{-\Delta/2}]$. Gradient weight: $w^{DG}_i = 2\left(1 - \frac{e^{-\Delta_i/2}}{\mathbb{E}_B[e^{-\Delta(y)/2}]}\right)$.
- [10] Total Variation. Normalization: $\log Z = -\mathrm{Median}(\{\Delta_j\})$. Gradient weight: $w^{DG}_i = \mathrm{sgn}(\Delta_i - \mathrm{Median}(\{\Delta(y)\}))$.
- [11] General $\alpha$-Divergence. Normalization: $e^{(\alpha - 1) \log Z} = (\mathbb{E}_B[e^{(\alpha - 1)\Delta}])^{-1}$. Gradient weight: $w^{DG}_i = \frac{1}{\alpha - 1}\left(\frac{e^{(\alpha - 1)\Delta_i}}{\mathbb{E}_B[e^{(\alpha - 1)\Delta(y)}]} - 1\right)$. E. Minimal Implementation. In this section, we provide minimal formulations of our loss for both standard cases

    import torch, math

    def log_z_estimate(delta, name='ReverseKL', alpha=1):
        B = delta.size(0)
        logmeanexp = lambda k: ...
- [12] Increment a coordinate $d$ by 1: $s \to s + e_d$ (allowed only if $s_d < H - 1$).
- [13] Terminate: $s \to s^\top$ (transition to the corresponding terminating state in $\mathcal{X}$). We also provide the following plots for on-policy training with the same losses, showing a similar trend. Figure 4. An on-policy recreation of the synthetic grid experiment in the main text. Panels: (a) JSD vs. Trajectories; (b) Modes Found vs. Trajectories...
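The normalization and gradient-weight formulas listed in entries [5] to [10] can be sketched on a small batch as follows. This is an illustration only: it uses plain Python lists rather than torch tensors, the function name `batch_gradient_weights` and the batch values are hypothetical, the general α case is omitted, and it does not attempt to reproduce the paper's truncated `log_z_estimate` code.

```python
import math
import statistics

def _mean(values):
    return sum(values) / len(values)

def _sign(x):
    return (x > 0) - (x < 0)

def batch_gradient_weights(deltas, name="ReverseKL"):
    """Return (log_z, weights) for a batch of log-probability differences."""
    if name == "ReverseKL":
        log_z = -_mean(deltas)
        weights = [d + log_z for d in deltas]
    elif name == "ForwardKL":
        norm = _mean([math.exp(-d) for d in deltas])
        log_z = math.log(norm)
        weights = [1.0 - math.exp(-d) / norm for d in deltas]
    elif name == "PearsonChi2":
        norm = _mean([math.exp(d) for d in deltas])
        log_z = -math.log(norm)
        weights = [math.exp(d) / norm - 1.0 for d in deltas]
    elif name == "NeymanChi2":
        norm = _mean([math.exp(-2.0 * d) for d in deltas])
        log_z = 0.5 * math.log(norm)
        weights = [0.5 * (1.0 - math.exp(-2.0 * d) / norm) for d in deltas]
    elif name == "SquaredHellinger":
        norm = _mean([math.exp(-0.5 * d) for d in deltas])
        log_z = 2.0 * math.log(norm)
        weights = [2.0 * (1.0 - math.exp(-0.5 * d) / norm) for d in deltas]
    elif name == "TotalVariation":
        med = statistics.median(deltas)
        log_z = -med
        weights = [float(_sign(d - med)) for d in deltas]
    else:
        raise ValueError(f"unknown divergence name: {name}")
    return log_z, weights

NAMES = ["ReverseKL", "ForwardKL", "PearsonChi2", "NeymanChi2",
         "SquaredHellinger", "TotalVariation"]
BATCH = [0.3, -1.2, 0.8, 0.1, -0.4]  # made-up values; odd length so the median is a batch element
SHIFTED = [d + 5.0 for d in BATCH]    # same batch with a constant added to every entry

summary = {}
for name in NAMES:
    log_z, weights = batch_gradient_weights(BATCH, name)
    _, shifted_weights = batch_gradient_weights(SHIFTED, name)
    summary[name] = (
        abs(_mean(weights)),                                        # batch average of the weights
        max(abs(a - b) for a, b in zip(weights, shifted_weights)),  # effect of the constant shift
    )
    print(name, round(log_z, 4), [round(w, 4) for w in weights])
```

On this batch the weights average to zero for each listed divergence and are unaffected by adding a constant to every Δ, which is consistent with the batch-estimated log Z absorbing any additive offset.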