MARBLE: Multi-Aspect Reward Balance for Diffusion RL

Canyu Zhao; Chunhua Shen; Hao Chen; Jiacheng Li; Yunze Tong; Yu Qiao

arxiv: 2605.06507 · v1 · submitted 2026-05-07 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-08 13:13 UTC · model grok-4.3

keywords diffusion models · reinforcement learning · multi-reward optimization · gradient balancing · quadratic programming · RL fine-tuning · reward alignment

The pith

MARBLE solves a quadratic program on separate per-reward gradients to jointly improve every reward dimension during diffusion model fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that simply adding multiple reward signals together wastes training signal because most generated samples are strong on only one reward and irrelevant to the others. MARBLE keeps a separate advantage estimator for each reward, derives its own policy gradient, and solves a quadratic program to produce one combined update direction that stays helpful to every reward at once. An amortized version exploits the structure of the diffusion loss to keep the extra cost low, with EMA smoothing to handle batch-to-batch variation. If this works, multi-reward alignment no longer requires specialist models, hand-tuned weights, or staged schedules. The result is a single model that advances all criteria simultaneously at nearly baseline speed.
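The dilution effect described above can be seen in a toy example. The sketch below assumes group-standardized advantages (a GRPO-style normalization, which the abstract does not specify) and uses made-up reward values; `group_advantage` is a hypothetical helper, not a function from the paper.

```python
import numpy as np

def group_advantage(r):
    """Standardize rewards within one rollout group (assumed normalization)."""
    return (r - r.mean()) / (r.std() + 1e-8)

# Rows are rollouts, columns are rewards (toy numbers, not from the paper).
# Rollout 0 is a specialist for reward 0, rollout 1 for reward 1.
rewards = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
    [0.5, 0.5],
])

# One advantage estimate per reward: the specialists stand out clearly.
per_reward_adv = np.stack(
    [group_advantage(rewards[:, k]) for k in range(rewards.shape[1])], axis=1
)

# Advantage of the equally weighted sum: every rollout scores 0.5,
# so the specialists' signal cancels and all advantages are zero.
summed_adv = group_advantage(rewards.mean(axis=1))

print(per_reward_adv.round(3))
print(summed_adv)
```

In this constructed case the summed reward carries no learning signal at all, while each per-reward advantage still separates its specialist rollout from the rest.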

Core claim

By maintaining independent advantage estimators for each reward, computing separate policy gradients, and harmonizing them through a quadratic program that finds a beneficial combined direction, MARBLE produces updates that improve all reward dimensions at the same time instead of the dilution and negative alignments that occur under weighted summation.

What carries the argument

The quadratic program that computes non-negative coefficients for each per-reward gradient so their linear combination remains aligned with every individual gradient while exploiting the affine loss structure for efficiency.
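The abstract does not state the exact objective of the quadratic program, so the following is only a stand-in: the min-norm QP over the probability simplex used in MGDA-style multi-objective optimization, solved here with Frank-Wolfe. The function name `min_norm_coefficients`, the iteration count, and the toy gradients are all assumptions for illustration.

```python
import numpy as np

def min_norm_coefficients(grads, iters=200):
    """Frank-Wolfe for: min ||sum_k alpha_k g_k||^2, alpha >= 0, sum(alpha) = 1.

    This is an assumed stand-in for the paper's QP, not its stated formulation.
    """
    gram = grads @ grads.T                # K x K matrix of gradient dot products
    k = gram.shape[0]
    alpha = np.full(k, 1.0 / k)           # start from equal weights
    for _ in range(iters):
        j = int(np.argmin(gram @ alpha))  # simplex vertex that lowers the norm most
        d = -alpha.copy()
        d[j] += 1.0                       # search direction e_j - alpha
        curvature = d @ gram @ d
        if curvature <= 1e-12:
            break
        # Exact line search for a quadratic, clipped to stay on the simplex.
        step = np.clip(-(d @ gram @ alpha) / curvature, 0.0, 1.0)
        alpha = alpha + step * d
    return alpha

# Two conflicting toy gradients (flattened parameter vectors).
grads = np.array([[1.0, 0.0], [-3.0, 1.0]])
alpha = min_norm_coefficients(grads)
combined = alpha @ grads                  # balanced update direction
naive = grads.mean(axis=0)                # equal-weight sum for comparison

print(alpha, grads @ combined, grads @ naive)
```

For these toy gradients the equal-weight sum has a negative dot product with the first gradient, while the min-norm combination has a positive dot product with both. When the gradients are so opposed that the min-norm point is close to zero, the combined update shrinks accordingly, which is the degenerate case the referee report asks about.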

If this is right

All five reward dimensions improve simultaneously on SD3.5 Medium.
The worst-aligned reward's gradient cosine becomes consistently positive, instead of being negative in 80 percent of mini-batches as it is under weighted summation.
Training runs at 0.97 times the speed of the weighted-sum baseline.
The amortized formulation keeps the per-step cost near that of single-reward training.
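The gradient-cosine claim in the list above can be measured directly. The sketch below is one way to compute the diagnostic, assuming each per-reward gradient has been flattened into a row of a NumPy array; the helper names and toy vectors are hypothetical.

```python
import numpy as np

def per_reward_cosines(update, grads, eps=1e-12):
    """Cosine between one update direction and each per-reward gradient."""
    norms = np.linalg.norm(grads, axis=1) * np.linalg.norm(update)
    return (grads @ update) / (norms + eps)

def fraction_negative_worst_cosine(updates, grads_per_batch):
    """Share of mini-batches whose worst-aligned reward has a negative cosine."""
    worst = [per_reward_cosines(u, g).min()
             for u, g in zip(updates, grads_per_batch)]
    return float(np.mean(np.array(worst) < 0.0))

# Toy per-reward gradients for a single mini-batch.
grads = np.array([[1.0, 0.0], [-3.0, 1.0]])
naive = grads.mean(axis=0)        # equal-weight sum: opposes the first reward
balanced = np.array([1.0, 4.0])   # a direction with positive dot product to both

cos_naive = per_reward_cosines(naive, grads)
cos_balanced = per_reward_cosines(balanced, grads)
frac = fraction_negative_worst_cosine([naive, balanced], [grads, grads])
print(cos_naive, cos_balanced, frac)
```

Logging this fraction over training for both the weighted-sum baseline and the balanced update would reproduce the kind of statistic the abstract quotes (80% of mini-batches negative under weighted summation).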

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same gradient-balancing step could apply to other multi-objective reinforcement learning settings that currently rely on weighted sums.
If the quadratic program occasionally assigns near-zero weight to one reward, training might still need occasional monitoring to prevent silent neglect of certain criteria.
EMA smoothing of the balancing coefficients is likely essential for preventing the quadratic solutions from oscillating across batches.
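A minimal sketch of EMA smoothing on the balancing coefficients, under the assumption that the coefficients lie on the probability simplex. The decay value 0.9 and the helper name `ema_coefficients` are illustrative choices, not values taken from the paper.

```python
import numpy as np

def ema_coefficients(prev, new, decay=0.9):
    """Blend the previous smoothed coefficients with the current QP solution."""
    mixed = decay * prev + (1.0 - decay) * new
    # A convex mix of simplex points stays on the simplex;
    # renormalizing only guards against floating-point drift.
    return mixed / mixed.sum()

# Raw per-batch solutions that flip between two extremes (toy data).
raw = [np.array([1.0, 0.0]), np.array([0.0, 1.0])] * 3

smoothed = np.array([0.5, 0.5])
history = []
for alpha in raw:
    smoothed = ema_coefficients(smoothed, alpha)
    history.append(smoothed)
history = np.array(history)
print(history.round(3))
```

In this toy run the raw coefficients swing by 1.0 between batches, while the smoothed ones move by less than 0.1 per step. A higher decay gives steadier coefficients but slower reaction to a real change in how the rewards conflict.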

Load-bearing premise

Solving the quadratic program on per-reward gradients will always produce a stable combined update that improves every reward without creating new biases or instability absent from the weighted-sum baseline.

What would settle it

Check whether, in the same mini-batches where the weighted-sum baseline decreases at least one reward on held-out rollouts, the MARBLE update increases all rewards.
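The check can be written as a small test. The sketch assumes that held-out reward changes have already been measured for both updates and stored as arrays of shape (mini-batches, rewards); the function name and the three-valued return are illustrative.

```python
import numpy as np

def check_joint_improvement(baseline_delta, marble_delta):
    """Return True/False for the check, or None if the baseline never regresses.

    Both inputs hold held-out reward changes, shape (num_batches, num_rewards).
    """
    baseline_delta = np.asarray(baseline_delta, dtype=float)
    marble_delta = np.asarray(marble_delta, dtype=float)
    # Mini-batches where the weighted-sum update lowers at least one reward.
    conflict = (baseline_delta < 0).any(axis=1)
    if not conflict.any():
        return None  # nothing to test on: no conflicting batches observed
    # The claim holds only if every reward rises in all of those batches.
    return bool((marble_delta[conflict] > 0).all())
```

A `None` result matters here: if the baseline never decreases any reward, the chosen rewards are not in conflict and the experiment cannot distinguish the two methods.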

Original abstract

Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is intrinsically a multi-dimensional task, and multiple evaluation criteria need to be optimized simultaneously. Existing practice deal with multiple rewards by training one specialist model per reward, optimizing a weighted-sum reward $R(x)=\sum_k w_k R_k(x)$, or sequentially fine-tuning with a hand-crafted stage schedule. These approaches either fail to produce a unified model that can be jointly trained on all rewards or necessitates heavy manually tuned sequential training. We find that the failure stems from using a naive weighted-sum reward aggregation. This approach suffers from a sample-level mismatch because most rollouts are specialist samples, highly informative for certain reward dimensions but irrelevant for others; consequently, weighted summation dilutes their supervision. To address this issue, we propose MARBLE (Multi-Aspect Reward BaLancE), a gradient-space optimization framework that maintains independent advantage estimators for each reward, computes per-reward policy gradients, and harmonizes them into a single update direction without manually-tuned reward weighting, by solving a Quadratic Programming problem. We further propose an amortized formulation that exploits the affine structure of the loss used in DiffusionNFT, to reduce the per-step cost from K+1 backward passes to near single-reward baseline cost, together with EMA smoothing on the balancing coefficients to stabilize updates against transient single-batch fluctuations. On SD3.5 Medium with five rewards, MARBLE improves all five reward dimensions simultaneously, turns the worst-aligned reward's gradient cosine from negative under weighted summation in 80% of mini-batches to consistently positive, and runs at 0.97X the training speed of baseline training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MARBLE replaces weighted-sum reward aggregation with per-reward gradients, a QP step that keeps the combined update positively aligned with every per-reward gradient, and an amortization trick that keeps cost near baseline.

The core move is to stop summing rewards at the scalar level. Instead, the method maintains separate advantage estimators, computes each policy gradient on its own, then solves a quadratic program for coefficients that give the combined update a positive cosine with every individual gradient. The authors add an amortized backward pass that exploits the affine loss structure from DiffusionNFT, plus EMA smoothing on the coefficients to avoid batch-to-batch jitter. On SD3.5 Medium with five rewards the abstract reports that all five scores rise together, the worst cosine flips from negative in 80% of batches to consistently positive, and training runs at 0.97× the speed of baseline training. Those are the concrete claims that matter for anyone trying to align a diffusion model to several human preferences at once. The approach is distinct from the specialist-per-reward or hand-scheduled sequential baselines they cite, and the efficiency engineering is a practical plus.

The open question is what the QP actually returns when the per-reward gradients point in sharply different directions. If the only feasible solution yields a near-zero combined update, training stalls; the abstract gives no distribution of gradient norms, no count of how often the solver hits the boundary, and no ablation on the QP constraints themselves. Without those diagnostics it is hard to know whether the reported gains come from reliable harmonization or from the particular five rewards happening to be only mildly opposed.

The math and citation pattern look standard and the empirical headline is clear enough to justify review, but the paper needs to show the failure modes of the QP step before the method can be trusted for general use. This is for groups already running multi-objective RL on diffusion models who want a drop-in replacement for weighted sums. It deserves a serious referee.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MARBLE, a gradient-space framework for multi-reward RL fine-tuning of diffusion models. It maintains separate advantage estimators per reward, computes individual policy gradients, and solves a quadratic program to produce a single combined update direction. An amortized formulation exploiting the affine structure of the DiffusionNFT loss reduces cost from K+1 to near-baseline backward passes, with EMA smoothing on coefficients. On SD3.5 Medium with five rewards, the method is claimed to improve all five reward dimensions simultaneously, convert the worst-aligned per-reward gradient cosine from negative (under weighted summation) in 80% of mini-batches to consistently positive, and run at 0.97X baseline training speed.

Significance. If the QP-based balancing reliably produces non-degenerate updates and the reported gains hold with full ablations, the work would be significant for multi-aspect alignment of diffusion models. It directly targets the sample-level mismatch in weighted-sum aggregation, a practical bottleneck when rewards are specialized. Credit is given for the amortized formulation that preserves efficiency and for the explicit gradient-cosine diagnostic, which provides a falsifiable check on alignment quality.

major comments (2)

[Abstract] Abstract: the central claim that the QP 'turns the worst-aligned reward's gradient cosine from negative ... to consistently positive' and improves all five dimensions simultaneously is load-bearing, yet no statistics are supplied on the distribution of resulting gradient norms ||g||, the fraction of batches where the QP returns a near-zero solution, or the achieved min-cosine values. This leaves open whether success occurs because the five chosen rewards happen to be only mildly conflicting or because the QP formulation reliably finds a feasible non-degenerate direction.
[Abstract] Abstract (QP description): the quadratic program is described only at the level of 'harmonizes them into a single update direction without manually-tuned reward weighting.' The exact objective, constraints (e.g., non-negativity of coefficients, normalization, or explicit positivity of g · ∇_k), and feasibility guarantees are not stated, preventing assessment of behavior under strong gradient conflict as raised by the stress-test note.

minor comments (2)

[Abstract] Abstract: grammatical issue in 'Existing practice deal with multiple rewards' (should be 'practices deal' or 'practice deals').
[Abstract] Abstract: acronym expansion 'Multi-Aspect Reward BaLancE' uses inconsistent internal capitalization; standard form is 'Multi-Aspect Reward Balance'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on the abstract claims and QP formulation. We address each point below and will revise the manuscript accordingly to provide additional statistics, clarify the method, and strengthen the presentation of results.

Referee: [Abstract] Abstract: the central claim that the QP 'turns the worst-aligned reward's gradient cosine from negative ... to consistently positive' and improves all five dimensions simultaneously is load-bearing, yet no statistics are supplied on the distribution of resulting gradient norms ||g||, the fraction of batches where the QP returns a near-zero solution, or the achieved min-cosine values. This leaves open whether success occurs because the five chosen rewards happen to be only mildly conflicting or because the QP formulation reliably finds a feasible non-degenerate direction.

Authors: We agree that additional quantitative diagnostics would make the central claim more robust. The manuscript already reports that MARBLE converts the worst-aligned per-reward cosine from negative in 80% of mini-batches (under weighted summation) to consistently positive while improving all five rewards on SD3.5 Medium. In the revision we will add: (i) statistics and/or histograms of the combined gradient norm ||g||, (ii) the fraction of batches in which the QP returns a near-zero solution (defined by ||g|| below a small threshold), and (iii) the distribution of achieved minimum cosine values. These additions will directly address whether the QP reliably produces non-degenerate updates. The 80% negative-cosine rate under the baseline already indicates non-trivial conflict among the five rewards; we will also reference the stress-test experiments to show behavior under stronger conflicts. Revision: yes.

Referee: [Abstract] Abstract (QP description): the quadratic program is described only at the level of 'harmonizes them into a single update direction without manually-tuned reward weighting.' The exact objective, constraints (e.g., non-negativity of coefficients, normalization, or explicit positivity of g · ∇_k), and feasibility guarantees are not stated, preventing assessment of behavior under strong gradient conflict as raised by the stress-test note.

Authors: The full QP formulation is given in Section 3.2 of the manuscript together with the EMA smoothing mechanism. It covers the objective (balancing per-reward gradients to maximize the minimum alignment), the constraints (non-negative coefficients that sum to one, with explicit encouragement of positive dot products g · ∇_k), and the amortized solution. We acknowledge that the abstract is too high-level. In the revision we will expand the abstract to concisely state the QP objective and key constraints, and we will add a short paragraph discussing feasibility and behavior under strong gradient conflict, directly referencing the stress-test results already present in the paper. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity in MARBLE's derivation

The paper introduces MARBLE as a gradient-space optimization framework that maintains independent advantage estimators for each reward, computes per-reward policy gradients, and combines them via a standard quadratic program without manually tuned weights. No step reduces by construction to a fitted quantity defined by the target result, a self-citation chain, or an ansatz smuggled from prior work by the same authors. The amortized formulation exploits the known affine structure of the DiffusionNFT loss as an external property, and empirical claims rest on observed improvements over the weighted-sum baseline rather than tautological predictions or renamings. The derivation is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the DiffusionNFT loss admits an affine structure usable for amortization and that quadratic programming on independent gradients yields a non-dominated update direction.

axioms (1)

Domain assumption: The loss function used in DiffusionNFT has an affine structure that permits amortization of the multi-reward gradient computation.
Invoked to reduce per-step cost from K+1 backward passes to near single-reward baseline cost.
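
One way to see why an affine loss would permit amortization is the short derivation below. It is a sketch under stated assumptions, not the paper's derivation: the loss is taken to be affine in the per-sample advantages $A_i$, the balancing coefficients $\alpha_k$ are taken to be non-negative and to sum to one, and the symbols $a$, $b_i$, $A^{(k)}$ and $\bar{A}$ are introduced here for illustration.

```latex
% Assumed affine form of the loss in the per-sample advantages A_i:
\mathcal{L}(\theta; A) = a(\theta) + \sum_i A_i \, b_i(\theta)

% Combine the K per-reward gradients with coefficients on the simplex,
% \alpha_k \ge 0 and \sum_k \alpha_k = 1:
\sum_{k=1}^{K} \alpha_k \nabla_\theta \mathcal{L}\bigl(\theta; A^{(k)}\bigr)
  = \Bigl(\sum_{k} \alpha_k\Bigr) \nabla_\theta a(\theta)
    + \sum_i \Bigl(\sum_{k} \alpha_k A^{(k)}_i\Bigr) \nabla_\theta b_i(\theta)
  = \nabla_\theta \mathcal{L}\bigl(\theta; \bar{A}\bigr),
\qquad \bar{A}_i = \sum_{k} \alpha_k A^{(k)}_i .
```

Under these assumptions, a single backward pass on the mixed advantage $\bar{A}$ yields the same update as combining $K$ separate per-reward gradients. How the coefficients $\alpha_k$ themselves are obtained without the per-reward backward passes is not described in the abstract and is not covered by this sketch.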
