MARBLE: Multi-Aspect Reward Balance for Diffusion RL
Pith reviewed 2026-05-08 13:13 UTC · model grok-4.3
The pith
MARBLE solves a quadratic program on separate per-reward gradients to jointly improve every reward dimension during diffusion model fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By maintaining independent advantage estimators for each reward, computing separate policy gradients, and harmonizing them through a quadratic program that finds a beneficial combined direction, MARBLE produces updates that improve all reward dimensions at the same time, avoiding the dilution and negative gradient alignment that occur under weighted summation.
What carries the argument
The quadratic program that computes non-negative coefficients for each per-reward gradient so their linear combination remains aligned with every individual gradient while exploiting the affine loss structure for efficiency.
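As a rough illustration of this kind of program, the sketch below solves a min-norm problem over the probability simplex, the classic multiple-gradient-descent formulation. This is an assumption swapped in for illustration: the page does not state MARBLE's exact objective or constraints. The names `project_simplex` and `min_norm_coeffs` and the toy gradients are hypothetical.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)

def min_norm_coeffs(grads, steps=500):
    """Minimize ||sum_k a_k g_k||^2 over the simplex by projected gradient descent.

    Assumed formulation (min-norm / MGDA style), not necessarily MARBLE's QP.
    """
    G = grads @ grads.T                       # K x K Gram matrix of the gradients
    K = G.shape[0]
    a = np.full(K, 1.0 / K)                   # start from uniform coefficients
    lr = 1.0 / (2.0 * np.linalg.eigvalsh(G)[-1] + 1e-12)  # 1 / Lipschitz constant
    for _ in range(steps):
        a = project_simplex(a - lr * 2.0 * (G @ a))
    return a

# Two conflicting toy gradients (made-up numbers).
grads = np.array([[1.0, 0.0], [-0.5, 1.0]])
alpha = min_norm_coeffs(grads)
g = alpha @ grads                             # combined update direction
```

In this formulation the coefficients are non-negative and sum to one, and the min-norm solution has a non-negative dot product with every per-reward gradient, which matches the alignment property described above. When the gradients conflict strongly the min-norm solution can shrink toward zero, which is the degenerate case the referee asks about.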
If this is right
- All five reward dimensions improve simultaneously on SD3.5 Medium.
- The worst-aligned reward's gradient cosine becomes consistently positive, whereas under weighted summation it is negative in 80 percent of mini-batches.
- Training runs at 0.97 times the speed of the weighted-sum baseline.
- The amortized formulation keeps the per-step cost near that of single-reward training.
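The gradient-cosine diagnostic in the list above can be sketched as follows. This is a minimal illustration, assuming flattened per-reward gradients are available as rows of a NumPy array; `min_cosine` and the toy numbers are hypothetical and not taken from the paper.

```python
import numpy as np

def min_cosine(update, grads, eps=1e-12):
    """Smallest cosine similarity between `update` and any row of `grads`."""
    dots = grads @ update
    norms = np.linalg.norm(grads, axis=1) * np.linalg.norm(update)
    return float(np.min(dots / (norms + eps)))

# Toy per-reward gradients that point in nearly opposite directions.
grads = np.array([[1.0, 0.0], [-1.0, 0.2]])

# A weighted-sum update with fixed, hand-picked weights (illustrative only).
weighted_sum = np.array([0.7, 0.3]) @ grads
worst = min_cosine(weighted_sum, grads)
```

Here `worst` is negative: the weighted-sum update moves against the second reward's gradient. A balanced update would be expected to keep this value above zero.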
Where Pith is reading between the lines
- The same gradient-balancing step could apply to other multi-objective reinforcement learning settings that currently rely on weighted sums.
- If the quadratic program occasionally assigns near-zero weight to one reward, training might still need occasional monitoring to prevent silent neglect of certain criteria.
- EMA smoothing of the balancing coefficients is likely essential for preventing the quadratic solutions from oscillating across batches.
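The EMA smoothing mentioned in the last point can be sketched in a few lines. This is only an illustration: the decay `beta`, the renormalization step, and the name `ema_update` are assumptions, since the page does not give the paper's exact smoothing rule.

```python
import numpy as np

def ema_update(prev, new, beta=0.9):
    """Blend new coefficients into a running average and renormalize to sum to one."""
    mixed = beta * prev + (1.0 - beta) * new
    return mixed / mixed.sum()

# Start uniform over three rewards, then feed two extreme per-batch solutions.
alpha = np.full(3, 1.0 / 3.0)
for batch_alpha in [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]:
    alpha = ema_update(alpha, batch_alpha)
```

Even though each per-batch solution puts all its weight on one reward, the smoothed coefficients stay close to uniform, so a single noisy batch cannot silence a reward outright.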
Load-bearing premise
Solving the quadratic program on per-reward gradients will always produce a stable combined update that improves every reward without creating new biases or instability absent from the weighted-sum baseline.
What would settle it
Check whether, in the same mini-batches where the weighted-sum baseline decreases at least one reward on held-out rollouts, the MARBLE update increases all rewards.
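One way to score this check is sketched below. It assumes per-batch reward changes on held-out rollouts have already been measured and stored as arrays of shape (batches, rewards). The function name and the toy numbers are hypothetical.

```python
import numpy as np

def conflict_resolution_rate(delta_baseline, delta_candidate):
    """Among batches where the baseline lowers at least one reward,
    return the fraction where the candidate update raises every reward."""
    conflicted = (delta_baseline < 0).any(axis=1)
    if not conflicted.any():
        return float("nan")                   # no conflicted batches to judge
    resolved = (delta_candidate[conflicted] > 0).all(axis=1)
    return float(resolved.mean())

# Made-up reward changes for three batches and two rewards.
delta_baseline = np.array([[0.1, -0.2], [0.3, 0.1], [-0.1, 0.2]])
delta_candidate = np.array([[0.1, 0.05], [0.2, 0.2], [0.1, -0.01]])
rate = conflict_resolution_rate(delta_baseline, delta_candidate)
```

A rate near one on real measurements would support the premise. A rate well below one would show that the balanced update does not reliably fix the batches where the weighted sum fails.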
Original abstract
Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is intrinsically a multi-dimensional task, and multiple evaluation criteria need to be optimized simultaneously. Existing practice deal with multiple rewards by training one specialist model per reward, optimizing a weighted-sum reward $R(x)=\sum_k w_k R_k(x)$, or sequentially fine-tuning with a hand-crafted stage schedule. These approaches either fail to produce a unified model that can be jointly trained on all rewards or necessitates heavy manually tuned sequential training. We find that the failure stems from using a naive weighted-sum reward aggregation. This approach suffers from a sample-level mismatch because most rollouts are specialist samples, highly informative for certain reward dimensions but irrelevant for others; consequently, weighted summation dilutes their supervision. To address this issue, we propose MARBLE (Multi-Aspect Reward BaLancE), a gradient-space optimization framework that maintains independent advantage estimators for each reward, computes per-reward policy gradients, and harmonizes them into a single update direction without manually-tuned reward weighting, by solving a Quadratic Programming problem. We further propose an amortized formulation that exploits the affine structure of the loss used in DiffusionNFT, to reduce the per-step cost from K+1 backward passes to near single-reward baseline cost, together with EMA smoothing on the balancing coefficients to stabilize updates against transient single-batch fluctuations. On SD3.5 Medium with five rewards, MARBLE improves all five reward dimensions simultaneously, turns the worst-aligned reward's gradient cosine from negative under weighted summation in 80% of mini-batches to consistently positive, and runs at 0.97X the training speed of baseline training.
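The dilution effect described in the abstract can be seen in a toy example. The sketch below assumes group-standardized advantages, which is a common choice but is not confirmed here as the paper's estimator. The rewards, weights, and the helper `standardize` are made up for illustration.

```python
import numpy as np

def standardize(x, eps=1e-8):
    """Zero-mean, unit-variance normalization along the rollout axis."""
    return (x - x.mean(axis=0, keepdims=True)) / (x.std(axis=0, keepdims=True) + eps)

# Toy rewards: 4 rollouts x 2 rewards. Rollout 0 is a specialist for reward 0.
rewards = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [1.0, 1.0]])
weights = np.array([0.5, 0.5])

per_reward_adv = standardize(rewards)         # one advantage column per reward
summed_adv = standardize(rewards @ weights)   # one scalar advantage per rollout
```

For rollout 0 the per-reward advantages are clearly positive for reward 0 and clearly negative for reward 1, but the weighted-sum advantage is zero, so that rollout contributes no learning signal for either reward.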
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MARBLE, a gradient-space framework for multi-reward RL fine-tuning of diffusion models. It maintains separate advantage estimators per reward, computes individual policy gradients, and solves a quadratic program to produce a single combined update direction. An amortized formulation exploiting the affine structure of the DiffusionNFT loss reduces the per-step cost from K+1 backward passes to near that of the single-reward baseline, and EMA smoothing is applied to the balancing coefficients. On SD3.5 Medium with five rewards, the method is claimed to improve all five reward dimensions simultaneously, turn the worst-aligned per-reward gradient cosine from negative in 80% of mini-batches (under weighted summation) to consistently positive, and run at 0.97X baseline training speed.
Significance. If the QP-based balancing reliably produces non-degenerate updates and the reported gains hold with full ablations, the work would be significant for multi-aspect alignment of diffusion models. It directly targets the sample-level mismatch in weighted-sum aggregation, a practical bottleneck when rewards are specialized. Credit is given for the amortized formulation that preserves efficiency and for the explicit gradient-cosine diagnostic, which provides a falsifiable check on alignment quality.
major comments (2)
- [Abstract] Abstract: the central claim that the QP 'turns the worst-aligned reward's gradient cosine from negative ... to consistently positive' and improves all five dimensions simultaneously is load-bearing, yet no statistics are supplied on the distribution of resulting gradient norms ||g||, the fraction of batches where the QP returns a near-zero solution, or the achieved min-cosine values. This leaves open whether success occurs because the five chosen rewards happen to be only mildly conflicting or because the QP formulation reliably finds a feasible non-degenerate direction.
- [Abstract] Abstract (QP description): the quadratic program is described only at the level of 'harmonizes them into a single update direction without manually-tuned reward weighting.' The exact objective, constraints (e.g., non-negativity of coefficients, normalization, or explicit positivity of g · ∇_k), and feasibility guarantees are not stated, preventing assessment of behavior under strong gradient conflict as raised by the stress-test note.
minor comments (2)
- [Abstract] Abstract: grammatical issue in 'Existing practice deal with multiple rewards' (should be 'practices deal' or 'practice deals').
- [Abstract] Abstract: acronym expansion 'Multi-Aspect Reward BaLancE' uses inconsistent internal capitalization; standard form is 'Multi-Aspect Reward Balance'.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on the abstract claims and QP formulation. We address each point below and will revise the manuscript accordingly to provide additional statistics, clarify the method, and strengthen the presentation of results.
Point-by-point responses
- Referee: [Abstract] Abstract: the central claim that the QP 'turns the worst-aligned reward's gradient cosine from negative ... to consistently positive' and improves all five dimensions simultaneously is load-bearing, yet no statistics are supplied on the distribution of resulting gradient norms ||g||, the fraction of batches where the QP returns a near-zero solution, or the achieved min-cosine values. This leaves open whether success occurs because the five chosen rewards happen to be only mildly conflicting or because the QP formulation reliably finds a feasible non-degenerate direction.
Authors: We agree that additional quantitative diagnostics would make the central claim more robust. The manuscript already reports that MARBLE converts the worst-aligned per-reward cosine from negative in 80% of mini-batches (under weighted summation) to consistently positive while improving all five rewards on SD3.5 Medium. In the revision we will add: (i) statistics and/or histograms of the combined gradient norm ||g||, (ii) the fraction of batches in which the QP returns a near-zero solution (defined by ||g|| below a small threshold), and (iii) the distribution of achieved minimum cosine values. These additions will directly address whether the QP reliably produces non-degenerate updates. The 80% negative-cosine rate under the baseline already indicates non-trivial conflict among the five rewards; we will also reference the stress-test experiments to show behavior under stronger conflicts. revision: yes
- Referee: [Abstract] Abstract (QP description): the quadratic program is described only at the level of 'harmonizes them into a single update direction without manually-tuned reward weighting.' The exact objective, constraints (e.g., non-negativity of coefficients, normalization, or explicit positivity of g · ∇_k), and feasibility guarantees are not stated, preventing assessment of behavior under strong gradient conflict as raised by the stress-test note.
Authors: The full QP formulation—including the objective (balancing per-reward gradients to maximize the minimum alignment), constraints (non-negative coefficients that sum to one, with explicit encouragement of positive dot products g · ∇_k), and the amortized solution—is given in Section 3.2 of the manuscript together with the EMA smoothing mechanism. We acknowledge that the abstract is too high-level. In the revision we will expand the abstract to concisely state the QP objective and key constraints, and we will add a short paragraph discussing feasibility and behavior under strong gradient conflict, directly referencing the stress-test results already present in the paper. revision: yes
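The three diagnostics promised in the first response could be aggregated along these lines. This is a sketch only: it assumes the per-batch combined gradient norm and minimum cosine have been logged during training, and the threshold, quantile levels, and function name are hypothetical choices.

```python
import numpy as np

def summarize_batches(update_norms, min_cosines, norm_threshold=1e-3):
    """Summarize per-batch logs of ||g|| and the minimum per-reward cosine."""
    update_norms = np.asarray(update_norms, dtype=float)
    min_cosines = np.asarray(min_cosines, dtype=float)
    levels = [0.05, 0.5, 0.95]
    return {
        # Fraction of batches where the combined update is effectively zero.
        "near_zero_fraction": float((update_norms < norm_threshold).mean()),
        "norm_quantiles": np.quantile(update_norms, levels),
        "min_cosine_quantiles": np.quantile(min_cosines, levels),
        # Fraction of batches where at least one reward is opposed.
        "negative_cosine_fraction": float((min_cosines < 0).mean()),
    }

# Made-up logs for four batches.
stats = summarize_batches([1e-4, 0.5, 1.0, 2.0], [-0.1, 0.2, 0.3, 0.4])
```

Reporting the near-zero fraction next to the negative-cosine fraction separates two cases: updates that are genuinely aligned with every reward, and updates that only look aligned because they are close to zero.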
Circularity Check
No significant circularity in MARBLE's derivation
Full rationale
The paper introduces MARBLE as a gradient-space optimization framework that maintains independent advantage estimators for each reward, computes per-reward policy gradients, and combines them via a standard quadratic program without manually tuned weights. No step reduces by construction to a fitted quantity defined by the target result, a self-citation chain, or an ansatz smuggled from prior work by the same authors. The amortized formulation exploits the known affine structure of the DiffusionNFT loss as an external property, and empirical claims rest on observed improvements over the weighted-sum baseline rather than tautological predictions or renamings. The derivation is self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the loss function used in DiffusionNFT has an affine structure that permits amortization of the multi-reward gradient computation.
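A short sketch of why an affine loss would permit amortization. It assumes, hypothetically, that the loss is affine in the per-sample advantage vector $A$. The exact DiffusionNFT loss is not given on this page, so $a(\theta)$ and $b_i(\theta)$ are placeholder terms, and the coefficients $\alpha_k$ are taken to sum to one.

```latex
\begin{align*}
L(\theta; A) &= a(\theta) + \sum_i A_i\, b_i(\theta), \\
\sum_{k=1}^{K} \alpha_k \nabla_\theta L\big(\theta; A^{(k)}\big)
  &= \Big(\sum_{k} \alpha_k\Big) \nabla_\theta a(\theta)
     + \sum_i \Big(\sum_{k} \alpha_k A^{(k)}_i\Big) \nabla_\theta b_i(\theta) \\
  &= \nabla_\theta L\Big(\theta; \sum_{k} \alpha_k A^{(k)}\Big)
  \quad \text{when } \sum_{k} \alpha_k = 1 .
\end{align*}
```

Under that assumption, once the coefficients are fixed, a single backward pass on the mixed advantages yields the same gradient as combining K separate per-reward gradients. How MARBLE obtains the coefficients without the K separate passes is not described on this page.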