Pareto-Guided Optimal Transport for Multi-Reward Alignment
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-14 19:19 UTC · model grok-4.3
The pith
PG-OT builds prompt-specific Pareto frontiers and applies distribution-aware optimal transport to improve multi-reward alignment while introducing JDR and JCR metrics to measure synergy and hacking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experimental results show that our approach outperforms strong baselines with an 11% gain in JDR and achieves a near 80% win rate in human evaluations.
Load-bearing premise
That a prompt-specific Pareto frontier can be constructed reliably from the available reward models and that mapping samples to it via optimal transport will consistently reduce reward hacking without introducing new instabilities or excessive compute cost.
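To make this premise concrete, the sketch below shows the two ingredients it names in plain numpy: a per-prompt Pareto frontier obtained by non-dominated filtering of reward vectors, and an entropic (Sinkhorn-style) transport plan that pulls dominated samples toward frontier points. This is a minimal reading of the premise under stated assumptions (squared-Euclidean cost in reward space, uniform marginals, fixed regularization eps, mapping done in reward space rather than image or parameter space); the function names are illustrative and none of this is the paper's PG-OT implementation.

```python
import numpy as np

def pareto_frontier(rewards: np.ndarray) -> np.ndarray:
    """Indices of non-dominated rows; rewards has shape (n_candidates, n_rewards)."""
    keep = []
    for i in range(rewards.shape[0]):
        better_eq = np.all(rewards >= rewards[i], axis=1)
        strictly = np.any(rewards > rewards[i], axis=1)
        if not np.any(better_eq & strictly):        # nothing dominates candidate i
            keep.append(i)
    return np.asarray(keep)

def sinkhorn_plan(cost: np.ndarray, eps: float = 0.1, iters: int = 200) -> np.ndarray:
    """Entropic OT plan between uniform marginals via plain Sinkhorn iterations."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / (eps * cost.max() + 1e-12))  # scale-normalized kernel for stability
    u, v = np.ones(n), np.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def map_to_frontier(rewards: np.ndarray) -> np.ndarray:
    """Barycentric reward-space targets on the frontier for dominated candidates."""
    front = pareto_frontier(rewards)
    dominated = np.setdiff1d(np.arange(rewards.shape[0]), front)
    targets = rewards.copy()
    if dominated.size == 0:
        return targets                              # every candidate is already non-dominated
    cost = ((rewards[dominated, None, :] - rewards[None, front, :]) ** 2).sum(-1)
    plan = sinkhorn_plan(cost)
    # Row-normalized plan gives, per dominated candidate, a mix of frontier points.
    targets[dominated] = (plan / plan.sum(axis=1, keepdims=True)) @ rewards[front]
    return targets

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.normal(size=(32, 3))   # 32 candidates for one prompt, 3 reward models
    moved = map_to_frontier(scores)
    print("frontier size:", pareto_frontier(scores).size)
```

The expensive step in such a pipeline is the per-prompt Sinkhorn loop, whose cost scales with the number of dominated candidates times frontier points, which is exactly the overhead the referee report below asks to see quantified.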
Original abstract
Text-to-image generation models have achieved remarkable progress in preference optimization, yet achieving robust alignment across diverse reward models remains a significant challenge. Existing multi-reward fusion approaches rely on weighted summation, which is costly to tune and insufficient for balancing conflicting objectives. More critically, optimization with reward models is highly susceptible to reward hacking, where reward scores increase while the perceived quality of generated images deteriorates. We demonstrate that optimizing against a unified global target under heterogeneous reward upper bounds can induce reward hacking, a risk further exacerbated by the inherent instability of weak reward models. To mitigate this, we propose a Pareto Frontier-Guided Optimal Transport (PG-OT) framework. Our method constructs a prompt-specific Pareto frontier and maps dominated samples toward it via distribution-aware optimal transport. Furthermore, we develop both online and offline optimization strategies tailored to diverse reward signal characteristics. To provide a more rigorous assessment, we introduce the Joint Domination Rate (JDR) and Joint Collapse Rate (JCR) as principled metrics to quantify multi-reward synergy and reward hacking. Experimental results show that our approach outperforms strong baselines with an 11% gain in JDR and achieves a near 80% win rate in human evaluations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Pareto Frontier-Guided Optimal Transport (PG-OT) for multi-reward alignment in text-to-image models. It constructs a prompt-specific Pareto frontier from available reward models and maps generated samples to this frontier via distribution-aware optimal transport, with online and offline optimization variants. New metrics Joint Domination Rate (JDR) and Joint Collapse Rate (JCR) are introduced to quantify multi-reward synergy and reward hacking. Experiments report an 11% JDR gain over baselines and a near-80% win rate in human evaluations.
Significance. If the central claims hold, the framework offers a principled alternative to weighted-sum reward fusion that directly targets non-dominated trade-offs, which could improve robustness to reward hacking in heterogeneous multi-objective settings. The JDR/JCR metrics provide a more structured evaluation lens than single-reward scores.
Major comments (3)
- [Abstract / Methods] The 11% JDR gain and ~80% human win rate rest on reliable construction of prompt-specific Pareto frontiers from the given reward models. No quantitative diagnostics (frontier coverage, sensitivity to reward scaling or correlations, or ablations on frontier quality) are reported, leaving open whether the gains are artifacts of the particular reward set rather than a general property of PG-OT; a diagnostic sketch of these checks follows this list.
- [Experiments] The human-study win rate lacks reported details on the number of raters, inter-rater agreement, prompt sampling procedure, and statistical controls. Without these, it is difficult to assess whether the 80% figure generalizes beyond the tested prompts or is inflated by evaluation confounds.
- [Methods, online/offline strategies] The mapping via optimal transport is claimed to reduce reward hacking without introducing new instabilities or excessive cost, yet no analysis of computational overhead, convergence behavior under weak reward models, or comparison of online vs. offline variants on JCR is provided to support this.
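The checks requested in the first and third comments can be sketched without the authors' code. The hypothetical numpy snippet below measures frontier coverage (the fraction of a prompt's candidates that are non-dominated) and the overlap of the frontier before and after rescaling one reward. Since Pareto dominance depends only on per-coordinate order, the overlap should be 1.0 for a purely dominance-based frontier, so the scale sensitivity the comment worries about would live in the transport cost rather than in the frontier itself; the helper names and the rescaling scheme are assumptions, not the paper's procedure.

```python
import numpy as np

def pareto_mask(rewards: np.ndarray) -> np.ndarray:
    """Boolean mask of non-dominated rows; rewards has shape (n_candidates, n_rewards)."""
    n = rewards.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        dominates_i = np.all(rewards >= rewards[i], axis=1) & np.any(rewards > rewards[i], axis=1)
        mask[i] = not dominates_i.any()
    return mask

def frontier_coverage(rewards: np.ndarray) -> float:
    """Fraction of a prompt's candidates that sit on the Pareto frontier."""
    return float(pareto_mask(rewards).mean())

def rescaling_overlap(rewards: np.ndarray, k: int = 0, scale: float = 10.0) -> float:
    """Jaccard overlap of the frontier before/after an affine rescaling of reward k.

    Dominance depends only on per-coordinate order, so this should return 1.0 for a
    dominance-based frontier; the scale-sensitive part of PG-OT would be the transport
    cost, which this sanity check deliberately leaves out.
    """
    before = set(np.flatnonzero(pareto_mask(rewards)))
    rescaled = rewards.copy()
    rescaled[:, k] = scale * rescaled[:, k] + 3.0
    after = set(np.flatnonzero(pareto_mask(rescaled)))
    return len(before & after) / max(len(before | after), 1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    r = rng.normal(size=(64, 3))        # 64 candidates scored by 3 reward models
    print("frontier coverage:", frontier_coverage(r))
    print("overlap under rescaling:", rescaling_overlap(r))
```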
Minor comments (2)
- [Methods] Notation for the Pareto frontier and the transport plan should be introduced with an explicit equation early in the Methods section for clarity; one possible form is sketched after this list.
- [Related Work] The paper should cite prior work on Pareto optimization in multi-objective RL and optimal transport applications in generative modeling.
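As an illustration of what such an explicit equation could look like (the symbols $\mathcal{R}(x)$, $\mathcal{F}(x)$, $\mu_x$, $\nu_x$, $C$, and $\epsilon$ are introduced here for the sketch and are not taken from the manuscript):

```latex
% Hypothetical notation sketch, not the paper's definitions.
% R(x): reward vectors of candidates for prompt x; F(x): its Pareto frontier;
% gamma*: entropic OT plan from dominated samples (mu_x) to frontier points (nu_x).
\begin{align}
  \mathcal{F}(x) &= \bigl\{\, r \in \mathcal{R}(x) : \nexists\, r' \in \mathcal{R}(x),\ r' \succ r \,\bigr\}, \\
  \gamma^{*}     &= \operatorname*{arg\,min}_{\gamma \in \Pi(\mu_x,\, \nu_x)} \ \langle \gamma, C \rangle + \epsilon\, H(\gamma).
\end{align}
```

Here $r' \succ r$ denotes Pareto domination of $r$ by $r'$, and $H(\gamma)$ is the entropic regularizer familiar from Sinkhorn-style optimal transport.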
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly.
Point-by-point responses
- Referee [Abstract / Methods]: The 11% JDR gain and ~80% human win rate rest on reliable construction of prompt-specific Pareto frontiers from the given reward models. No quantitative diagnostics (frontier coverage, sensitivity to reward scaling or correlations, or ablations on frontier quality) are reported, leaving open whether the gains are artifacts of the particular reward set rather than a general property of PG-OT.
Authors: We acknowledge that additional diagnostics on Pareto-frontier construction would strengthen the presentation. In the revised manuscript we will add quantitative metrics for frontier coverage, a sensitivity analysis to reward scaling and correlations, and an ablation on frontier quality obtained by varying the reward-model subset. These additions will help confirm that the reported gains are not artifacts of the specific reward set. Revision: yes.
- Referee [Experiments]: The human-study win rate lacks reported details on the number of raters, inter-rater agreement, prompt sampling procedure, and statistical controls. Without these, it is difficult to assess whether the 80% figure generalizes beyond the tested prompts or is inflated by evaluation confounds.
Authors: We agree that these experimental details are necessary for a proper evaluation of the human-study results. In the revised manuscript we will expand the Experiments section to report the number of raters, inter-rater agreement, prompt sampling procedure, and the statistical controls employed. Revision: yes.
- Referee [Methods, online/offline strategies]: The mapping via optimal transport is claimed to reduce reward hacking without introducing new instabilities or excessive cost, yet no analysis of computational overhead, convergence behavior under weak reward models, or comparison of online vs. offline variants on JCR is provided to support this.
Authors: We note that the current manuscript already shows JCR reductions for the proposed method, but we did not include an explicit overhead or convergence analysis. In the revision we will add runtime benchmarks, convergence plots under varying reward-model quality, and a side-by-side JCR comparison of the online and offline variants. Revision: yes.
Circularity Check
No significant circularity in the derivation chain
Full rationale
The paper introduces the PG-OT framework by constructing prompt-specific Pareto frontiers from reward models and applying distribution-aware optimal transport to map samples, along with online/offline optimization variants. It defines the new metrics JDR and JCR independently to quantify multi-reward synergy and reward hacking. The reported 11% JDR gain and human win rates are presented as empirical results from applying the method to baselines, not as quantities that define or are fitted into the method itself. No equations reduce by construction to inputs, no predictions are statistically forced from fits, and no load-bearing self-citations or uniqueness theorems are invoked in the provided text. The derivation from method description to metrics to experimental outcomes remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "We propose a Pareto Frontier-Guided Optimal Transport (PG-OT) framework. Our method constructs a prompt-specific Pareto frontier and maps dominated samples toward it via distribution-aware optimal transport."
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · tag: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: $\mathrm{JDR}_K = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\bigl(R_i \succ R_{i,b}\bigr)$; $\mathrm{JCR}_K = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\bigl(R_{i,b} \succ R_i\bigr)$
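Read literally, the quoted definitions count, over N paired samples, how often the method's K-dimensional reward vector Pareto-dominates the baseline's (JDR) and how often the reverse holds (JCR). The sketch below computes both under the assumption that ≻ denotes strict Pareto domination across all K rewards; the function names and that reading of ≻ are assumptions, not the paper's code.

```python
import numpy as np

def dominates(a: np.ndarray, b: np.ndarray) -> bool:
    """a ≻ b: a is at least as good on every reward and strictly better on at least one."""
    return bool(np.all(a >= b) and np.any(a > b))

def jdr_jcr(method: np.ndarray, baseline: np.ndarray) -> tuple[float, float]:
    """JDR_K and JCR_K over N paired samples; both score arrays have shape (N, K)."""
    n = method.shape[0]
    jdr = sum(dominates(method[i], baseline[i]) for i in range(n)) / n
    jcr = sum(dominates(baseline[i], method[i]) for i in range(n)) / n
    return jdr, jcr

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    base = rng.normal(size=(1000, 3))                     # baseline scores, K = 3 reward models
    ours = base + rng.normal(0.3, 0.5, size=base.shape)   # hypothetical improved scores
    jdr, jcr = jdr_jcr(ours, base)
    print(f"JDR = {jdr:.2f}, JCR = {jcr:.2f}")            # a good method: high JDR, low JCR
```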
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.