ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control
Pith reviewed 2026-05-10 00:17 UTC · model grok-4.3
The pith
A single diffusion model conditioned on varying preference weights during post-training can match fixed-reward baselines while enabling continuous control over trade-offs at inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training a diffusion model with continuously varying preference weights as a conditioning signal during multi-objective RL post-training lets a single checkpoint approximate the entire Pareto front for competing rewards, so users can select any desired trade-off at inference time without retraining or maintaining multiple models.
What carries the argument
Preference-weight conditioning inside multi-objective RL post-training, which turns a diffusion model into a continuous approximator of the Pareto front.
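To make the mechanism concrete, here is a minimal sketch of preference-weight conditioning, under stated assumptions: the toy `policy` network stands in for the diffusion denoiser, the two quadratic rewards stand in for real reward models, and direct reward backpropagation replaces the paper's RL objective for brevity. None of the names below come from the paper.

```python
# Hedged sketch: preference-weight conditioning during post-training.
# `policy` is a toy stand-in for the diffusion model; the two rewards
# are arbitrary stand-ins for real reward models.
import torch

torch.manual_seed(0)

# Toy "policy": maps (noise, preference weight) -> sample.
policy = torch.nn.Sequential(
    torch.nn.Linear(3, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reward_adherence(x):  # hypothetical objective 1 ("prompt adherence")
    return -((x - 1.0) ** 2).sum(dim=-1)

def reward_fidelity(x):   # hypothetical objective 2 ("source fidelity")
    return -((x + 1.0) ** 2).sum(dim=-1)

for step in range(200):
    w = torch.rand(64, 1)                  # preference weight ~ U[0, 1]
    z = torch.randn(64, 2)                 # "noise" input
    x = policy(torch.cat([z, w], dim=-1))  # w enters as a conditioning input
    wv = w.squeeze(-1)
    # Late scalarization: the weighted sum uses the *sampled* w that the
    # model also sees, so one set of parameters covers the whole range.
    r = wv * reward_adherence(x) + (1 - wv) * reward_fidelity(x)
    loss = -r.mean()                       # maximize the scalarized reward
    opt.zero_grad()
    loss.backward()
    opt.step()

# At inference, the user picks any w in [0, 1] to select a trade-off.
```

The point the sketch isolates is late scalarization: because the weighted sum is formed with the same sampled w the model receives as input, a single checkpoint is trained across the entire trade-off range rather than at one fixed operating point.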
If this is right
- Users can adjust the balance between competing objectives interactively during generation instead of committing at training time.
- Only one model checkpoint needs to be trained and stored instead of a separate model for each desired trade-off.
- Performance on any fixed trade-off point remains competitive with a model trained specifically for that point.
- The same model can be reused across different applications that require different operating points on the same objective set.
Where Pith is reading between the lines
- The same conditioning approach could be tested on other generative families such as autoregressive or transformer-based models.
- Hosting costs drop because only one model is served regardless of how many trade-off points users request.
- The method may extend naturally to three or more simultaneous objectives if the conditioning signal remains low-dimensional.
Load-bearing premise
That feeding a continuous range of preference weights as conditioning during training will produce stable, high-quality generation at every trade-off point rather than leaving gaps or degrading performance.
What would settle it
For a chosen preference weight, compare the conditioned model's reward scores against those of a model trained from scratch with that exact fixed weight; if the conditioned model is consistently worse on the relevant objectives, the claim fails.
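A hedged sketch of that protocol as code. The `conditioned` and `dedicated` callables and the reward functions are hypothetical placeholders for the trained models and reward models; the tolerance `tol` is an added assumption to absorb evaluation noise.

```python
# Hedged sketch of the falsification test described above.
import numpy as np

def evaluate(generate, prompts, reward_fns):
    """Mean score per objective for a generation function."""
    scores = [[fn(generate(p)) for p in prompts] for fn in reward_fns]
    return np.array(scores).mean(axis=1)

def claim_holds_at(conditioned, dedicated, prompts, reward_fns, w_star, tol=0.0):
    # `conditioned` takes (prompt, w); `dedicated` was trained at fixed w_star.
    cond = evaluate(lambda p: conditioned(p, w_star), prompts, reward_fns)
    base = evaluate(dedicated, prompts, reward_fns)
    # The claim fails at w_star if the conditioned model is consistently
    # worse on the relevant objectives.
    return bool((cond >= base - tol).all())
```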
Original abstract
Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of "early scalarization" collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals -- such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. Our single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative goals.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ParetoSlider, a multi-objective RL post-training framework for diffusion models. It conditions a single model on continuously varying preference weights during training to approximate the full Pareto front for conflicting objectives (e.g., prompt adherence vs. source fidelity in image editing). This enables inference-time navigation of trade-offs without retraining or multiple checkpoints. The method is evaluated on three flow-matching backbones (SD3.5, FluxKontext, LTX-2), with the central empirical claim that the single conditioned model matches or exceeds the performance of separately trained fixed-weight baselines while providing fine-grained control.
Significance. If the results hold under the reported evaluation protocol, the work provides a practical solution to the early-scalarization limitation in preference alignment for generative models. It demonstrates that preference conditioning during MORL can cover the Pareto front without evident instabilities, offering inference-time flexibility that fixed baselines lack. The multi-backbone evaluation adds generality to the claim.
minor comments (3)
- The abstract and introduction refer to 'continuously varying preference weights', but the precise sampling distribution and normalization of these weights during training should be clarified in the methods section for reproducibility; one plausible reading is sketched after this list.
- Figure captions and axis labels in the results section could more explicitly indicate which metrics correspond to which objectives to aid interpretation of the Pareto coverage plots.
- The paper would benefit from a brief discussion of potential limitations, such as sensitivity to the choice of reward models or the range of preference weights tested.
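One plausible reading of the sampling question raised above, sketched below: draw the weight vector from a Dirichlet distribution on the probability simplex, so weights are non-negative and sum to one for any number of objectives. The concentration `alpha` (uniform on the simplex at 1.0) is an assumption, not the paper's stated choice.

```python
# Hedged sketch: sampling normalized preference weights on the simplex.
import numpy as np

rng = np.random.default_rng(0)

def sample_preference(k, alpha=1.0):
    """Draw w with w_i >= 0 and sum(w) == 1 for k objectives."""
    return rng.dirichlet(np.full(k, alpha))

w = sample_preference(k=2)        # two objectives: w = (w1, 1 - w1)
assert abs(w.sum() - 1.0) < 1e-9
```

This scheme would also cover the three-or-more-objective case raised earlier, since the conditioning signal stays k-dimensional.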
Simulated Author's Rebuttal
We thank the referee for the positive summary, recognition of the method's practical value for inference-time Pareto navigation, and recommendation for minor revision. No major comments were provided in the report.
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents ParetoSlider as an empirical MORL post-training method that conditions a diffusion model on continuously varying preference weights to approximate the Pareto front. All load-bearing claims rest on direct experimental comparisons to fixed-trade-off baselines across SD3.5, FluxKontext, and LTX-2 backbones, with no equations, predictions, or uniqueness results that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The approach uses standard conditioning in RL fine-tuning and is evaluated against external benchmarks, remaining self-contained without circular reductions.