ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control
Pith reviewed 2026-05-10 00:17 UTC · model grok-4.3
The pith
A single diffusion model conditioned on varying preference weights during post-training can match fixed-reward baselines while enabling continuous control over trade-offs at inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training a diffusion model with continuously varying preference weights as a conditioning signal during multi-objective RL post-training lets a single checkpoint approximate the entire Pareto front for competing rewards, so users can select any desired trade-off at inference time without retraining or maintaining multiple models.
What carries the argument
Preference-weight conditioning inside multi-objective RL post-training, which turns a diffusion model into a continuous approximator of the Pareto front.
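To make the mechanism concrete, here is a minimal sketch of preference-weight conditioning, under stated assumptions: the toy `policy` network stands in for the diffusion denoiser, the two quadratic rewards stand in for real reward models, and direct reward backpropagation replaces the paper's RL objective for brevity. None of the names below come from the paper.

```python
# Hedged sketch: preference-weight conditioning during post-training.
# `policy` is a toy stand-in for the diffusion model; the two rewards
# are arbitrary stand-ins for real reward models.
import torch

torch.manual_seed(0)

# Toy "policy": maps (noise, preference weight) -> sample.
policy = torch.nn.Sequential(
    torch.nn.Linear(3, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reward_adherence(x):  # hypothetical objective 1 ("prompt adherence")
    return -((x - 1.0) ** 2).sum(dim=-1)

def reward_fidelity(x):   # hypothetical objective 2 ("source fidelity")
    return -((x + 1.0) ** 2).sum(dim=-1)

for step in range(200):
    w = torch.rand(64, 1)                  # preference weight ~ U[0, 1]
    z = torch.randn(64, 2)                 # "noise" input
    x = policy(torch.cat([z, w], dim=-1))  # w enters as a conditioning input
    wv = w.squeeze(-1)
    # Late scalarization: the weighted sum uses the *sampled* w that the
    # model also sees, so one set of parameters covers the whole range.
    r = wv * reward_adherence(x) + (1 - wv) * reward_fidelity(x)
    loss = -r.mean()                       # maximize the scalarized reward
    opt.zero_grad()
    loss.backward()
    opt.step()

# At inference, the user picks any w in [0, 1] to select a trade-off.
```

The point the sketch isolates is late scalarization: because the weighted sum is formed with the same sampled w the model receives as input, a single checkpoint is trained across the entire trade-off range rather than at one fixed operating point.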
If this is right
- Users can adjust the balance between competing objectives interactively during generation instead of committing at training time.
- Only one model checkpoint needs to be trained and stored instead of a separate model for each desired trade-off.
- Performance on any fixed trade-off point remains competitive with a model trained specifically for that point.
- The same model can be reused across different applications that require different operating points on the same objective set.
Where Pith is reading between the lines
- The same conditioning approach could be tested on other generative families such as autoregressive or transformer-based models.
- Hosting costs drop because only one model is served regardless of how many trade-off points users request.
- The method may extend naturally to three or more simultaneous objectives if the conditioning signal remains low-dimensional.
Load-bearing premise
That feeding a continuous range of preference weights as conditioning during training will produce stable, high-quality generation at every trade-off point rather than leaving gaps or degrading performance.
What would settle it
For a chosen preference weight, compare the conditioned model's reward scores against those of a model trained from scratch with that exact fixed weight; if the conditioned model is consistently worse on the relevant objectives, the claim fails.
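A hedged sketch of that protocol as code. The `conditioned` and `dedicated` callables and the reward functions are hypothetical placeholders for the trained models and reward models; the tolerance `tol` is an added assumption to absorb evaluation noise.

```python
# Hedged sketch of the falsification test described above.
import numpy as np

def evaluate(generate, prompts, reward_fns):
    """Mean score per objective for a generation function."""
    scores = [[fn(generate(p)) for p in prompts] for fn in reward_fns]
    return np.array(scores).mean(axis=1)

def claim_holds_at(conditioned, dedicated, prompts, reward_fns, w_star, tol=0.0):
    # `conditioned` takes (prompt, w); `dedicated` was trained at fixed w_star.
    cond = evaluate(lambda p: conditioned(p, w_star), prompts, reward_fns)
    base = evaluate(dedicated, prompts, reward_fns)
    # The claim fails at w_star if the conditioned model is consistently
    # worse on the relevant objectives.
    return bool((cond >= base - tol).all())
```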
Original abstract
Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of "early scalarization" collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals -- such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. Our single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative goals.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ParetoSlider, a multi-objective RL post-training framework for diffusion models. It conditions a single model on continuously varying preference weights during training to approximate the full Pareto front for conflicting objectives (e.g., prompt adherence vs. source fidelity in image editing). This enables inference-time navigation of trade-offs without retraining or multiple checkpoints. The method is evaluated on three flow-matching backbones (SD3.5, FluxKontext, LTX-2), with the central empirical claim that the single conditioned model matches or exceeds the performance of separately trained fixed-weight baselines while providing fine-grained control.
Significance. If the results hold under the reported evaluation protocol, the work provides a practical solution to the early-scalarization limitation in preference alignment for generative models. It demonstrates that preference conditioning during MORL can cover the Pareto front without evident instabilities, offering inference-time flexibility that fixed baselines lack. The multi-backbone evaluation adds generality to the claim.
minor comments (3)
- The abstract and introduction refer to 'continuously varying preference weights', but the precise sampling distribution and normalization of these weights during training should be clarified in the methods section for reproducibility; one plausible reading is sketched after this list.
- Figure captions and axis labels in the results section could more explicitly indicate which metrics correspond to which objectives to aid interpretation of the Pareto coverage plots.
- The paper would benefit from a brief discussion of potential limitations, such as sensitivity to the choice of reward models or the range of preference weights tested.
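One plausible reading of the sampling question raised above, sketched below: draw the weight vector from a Dirichlet distribution on the probability simplex, so weights are non-negative and sum to one for any number of objectives. The concentration `alpha` (uniform on the simplex at 1.0) is an assumption, not the paper's stated choice.

```python
# Hedged sketch: sampling normalized preference weights on the simplex.
import numpy as np

rng = np.random.default_rng(0)

def sample_preference(k, alpha=1.0):
    """Draw w with w_i >= 0 and sum(w) == 1 for k objectives."""
    return rng.dirichlet(np.full(k, alpha))

w = sample_preference(k=2)        # two objectives: w = (w1, 1 - w1)
assert abs(w.sum() - 1.0) < 1e-9
```

This scheme would also cover the three-or-more-objective case raised earlier, since the conditioning signal stays k-dimensional.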
Simulated Author's Rebuttal
We thank the referee for the positive summary, recognition of the method's practical value for inference-time Pareto navigation, and recommendation for minor revision. No major comments were provided in the report.
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents ParetoSlider as an empirical MORL post-training method that conditions a diffusion model on continuously varying preference weights to approximate the Pareto front. All load-bearing claims rest on direct experimental comparisons to fixed-trade-off baselines across SD3.5, FluxKontext, and LTX-2 backbones, with no equations, predictions, or uniqueness results that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The approach uses standard conditioning in RL fine-tuning and is evaluated against external benchmarks, remaining self-contained without circular reductions.