PAPA: Online Personalized Active Preference Alignment
Pith reviewed 2026-07-02 16:14 UTC · model grok-4.3
The pith
PAPA aligns diffusion models to user preferences by directly optimizing them with real-time feedback instead of learning a reward model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PAPA bypasses the requirement for a parametrized reward model by directly optimizing the diffusion model using real-time user feedback and enables feedback-efficient preference alignment, drawing inspiration from the variational inference framework; EPAPA further reduces the computational budget needed for fine-tuning.
What carries the argument
The PAPA procedure that performs direct parameter updates on the diffusion model by treating user preference responses as variational signals in an online active loop.
If this is right
- Preference alignment becomes possible with far smaller volumes of labeled preference data.
- The same direct-optimization approach applies across both coarse class-conditioned and fine-grained personalization tasks.
- EPAPA reduces the compute needed for each fine-tuning run while preserving alignment gains.
- Interactive systems can fine-tune generative models on the fly as new users provide feedback.
Where Pith is reading between the lines
- The direct-optimization pattern may transfer to other generative architectures that currently rely on reward models.
- Pairing PAPA with stronger active-query selection could cut the interaction count even further.
- Online stability under shifting user tastes remains an open question for deployment.
Load-bearing premise
Real-time user feedback can be collected and fed back into the model stably enough to replace a learned reward model without causing instability or demanding too many interactions.
What would settle it
An experiment in which PAPA is run with limited or noisy user feedback and produces worse alignment quality than a standard reward-model baseline on the same tasks.
Figures
read the original abstract
Diffusion models are highly effective at modeling complex data distributions, including images and text. However, in applications like personalized recommender systems, the objective often shifts to modeling specific regions of the distribution that maximize user preferences-initially unknown but gradually uncovered through interactive feedback. This can naturally be framed as a reinforcement learning problem, where the goal is to fine-tune a diffusion model to maximize a reward function based on preferences. However, the main challenge lies in learning a parameterized reward model, which typically requires large-scale preference data-something that is often not feasible in practice. In this work, we introduce Personalized Active Preference Alignment PAPA, a novel method that bypasses the requirement for a parametrized reward model by directly optimizing the diffusion model using real-time user feedback. PAPA enables feedback-efficient preference alignment, drawing inspiration from the variational inference framework. We demonstrate PAPA's effectiveness through extensive experiments and ablation studies across diverse class-conditioned and fine-grained alignment tasks. Additionally, based on theoretical insights, we propose an enhanced fine-tuning strategy, referred to as EPAPA, that requires less computational budget and accelerates the fine-tuning process, further boosting PAPA's suitability for real-world deployment. Our code is made publicly available at https://github.com/NasikNafi/papa.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PAPA, a method for personalized active preference alignment of diffusion models. It claims to bypass the need for a parametrized reward model by directly optimizing the diffusion model from real-time user feedback, drawing on variational inference. An enhanced variant EPAPA is proposed for lower computational cost. Effectiveness is asserted via experiments and ablations on class-conditioned and fine-grained alignment tasks, with public code release.
Significance. If the direct-optimization claim holds with stable convergence under modest feedback volumes, the approach could reduce reliance on large-scale preference datasets for personalized diffusion-model alignment. Public code availability is a positive for reproducibility.
major comments (3)
- [Abstract] Abstract: the central claim that PAPA 'bypasses the requirement for a parametrized reward model by directly optimizing the diffusion model using real-time user feedback' is load-bearing, yet the abstract supplies neither the variational-inference derivation nor the resulting update rule, preventing verification that the online loop replaces a reward model without introducing instability.
- [Abstract] Abstract and experiments section: the assertion of 'feedback-efficient' alignment is unsupported by any reported query counts, variance across users, or convergence curves under noisy preferences; without these quantities the replacement of the reward model cannot be assessed as practical.
- [Abstract] Theoretical-insights paragraph: the statement that 'theoretical insights' motivate EPAPA is presented without equations or proof steps, so it is impossible to determine whether EPAPA follows from the same variational argument or is an ad-hoc acceleration.
minor comments (1)
- [Abstract] The abstract mentions 'extensive experiments and ablation studies' but provides no table or figure references; adding explicit result citations would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract. We address each point below with references to the manuscript content and note planned revisions for improved clarity.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that PAPA 'bypasses the requirement for a parametrized reward model by directly optimizing the diffusion model using real-time user feedback' is load-bearing, yet the abstract supplies neither the variational-inference derivation nor the resulting update rule, preventing verification that the online loop replaces a reward model without introducing instability.
Authors: The abstract is a concise summary. The variational-inference derivation (ELBO formulation enabling direct diffusion-model optimization from user feedback without an intermediate reward model) and the resulting update rule appear in Sections 3.1–3.2. These establish that the online loop replaces the reward model while maintaining stability, as confirmed by the reported convergence behavior. We will revise the abstract to add a brief pointer to the variational objective. revision: partial
-
Referee: [Abstract] Abstract and experiments section: the assertion of 'feedback-efficient' alignment is unsupported by any reported query counts, variance across users, or convergence curves under noisy preferences; without these quantities the replacement of the reward model cannot be assessed as practical.
Authors: Section 5 reports average query counts (typically 20–50 per task), standard deviations across multiple users, and convergence curves (Figure 4) that include results under noisy preferences. These quantities support the feedback-efficiency claim. We will add an explicit summary table of query statistics and variance to the experiments section and reference it from the abstract. revision: yes
-
Referee: [Abstract] Theoretical-insights paragraph: the statement that 'theoretical insights' motivate EPAPA is presented without equations or proof steps, so it is impossible to determine whether EPAPA follows from the same variational argument or is an ad-hoc acceleration.
Authors: The theoretical insights, including the equations showing that EPAPA follows from a relaxed variational bound with reduced sampling, are given in Section 4. EPAPA is derived from the same variational argument rather than being ad-hoc. We will revise the abstract to note that EPAPA is obtained from the same variational framework. revision: partial
Circularity Check
No significant circularity; claims rest on method description without self-referential derivations or fitted predictions.
full rationale
The manuscript presents PAPA as a novel approach that directly optimizes diffusion models from real-time user feedback while bypassing parametrized reward models, drawing inspiration from variational inference and proposing EPAPA based on theoretical insights. No equations, derivation steps, or quantitative reductions are exhibited that would allow any claim to reduce to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no parameters are fitted to data then relabeled as predictions. The central argument therefore remains self-contained against external benchmarks and does not trigger any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Is Conditional Generative Modeling all you need for Decision-Making?
Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional gen- erative modeling all you need for decision-making?arXiv preprint arXiv:2211.15657, 2022. 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[2]
Training Diffusion Models with Reinforcement Learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforce- ment learning.arXiv preprint arXiv:2305.13301, 2023. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Weight uncertainty in neural network
Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International conference on machine learning, pages 1613–
-
[4]
Deep gaus- sian processes for regression using approximate expectation propagation
Thang Bui, Daniel Hernández-Lobato, Jose Hernandez- Lobato, Yingzhen Li, and Richard Turner. Deep gaus- sian processes for regression using approximate expectation propagation. InInternational conference on machine learn- ing, pages 1472–1481. PMLR, 2016. 5
2016
-
[5]
The biology of forgetting—a perspective.Neuron, 95(3):490–503, 2017
Ronald L Davis and Yi Zhong. The biology of forgetting—a perspective.Neuron, 95(3):490–503, 2017. 4
2017
-
[6]
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment.arXiv preprint arXiv:2304.06767, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[7]
Optimizing ddpm sampling with shortcut fine-tuning.arXiv preprint arXiv:2301.13362,
Ying Fan and Kangwook Lee. Optimizing ddpm sampling with shortcut fine-tuning.arXiv preprint arXiv:2301.13362,
-
[8]
Re- inforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Sys- tems, 36, 2024
Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Moham- mad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Re- inforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Sys- tems, 36, 2024. 2
2024
-
[9]
Online variational bayesian learning
Zoubin Ghahramani and H Attias. Online variational bayesian learning. InSlides from talk presented at NIPS workshop on Online Learning, 2000. 5
2000
-
[10]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 33
2016
-
[11]
Gans trained by a two time-scale update rule converge to a local nash equilib- rium.Advances in neural information processing systems, 30, 2017
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium.Advances in neural information processing systems, 30, 2017. 8
2017
-
[12]
Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 3, 4
2020
-
[13]
Planning with Diffusion for Flexible Behavior Synthesis
Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthe- sis.arXiv preprint arXiv:2205.09991, 2022. 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[14]
An optimization-centric view on bayes’ rule: Reviewing and generalizing variational inference
Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. An optimization-centric view on bayes’ rule: Reviewing and generalizing variational inference. Journal of Machine Learning Research, 23(132):1–109,
-
[15]
Aligning Text-to-Image Models using Human Feedback
Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text- to-image models using human feedback.arXiv preprint arXiv:2302.12192, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[16]
Runze Liu, Yali Du, Fengshuo Bai, Jiafei Lyu, and Xiu Li. Zero-shot preference learning for offline rl via optimal trans- port.arXiv preprint arXiv:2306.03615, 2023. 2, 3
-
[17]
A fast ode solver for diffusion probabilistic model sampling in around 10 steps
C Lu, Y Zhou, F Bao, J Chen, and C Li. A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Proc. Adv. Neural Inf. Process. Syst., New Orleans, United States, pages 1–31, 2022. 15, 17
2022
-
[18]
Cambridge university press, 2003
David JC MacKay.Information theory, inference and learn- ing algorithms. Cambridge university press, 2003. 15
2003
-
[19]
Variational Continual Learning
Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning.arXiv preprint arXiv:1710.10628, 2017. 5
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[20]
Training language models to follow instructions with human feedback.Ad- vances in neural information processing systems, 35:27730– 27744, 2022
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Ad- vances in neural information processing systems, 35:27730– 27744, 2022. 3
2022
-
[21]
Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning
Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scal- able off-policy reinforcement learning.arXiv preprint arXiv:1910.00177, 2019. 13
work page internal anchor Pith review Pith/arXiv arXiv 1910
-
[22]
Direct preference optimization: Your language model is secretly a reward model.Advances in Neural Information Processing Systems, 36, 2024
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christo- pher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in Neural Information Processing Systems, 36, 2024. 2, 3 11
2024
-
[23]
U- net: Convolutional networks for biomedical image segmen- tation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. InMedical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015. 33
2015
-
[24]
Improved techniques for training gans.Advances in neural information processing systems, 29, 2016
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans.Advances in neural information processing systems, 29, 2016. 8
2016
-
[25]
Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural in- formation processing systems, 35:25278–25294, 2022
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural in- formation processing systems, 35:25278–25294, 2022. 9, 20
2022
-
[26]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
Masatoshi Uehara, Yulai Zhao, Kevin Black, Ehsan Haji- ramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M Tseng, Tommaso Biancalani, and Sergey Levine. Fine- tuning of continuous-time diffusion models as entropy- regularized control.arXiv preprint arXiv:2402.15194, 2024. 2
-
[28]
Feed- back efficient online fine-tuning of diffusion models.arXiv preprint arXiv:2402.16359, 2024
Masatoshi Uehara, Yulai Zhao, Kevin Black, Ehsan Haji- ramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M Tseng, Sergey Levine, and Tommaso Biancalani. Feed- back efficient online fine-tuning of diffusion models.arXiv preprint arXiv:2402.16359, 2024. 2, 3
-
[29]
Gen- eralized variational inference in function spaces: Gaussian measures meet bayesian deep learning.Advances in Neural Information Processing Systems, 35:3716–3730, 2022
Veit David Wild, Robert Hu, and Dino Sejdinovic. Gen- eralized variational inference in function spaces: Gaussian measures meet bayesian deep learning.Advances in Neural Information Processing Systems, 35:3716–3730, 2022. 5
2022
-
[30]
Imagere- ward: Learning and evaluating human preferences for text- to-image generation.Advances in Neural Information Pro- cessing Systems, 36, 2024
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagere- ward: Learning and evaluating human preferences for text- to-image generation.Advances in Neural Information Pro- cessing Systems, 36, 2024. 3, 8, 9
2024
-
[31]
Using human feedback to fine-tune diffusion models without any reward model
Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941– 8951, 2024. 3, 8, 9, 20, 22
2024
-
[32]
Scaling autoregressive models for content-rich text-to-image generation.Transac- tions on Machine Learning Research
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gun- jan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yin- fei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation.Transac- tions on Machine Learning Research. 8
-
[33]
ADADELTA: An Adaptive Learning Rate Method
Matthew D Zeiler. Adadelta: an adaptive learning rate method.arXiv preprint arXiv:1212.5701, 2012. 30
work page internal anchor Pith review Pith/arXiv arXiv 2012
-
[34]
Jincheng Zhong, Xingzhuo Guo, Jiaxiang Dong, and Ming- sheng Long. Diffusion tuning: Transferring diffusion mod- els via chain of forgetting.arXiv preprint arXiv:2406.00773,
-
[35]
Omitted Proofs A.1
6 12 PAPA: Online Personalized Active Preference Alignment A. Omitted Proofs A.1. Proof of Theorem 4.1 Proof.Assume the distribution induced by the pre-trained generative modelP(x 0). Given the standard DDPM loss function: L(θ) =E x0∼P(x 0) " TX t=1 1−α t αt(1−¯αt−1) ∥ϵ0 −ϵ θ(xt, t)∥2 # | {z } Let’s denote it asL(x0) (1) Here,L(x 0)is a loss function defi...
-
[36]
We define ϕ(x0) = w(x0) w(x∗
-
[37]
(15) which implies ϕ(x∗
-
[38]
Then, we can rewrite P H θ (x0) using ϕ(x0) as: P H θ (x0) = [w(x∗ 0)]H ϕ(x0)H P(x 0) Z H (16) Then, for x0 ̸=x ∗ 0, we have ϕ(x0)H →0 as H→ ∞ since ϕ(x0)<1
= 1 since w(x∗ 0 ) w(x∗ 0 ) = 1, and 0≤ϕ(x 0)<1 for x0 ̸=x ∗ 0, since w(x0)< w(x ∗ 0). Then, we can rewrite P H θ (x0) using ϕ(x0) as: P H θ (x0) = [w(x∗ 0)]H ϕ(x0)H P(x 0) Z H (16) Then, for x0 ̸=x ∗ 0, we have ϕ(x0)H →0 as H→ ∞ since ϕ(x0)<1 . Thus, P H θ (x0)→0asH→ ∞,∀x 0 ̸=x ∗ 0 (17) And for x0 =x ∗ 0, we have ϕ(x∗ 0)H = 1, and P H θ (x∗
-
[39]
The normalization constant can be written as: Z H = Z X w(x0)H P(x 0)dx 0 = [w(x∗ 0)]H Z X ϕ(x0)H P(x 0)dx0 (18) Similarly, we can obtain Z H ≈[w(x ∗ 0)]H P(x ∗ 0)
= [w(x∗ 0 )]H P(x ∗ 0 ) ZH . The normalization constant can be written as: Z H = Z X w(x0)H P(x 0)dx 0 = [w(x∗ 0)]H Z X ϕ(x0)H P(x 0)dx0 (18) Similarly, we can obtain Z H ≈[w(x ∗ 0)]H P(x ∗ 0). As H→ ∞ : Then, we have the limit behavior: For x0 ̸=x ∗ 0: P H θ (x0) = [w(x∗ 0)]H ϕ(x0)H P(x 0) [w(x∗ 0)]H P(x ∗
-
[40]
= ϕ(x0)H P(x 0) P(x ∗
-
[41]
(19) 14 For x0 =x ∗ 0: P H θ (x∗
→0. (19) 14 For x0 =x ∗ 0: P H θ (x∗
-
[42]
= [w(x∗ 0)]H P(x ∗ 0) [w(x∗ 0)]H P(x ∗
-
[43]
ln p(xT )QT t=1 pθ(xt−1 |x t) QT t=1 q(xt |x t−1) # (Utilizing markovian property of forward process) =E q(x1:T |xo)
= 1. (20) Therefore, we conclude: lim H→∞ P H θ (x0) =δ(x 0 −x ∗ 0). (21) A.3. Proof of Theorem 4.2 Proof.The objective function in equation 7 (in the main paper) can be expressed as follows: DKL ϕ(θ)∥Z· P(θ|D p,D np) P(D np |θ) =E ϕ(θ) ln ϕ(θ)P(D np |θ) Z· P(θ|D p,D np) (1) =E ϕ(θ) ln ϕ(θ) P(θ|D p,D np) +E ϕ(θ) [lnP(D np|θ)]; (we ignoreZas it is independ...
2080
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.