When Preference Labels Fall Short: Aligning Diffusion Models from Real Data
Pith reviewed 2026-05-20 06:09 UTC · model grok-4.3
The pith
Real images can supply clear preference signals for aligning diffusion models without needing annotated pairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By treating real images as reference points and contrasting them directly with generated or perturbed samples, the curation strategy builds preference signals without annotated pairs. The paper reports that this supervision guides diffusion models toward higher-quality outputs and performs comparably to methods that rely on model-generated preference pairs.
What carries the argument
A curation strategy that uses real images as reference points to build preference signals by contrasting them with generated or perturbed samples.
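As a rough illustration of this curation idea, the Python sketch below pairs each real image (treated as preferred) with a perturbed copy (treated as rejected). Everything in it is an assumption made for illustration: the `PreferencePair` type, the `add_noise` perturbation, and the toy list-of-floats image representation are hypothetical and are not taken from the paper, whose actual pipeline is not reproduced on this page.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Image = List[float]  # toy stand-in for pixel data (assumption)


@dataclass
class PreferencePair:
    prompt: str
    preferred: Image  # the real image, used as the reference point
    rejected: Image   # a generated or perturbed counterpart


def add_noise(image: Image, strength: float, rng: random.Random) -> Image:
    """Hypothetical perturbation: add Gaussian noise to every value."""
    return [value + rng.gauss(0.0, strength) for value in image]


def build_pairs(
    real_data: Sequence[Tuple[str, Image]],
    make_negative: Callable[[Image], Image],
) -> List[PreferencePair]:
    """Pair every real image with a negative derived from it.

    No human annotation is involved: the real image is always the
    preferred side, and the derived sample is always the rejected side.
    """
    return [
        PreferencePair(prompt=prompt, preferred=image, rejected=make_negative(image))
        for prompt, image in real_data
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    data = [("a cat", [0.1, 0.2, 0.3]), ("a dog", [0.5, 0.5, 0.5])]
    for pair in build_pairs(data, lambda img: add_noise(img, 0.1, rng)):
        print(pair.prompt, pair.preferred, pair.rejected)
```

Passing the negative-sample constructor as a callable keeps the sketch agnostic about whether negatives come from a generator or from a perturbation, which mirrors the "generated or perturbed" wording above.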
If this is right
- Real-data supervision achieves alignment performance comparable to existing preference-based methods on diffusion models.
- The approach reduces reliance on manually annotated preference pairs.
- Real data serves as a practical complementary source of supervision for preference alignment.
- The method highlights new directions for label-efficient alignment strategies.
Where Pith is reading between the lines
- The same contrast approach might transfer to other generative architectures that currently depend on synthetic preference data.
- Combining real-image references with a small number of generated pairs could further strengthen the signals.
- Testing the method on domains where real data distributions differ sharply from the target generation task would reveal whether distribution shift limits the gains.
Load-bearing premise
Contrasting real images with generated or perturbed samples produces clear preference signals that reliably point to desirable outputs without new biases from the real data or the perturbation process.
What would settle it
If models aligned via this real-data contrast method receive lower human preference ratings or worse automatic scores than models aligned with standard generated preference pairs on the same benchmarks, the claim of comparable effectiveness would not hold.
Original abstract
Preference alignment aims to guide generative models by learning from comparisons between preferred and non-preferred samples. In practice, most existing approaches rely on preference pairs constructed from model-generated images. Such supervision is inherently relative and can be ambiguous when both samples exhibit artifacts or limited visual quality, making it difficult to infer what constitutes a truly desirable output. In this work, we investigate whether real data can serve as an alternative source of supervision for preference alignment. We adopt a data-centric perspective and study a curation strategy that treats real images as reference points and constructs preference signals by contrasting them with generated or perturbed samples, without requiring manually annotated preference pairs. Through empirical analysis, we show that real-data-based supervision provides effective guidance for aligning diffusion models and achieves performance comparable to existing preference-based methods. Our results suggest that real data offers a practical and complementary source of supervision for preference alignment and highlight directions of label-efficient alignment strategies. Code and models are available at https://cwyxx.github.io/RealAlign.
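To make "learning from comparisons between preferred and non-preferred samples" concrete, here is a minimal Python sketch of a Diffusion-DPO-style per-pair loss, the family of objective this page later mentions. It is a simplification under stated assumptions: the inputs are denoising errors that would come from a policy model and a frozen reference model, the timestep-dependent weighting is folded into a single `beta`, and the function name and default `beta` are hypothetical. It is not the paper's exact training objective.

```python
import math


def preference_loss(
    err_w_policy: float,
    err_w_ref: float,
    err_l_policy: float,
    err_l_ref: float,
    beta: float = 1.0,  # assumed value; absorbs timestep weighting
) -> float:
    """Diffusion-DPO-style loss for one (preferred, rejected) pair.

    Each argument is a squared denoising error at a sampled timestep:
    `w` marks the preferred sample, `l` the rejected one. The loss is
    -log(sigmoid(-beta * margin)), where the margin compares how much
    the policy improves over the reference on each side.
    """
    margin = (err_w_policy - err_w_ref) - (err_l_policy - err_l_ref)
    z = -beta * margin
    # Numerically stable form of -log(sigmoid(z)) = softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(preference_loss(1.0, 1.0, 1.0, 1.0))
    # Policy denoises the preferred sample better and the rejected one worse.
    print(preference_loss(0.5, 1.0, 1.5, 1.0))
```

The loss falls below log 2 only when the policy fits the preferred sample better than the reference does, relative to the rejected sample, which is why the quality of the preferred side matters for what the model learns.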
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that real-data-based supervision, by treating real images as references and constructing preference signals through contrasts with generated or perturbed samples without manual annotations, provides effective guidance for aligning diffusion models and achieves performance comparable to existing preference-based methods, as demonstrated through empirical analysis.
Significance. If the results hold, this offers a practical, label-efficient alternative to preference alignment that mitigates ambiguities in model-generated pairs, potentially enabling more robust supervision for improving diffusion model outputs and complementing existing techniques.
major comments (1)
- The central assumption that contrasting real images with generated or perturbed samples produces unambiguous preference signals that reliably indicate desirable outputs without introducing new biases requires stronger support. The manuscript should include explicit controls, ablations, or analysis addressing potential distribution shift from the real data or artifacts from the perturbation process, as this directly underpins the claim of effective guidance and comparable performance.
minor comments (2)
- The abstract states empirical results but omits specifics on datasets, metrics, baselines, and statistical controls; adding a concise summary of these in the abstract or introduction would improve immediate readability.
- Verify that the linked code repository at https://cwyxx.github.io/RealAlign includes full reproduction scripts, hyperparameters, and data processing details to support the reproducibility of the reported empirical findings.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the single major comment below and describe the revisions we will make to the manuscript.
- Referee: The central assumption that contrasting real images with generated or perturbed samples produces unambiguous preference signals that reliably indicate desirable outputs without introducing new biases requires stronger support. The manuscript should include explicit controls, ablations, or analysis addressing potential distribution shift from the real data or artifacts from the perturbation process, as this directly underpins the claim of effective guidance and comparable performance.
  Authors: We agree that additional targeted analysis would strengthen the support for our central assumption. While the current empirical results demonstrate that real-data supervision yields performance comparable to standard preference methods, we acknowledge that explicit controls for distribution shift and perturbation artifacts are not yet present. In the revised manuscript we will add: (i) quantitative comparisons of low-level statistics (e.g., FID, perceptual distances) between real references and the generated/perturbed negatives, (ii) an ablation varying perturbation strength and type while measuring downstream alignment quality, and (iii) a small-scale human study assessing whether the derived preference signals align with human judgments of visual quality. These additions will directly address potential biases and better substantiate the claim of effective, unambiguous guidance.
  Revision: yes
Circularity Check
No significant circularity in empirical data-driven approach
The paper is an empirical investigation into using real images as reference points for constructing preference signals in diffusion model alignment, without any claimed mathematical derivation chain, equations, or first-principles results. The central claim rests on a curation strategy and experimental comparisons showing comparable performance to preference-based methods, which is presented as data-driven evidence rather than a reduction to fitted parameters or self-referential definitions. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are described in the provided text, and the approach is self-contained against external benchmarks via reported empirical analysis.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: We adopt a data-centric perspective and study a curation strategy that treats real images as reference points and constructs preference signals by contrasting them with generated or perturbed samples... L_stage-1(ϕ) and L_stage-2(θ) using Diffusion-DRO and Diffusion-DPO objectives
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: real-data-based supervision provides effective guidance for aligning diffusion models and achieves performance comparable to existing preference-based methods
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Black, K., Janner, M., Du, Y., Kostrikov, I., and Levine, S. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301,
- [2] Chang, J., Fang, Y., Xing, P., Wu, S., Cheng, W., Wang, R., Zeng, X., Yu, G., and Chen, H.-B. Oneig-bench: Omni-dimensional nuanced evaluation for image generation. arXiv preprint arXiv:2506.07977,
- [3] Clark, K., Vicol, P., Swersky, K., and Fleet, D. J. Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400,
- [4] Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P., and Yu, G. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135,
- [5] Li, J., Cui, Y., Huang, T., Ma, Y., Fan, C., Yang, M., and Zhong, Z. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802,
- [6] Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P., Zhang, D., and Ouyang, W. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470,
- [7] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,
- [8] Shen, X., Li, Z., Yang, Z., Zhang, S., Zhang, Y., Li, D., Wang, C., Lu, Q., and Tang, Y. Directly aligning the full diffusion trajectory with fine-grained human preference. arXiv preprint arXiv:2509.06942,
- [9] URL https://huggingface.co/datasets/wallstoneai/civitai-top-sfw-images-with-metadata. Wang, Y., Zang, Y., Li, H., Jin, C., and Wang, J. Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236,
- [10] Wu, X., Hao, Y., Sun, K., Chen, Y., Zhu, F., Zhao, R., and Li, H. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341,
- [11] Wu, Y.-L., Ruan, B.-K., Tseng, C., and Shuai, H.-H. Ranking-based preference optimization for diffusion models from implicit user feedback. arXiv preprint arXiv:2510.18353,
- [12] Xue, Z., Wu, J., Gao, Y., Kong, F., Zhu, L., Chen, M., Liu, Z., Liu, W., Guo, Q., Huang, W., et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818,
- [13] Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B. K., et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5,
[14]
11 When Preference Labels Fall Short: Aligning Diffusion Models from Real Data A. Implementation Details A.1. Hyperparameters Specification We fine-tune both Stable Diffusion v1.5 (SD-1.5) and Stable Diffusion 3.5 Medium (SD-3.5-M) using LoRA for parameter- efficient adaptation. For SD-1.5, we adopt LoRA with rank r= 4 and scaling factor α= 4 , following ...
work page 2025
-
[15]
Which image do you prefer given the prompt?
is set to −0.001. We train SD-1.5 with a learning rate of 1e−4 for 1600 optimization steps, while SD-3.5-M with a learning rate of 2e−4 for 3200 steps. For sampling xt from the policy model, classifier-free guidance is set to 1.0. SD-1.5 adopts DPMSolver++ with 20 sampling steps, while SD-3.5-M uses FlowMatchEulerDiscreteScheduler with 10 steps. Stage 2: ...
work page 2000
-
[16]
to the real image to extract a saliency map. Next, we use the prompt-conditioned SD-v1.5 Inpainting model3 to regenerate the masked salient regions. Due to the limited generative capability of the inpainting model, the regenerated regions often introduce perceptual artifacts and may not align well with the prompt, resulting in a degraded counterpart. This...
work page 2020
-
[17]
and then negligible-degradation filtering (Ours) provides additional but relatively modest improvements. Overall, the performance improves gradually rather than abruptly as curation is introduced, indicating that the method does not critically depend on highly curated data or specific selection choices. Instead, curation primarily helps stabilize and slig...
work page 2024
-
[18]
The cost is computed under a conservative setting where all 25,096 images are processed before filtering. Step Throughput Total Cost Colorfulness scoring≈4.5 images/s≈1.5 GPU-hours Saliency + Inpainting (50 steps)≈0.14 images/s≈49 GPU-hours PickScore evaluation≈6 pairs/s≈1.2 GPU-hours Total –≈52 GPU-hours data. These signals are general and transferable, ...
work page 2024
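The curation-cost figures quoted in the last excerpt can be sanity-checked with a short calculation. The Python sketch below recomputes GPU-hours from the stated image count and the approximate throughputs. It assumes one PickScore pair per image, which matches the "all images processed before filtering" setting but is not stated explicitly; the names `NUM_IMAGES`, `THROUGHPUT`, and `gpu_hours` are illustrative.

```python
NUM_IMAGES = 25_096  # image count stated in the excerpt

# Approximate throughputs in items per second, copied from the cost table.
# Assumption: PickScore evaluates one pair per image.
THROUGHPUT = {
    "Colorfulness scoring": 4.5,
    "Saliency + Inpainting (50 steps)": 0.14,
    "PickScore evaluation": 6.0,
}


def gpu_hours(num_items: int, items_per_second: float) -> float:
    """Convert an item count and a throughput into GPU-hours."""
    return num_items / items_per_second / 3600.0


costs = {step: gpu_hours(NUM_IMAGES, rate) for step, rate in THROUGHPUT.items()}
total = sum(costs.values())

if __name__ == "__main__":
    for step, hours in costs.items():
        print(f"{step}: {hours:.1f} GPU-hours")
    print(f"Total: {total:.1f} GPU-hours")
```

Recomputed this way, the three steps come to roughly 1.5, 49.8, and 1.2 GPU-hours, for a total near 52.5. That is consistent with the reported ≈52 GPU-hours given that the throughputs are themselves rounded, and it shows that the saliency and inpainting step dominates the cost.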