
arxiv: 2605.19839 · v1 · pith:XUR6AB6Bnew · submitted 2026-05-19 · 💻 cs.CV

When Preference Labels Fall Short: Aligning Diffusion Models from Real Data

Pith reviewed 2026-05-20 06:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords: preference alignment, diffusion models, real data supervision, image generation, label-efficient alignment, generative models, data curation

The pith

Real images can supply clear preference signals for aligning diffusion models without needing annotated pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether real photographs can replace the usual preference labels when training diffusion models to produce better images. Instead of comparing two flawed generated samples, it sets real images as the good reference and builds signals by pitting them against model outputs or slightly altered versions. Experiments show this real-data route matches the results of standard preference methods while avoiding the ambiguity that arises when both samples look poor. The work points toward using existing real data collections to make alignment more practical and less dependent on new human labels.

Core claim

By treating real images as reference points and constructing preference signals through direct contrast with generated or perturbed samples, the curation strategy supplies effective supervision that guides diffusion models toward higher-quality outputs and reaches performance levels comparable to methods that rely on model-generated preference pairs.

What carries the argument

A curation strategy that uses real images as reference points to build preference signals by contrasting them with generated or perturbed samples.
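The curation idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `PreferencePair`, `add_noise`, and `build_pairs` are hypothetical names, images are toy nested lists of grayscale values, and simple pixel noise stands in for whatever generated or perturbed sample the paper actually uses.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Toy grayscale image: rows of pixel intensities in [0, 1] (an assumption for illustration).
Image = List[List[float]]


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    preferred: Image     # the real image, used as the reference
    dispreferred: Image  # a generated or perturbed counterpart


def add_noise(image: Image, rng: random.Random, strength: float = 0.2) -> Image:
    """Hypothetical stand-in perturbation: clipped uniform noise on every pixel."""
    return [
        [min(1.0, max(0.0, px + rng.uniform(-strength, strength))) for px in row]
        for row in image
    ]


def build_pairs(
    prompts: Sequence[str],
    real_images: Sequence[Image],
    degrade: Callable[[Image, random.Random], Image],
    seed: int = 0,
) -> List[PreferencePair]:
    """Pair each real image (preferred) with a degraded copy (dispreferred)."""
    if len(prompts) != len(real_images):
        raise ValueError("prompts and real_images must have the same length")
    rng = random.Random(seed)
    return [
        PreferencePair(prompt, image, degrade(image, rng))
        for prompt, image in zip(prompts, real_images)
    ]
```

No human label appears anywhere in the construction: the preference direction is fixed by which sample is real, which is the property the summary above attributes to the method.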

If this is right

  • Real-data supervision achieves alignment performance comparable to existing preference-based methods on diffusion models.
  • The approach reduces reliance on manually annotated preference pairs.
  • Real data serves as a practical complementary source of supervision for preference alignment.
  • The method highlights new directions for label-efficient alignment strategies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contrast approach might transfer to other generative architectures that currently depend on synthetic preference data.
  • Combining real-image references with a small number of generated pairs could further strengthen the signals.
  • Testing the method on domains where real data distributions differ sharply from the target generation task would reveal whether distribution shift limits the gains.

Load-bearing premise

Contrasting real images with generated or perturbed samples produces clear preference signals that reliably point to desirable outputs without new biases from the real data or the perturbation process.

What would settle it

If models aligned via this real-data contrast method receive lower human preference ratings or worse automatic scores than models aligned with standard generated preference pairs on the same benchmarks, the claim of comparable effectiveness would not hold.
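One way to make such a comparison concrete is a pairwise win rate with an exact sign test. The sketch below is an assumption about how a reader might score a head-to-head user study, not a procedure taken from the paper; `win_rate` and `sign_test_p_value` are hypothetical helper names, and any vote counts passed to them would have to come from an actual study.

```python
from math import comb


def win_rate(wins: int, losses: int) -> float:
    """Fraction of non-tied comparisons won by the real-data-aligned model."""
    total = wins + losses
    if total == 0:
        raise ValueError("need at least one non-tied comparison")
    return wins / total


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact binomial (sign) test against a 50/50 null; ties are excluded."""
    n = wins + losses
    k = max(wins, losses)
    # Probability of a split at least this lopsided in one direction under the null.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A win rate clearly below 0.5 with a small p-value would count against the claim of comparable effectiveness; a rate near 0.5 would be consistent with it.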

Figures

Figures reproduced from arXiv: 2605.19839 by Ibrahim Radwan, Liang Lin, Pengxu Wei, Weijian Deng, Weijie Tu, Weiyan Chen, Yao Xiao, ZiYi Dong.

Figure 1: Preference pairs from Pick-a-Pic v2. The left group shows preferred images with local generation artifacts, while the right presents preferred images with unnatural global color. These cases highlight limitations of preference-based supervision in capturing holistic image quality.
… samples, they are less affected by generation artifacts and exhibit broader visual diversity. Building on this observation, we…
Figure 2: Analysis of preference- and reward-based alignment behaviors. (a) Comparison of realism and texture detail across different preference-based methods on SD-1.5. Methods optimized using pairwise preferences often improve specific visual aspects (e.g., smoothness or texture consistency) but do not consistently yield balanced gains in overall realism across diverse prompts. (b) Comparison of human preference…
Figure 3: Examples of preference pairs derived from real images. Red contours indicate salient regions where controlled inpainting introduces localized artifacts. The original images act as preferred references, while the degraded counterparts expose interpretable deviations in texture, structure, or semantics, providing effective supervision for preference alignment without labeling.
… first step is to construct a…
Figure 4: User study on SD-3.5-M. Following the protocol of Diffusion-DRO (Wu et al., 2025), we randomly sample 60 prompts from HPDv2 and ask users to compare our fine-tuned SD-3.5-M with baselines. Across both comparisons, users prefer our model
Figure 5: Complementarity with existing preference alignment models. Top row: Quantitative results on Pick-a-Pic v2 using SD-1.5 as the base model. Real-data-based supervision is integrated with Diffusion-DPO. Bottom row: Quantitative results on DrawBench using SD-3.5-M as the base model. Real-data-based supervision is used as a complementary post-training step on top of FlowGRPO.
Figure 6: Qualitative comparison based on SD-1.5. Post-training with real data produces images with improved visual realism and richer texture details. When applied as an additional post-training stage on top of Diffusion-DPO (Wallace et al., 2024), it also improves the visual realism of the resulting generations. Prompts from top to bottom: (1) a plant. (2) a woman sitting on a table drinking coffee, long shot, w…
Figure 8: Evaluation results of ours with varying numbers of constructed preference pairs. Increasing the dataset size from 256 to 512 leads to overall improvement, whereas further increasing the size yields diminishing returns. Results are reported on the Pick-a-Pic v2 test set with SD-1.5 as the base model.
… performance is more sensitive to the quality of selected real images than to sheer quantity, and that a rela…
Figure 9: Framework for saliency-guided construction of contrastive samples. Given a real image, we extract the salient regions using U²-Net (Qin et al., 2020). A prompt-conditioned inpainting model (SD v1.5 Inpainting) regenerates salient regions according to the caption to produce a corresponding degraded counterpart. The resulting discrepancies are localized to salient regions, while the background and global l…
Figure 10: Complementarity with existing preference alignment models. The evaluation is conducted on Parti-Prompts. Top row: Results based on SD-1.5, where real-data-based supervision is integrated with Diffusion-DPO. Bottom row: Results based on SD-3.5-M, where real-data-based supervision is used as a complementary post-training step on top of FlowGRPO
Figure 11: Qualitative comparison based on SD-1.5. Post-training with real data produces images with improved visual realism and richer texture details. When applied as an additional post-training stage on top of Diffusion-DPO (Wallace et al., 2024), it also improves the visual realism of the resulting generations. Prompts from top to bottom: (1) a TV. (2) a pineapple. (3) a tiger cow. (4) 30 year old short slim ma…
Figure 12: Qualitative comparison based on SD-3.5-M. Compared to FlowGRPO (Liu et al., 2025), post-training with real data yields more realistic lighting and more natural color distributions. When applied as an additional post-training step on top of FlowGRPO, it further alleviates stylistic homogenization and enhances texture details. Prompts from top to bottom: (1) Cat with a top hat on a bean bag. (2) cute waifu…
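The saliency-guided construction described in the Figure 9 caption can be illustrated with a small compositing sketch. It assumes a saliency map and a regenerated image are already available (the saliency network and the inpainting model are not called here), and it uses toy nested lists in place of real image tensors. `binarize` and `composite` are hypothetical helper names, and the 0.5 threshold is an arbitrary choice for illustration.

```python
from typing import List

Image = List[List[float]]  # toy grayscale image with values in [0, 1]
Mask = List[List[bool]]


def binarize(saliency: Image, threshold: float = 0.5) -> Mask:
    """Turn a soft saliency map into a binary mask of salient pixels."""
    return [[value >= threshold for value in row] for row in saliency]


def composite(real: Image, regenerated: Image, mask: Mask) -> Image:
    """Take regenerated pixels inside the mask and real pixels elsewhere,
    so any degradation stays confined to the salient region."""
    if not (len(real) == len(regenerated) == len(mask)):
        raise ValueError("real, regenerated, and mask must have the same height")
    degraded = []
    for real_row, gen_row, mask_row in zip(real, regenerated, mask):
        if not (len(real_row) == len(gen_row) == len(mask_row)):
            raise ValueError("real, regenerated, and mask must have the same width")
        degraded.append(
            [gen if salient else orig
             for orig, gen, salient in zip(real_row, gen_row, mask_row)]
        )
    return degraded
```

Because pixels outside the mask are copied from the real image, the pair (real, degraded) differs only where the mask is set, which matches the caption's statement that discrepancies are localized to salient regions.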
Original abstract

Preference alignment aims to guide generative models by learning from comparisons between preferred and non-preferred samples. In practice, most existing approaches rely on preference pairs constructed from model-generated images. Such supervision is inherently relative and can be ambiguous when both samples exhibit artifacts or limited visual quality, making it difficult to infer what constitutes a truly desirable output. In this work, we investigate whether real data can serve as an alternative source of supervision for preference alignment. We adopt a data-centric perspective and study a curation strategy that treats real images as reference points and constructs preference signals by contrasting them with generated or perturbed samples, without requiring manually annotated preference pairs. Through empirical analysis, we show that real-data-based supervision provides effective guidance for aligning diffusion models and achieves performance comparable to existing preference-based methods. Our results suggest that real data offers a practical and complementary source of supervision for preference alignment and highlight directions of label-efficient alignment strategies. Code and models are available at https://cwyxx.github.io/RealAlign.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that real-data-based supervision, by treating real images as references and constructing preference signals through contrasts with generated or perturbed samples without manual annotations, provides effective guidance for aligning diffusion models and achieves performance comparable to existing preference-based methods, as demonstrated through empirical analysis.

Significance. If the results hold, this offers a practical, label-efficient alternative to preference alignment that mitigates ambiguities in model-generated pairs, potentially enabling more robust supervision for improving diffusion model outputs and complementing existing techniques.

major comments (1)
  1. The central assumption that contrasting real images with generated or perturbed samples produces unambiguous preference signals that reliably indicate desirable outputs without introducing new biases requires stronger support. The manuscript should include explicit controls, ablations, or analysis addressing potential distribution shift from the real data or artifacts from the perturbation process, as this directly underpins the claim of effective guidance and comparable performance.
minor comments (2)
  1. The abstract states empirical results but omits specifics on datasets, metrics, baselines, and statistical controls; adding a concise summary of these in the abstract or introduction would improve immediate readability.
  2. Verify that the linked code repository at https://cwyxx.github.io/RealAlign includes full reproduction scripts, hyperparameters, and data processing details to support the reproducibility of the reported empirical findings.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below and describe the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: The central assumption that contrasting real images with generated or perturbed samples produces unambiguous preference signals that reliably indicate desirable outputs without introducing new biases requires stronger support. The manuscript should include explicit controls, ablations, or analysis addressing potential distribution shift from the real data or artifacts from the perturbation process, as this directly underpins the claim of effective guidance and comparable performance.

    Authors: We agree that additional targeted analysis would strengthen the support for our central assumption. While the current empirical results demonstrate that real-data supervision yields performance comparable to standard preference methods, we acknowledge that explicit controls for distribution shift and perturbation artifacts are not yet present. In the revised manuscript we will add: (i) quantitative comparisons of low-level statistics (e.g., FID, perceptual distances) between real references and the generated/perturbed negatives, (ii) an ablation varying perturbation strength and type while measuring downstream alignment quality, and (iii) a small-scale human study assessing whether the derived preference signals align with human judgments of visual quality. These additions will directly address potential biases and better substantiate the claim of effective, unambiguous guidance. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical data-driven approach

full rationale

The paper is an empirical investigation into using real images as reference points for constructing preference signals in diffusion model alignment, without any claimed mathematical derivation chain, equations, or first-principles results. The central claim rests on a curation strategy and experimental comparisons showing comparable performance to preference-based methods, which is presented as data-driven evidence rather than a reduction to fitted parameters or self-referential definitions. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are described in the provided text, and the approach is self-contained against external benchmarks via reported empirical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities can be extracted. The contribution is framed as an empirical data-curation study rather than a theoretical derivation.

pith-pipeline@v0.9.0 · 5719 in / 1058 out tokens · 37968 ms · 2026-05-20T06:09:01.898262+00:00 · methodology



Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 10 internal anchors

  1. [1]

    Training Diffusion Models with Reinforcement Learning

    Black, K., Janner, M., Du, Y., Kostrikov, I., and Levine, S. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301,

  2. [2]

    Oneig-bench: Omni-dimensional nuanced evaluation for image generation

    Chang, J., Fang, Y., Xing, P., Wu, S., Cheng, W., Wang, R., Zeng, X., Yu, G., and Chen, H.-B. Oneig-bench: Omni-dimensional nuanced evaluation for image generation. arXiv preprint arXiv:2506.07977,

  3. [3]

    Clark, K., Vicol, P., Swersky, K., and Fleet, D. J. Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400,

  4. [4]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P., and Yu, G. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135,

  5. [5]

    MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    Li, J., Cui, Y., Huang, T., Ma, Y., Fan, C., Yang, M., and Zhong, Z. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802,

  6. [6]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P., Zhang, D., and Ouyang, W. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470,

  7. [7]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  8. [8]

    Directly aligning the full diffusion trajectory with fine-grained human preference

    Shen, X., Li, Z., Yang, Z., Zhang, S., Zhang, Y., Li, D., Wang, C., Lu, Q., and Tang, Y. Directly aligning the full diffusion trajectory with fine-grained human preference. arXiv preprint arXiv:2509.06942,

  9. [9]

    Unified Reward Model for Multimodal Understanding and Generation

    URL https://huggingface.co/datasets/wallstoneai/civitai-top-sfw-images-with-metadata. Wang, Y., Zang, Y., Li, H., Jin, C., and Wang, J. Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236,

  10. [10]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Wu, X., Hao, Y., Sun, K., Chen, Y., Zhu, F., Zhao, R., and Li, H. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341,

  11. [11]

    Ranking-based preference optimization for diffusion models from implicit user feedback

    Wu, Y.-L., Ruan, B.-K., Tseng, C., and Shuai, H.-H. Ranking-based preference optimization for diffusion models from implicit user feedback. arXiv preprint arXiv:2510.18353,

  12. [12]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Xue, Z., Wu, J., Gao, Y., Kong, F., Zhu, L., Chen, M., Liu, Z., Liu, W., Guo, Q., Huang, W., et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818,

  13. [13]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B. K., et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5,

  14. [14]

    Implementation Details A.1

    A. Implementation Details. A.1. Hyperparameters Specification. We fine-tune both Stable Diffusion v1.5 (SD-1.5) and Stable Diffusion 3.5 Medium (SD-3.5-M) using LoRA for parameter-efficient adaptation. For SD-1.5, we adopt LoRA with rank r = 4 and scaling factor α = 4, following ...

  15. [15]

    Which image do you prefer given the prompt?

    is set to −0.001. We train SD-1.5 with a learning rate of 1e−4 for 1600 optimization steps, while SD-3.5-M with a learning rate of 2e−4 for 3200 steps. For sampling xt from the policy model, classifier-free guidance is set to 1.0. SD-1.5 adopts DPMSolver++ with 20 sampling steps, while SD-3.5-M uses FlowMatchEulerDiscreteScheduler with 10 steps. Stage 2: ...

  16. [16]

    Next, we use the prompt-conditioned SD-v1.5 Inpainting model to regenerate the masked salient regions

    to the real image to extract a saliency map. Next, we use the prompt-conditioned SD-v1.5 Inpainting model to regenerate the masked salient regions. Due to the limited generative capability of the inpainting model, the regenerated regions often introduce perceptual artifacts and may not align well with the prompt, resulting in a degraded counterpart. This...

  17. [17]

    and then negligible-degradation filtering (Ours) provides additional but relatively modest improvements. Overall, the performance improves gradually rather than abruptly as curation is introduced, indicating that the method does not critically depend on highly curated data or specific selection choices. Instead, curation primarily helps stabilize and slig...

  18. [18]

    The cost is computed under a conservative setting where all 25,096 images are processed before filtering.

    Step                              Throughput       Total Cost
    Colorfulness scoring              ≈4.5 images/s    ≈1.5 GPU-hours
    Saliency + Inpainting (50 steps)  ≈0.14 images/s   ≈49 GPU-hours
    PickScore evaluation              ≈6 pairs/s       ≈1.2 GPU-hours
    Total                             –                ≈52 GPU-hours

    … data. These signals are general and transferable, ...
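The cost breakdown above can be sanity-checked with a short calculation. This sketch assumes one PickScore pair per image (the excerpt does not state the pair count) and treats the quoted throughputs as exact, although the source gives them only approximately, so the results should match the reported GPU-hours only to within rounding. `gpu_hours` is a hypothetical helper name.

```python
NUM_IMAGES = 25_096  # images processed before filtering, as stated in the excerpt

# Approximate throughputs quoted in the excerpt, in items per second.
# Assumption: the PickScore step scores one pair per image.
THROUGHPUT_PER_SECOND = {
    "colorfulness_scoring": 4.5,
    "saliency_inpainting": 0.14,
    "pickscore_evaluation": 6.0,
}


def gpu_hours(num_items: int, items_per_second: float) -> float:
    """Convert an item count and a throughput into GPU-hours on a single device."""
    if items_per_second <= 0:
        raise ValueError("throughput must be positive")
    return num_items / items_per_second / 3600.0


hours = {
    step: gpu_hours(NUM_IMAGES, rate)
    for step, rate in THROUGHPUT_PER_SECOND.items()
}
total_hours = sum(hours.values())

for step, value in hours.items():
    print(f"{step}: {value:.2f} GPU-hours")
print(f"total: {total_hours:.2f} GPU-hours")
```

Under these assumptions the inpainting step accounts for almost all of the budget, which is consistent with the reported total of about 52 GPU-hours.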