pith. sign in

arxiv: 2603.14936 · v3 · pith:7A6YQ5WXnew · submitted 2026-03-16 · 💻 cs.CV

Bridging the Intention-Expression Gap: Aligning Multi-Dimensional Preferences via Hierarchical Relevance Feedback in Text-to-Image Diffusion

Pith reviewed 2026-05-21 11:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-image diffusionpreference alignmentrelevance feedbackhierarchical feedbacktraining-free methodsmulti-dimensional preferencesuser intentstatistical inference
0
0 comments X

The pith

HRFD aligns multi-dimensional user preferences in text-to-image diffusion by organizing features into a three-tier hierarchy, decoupling inference tasks, and applying statistical comparison of liked versus disliked image sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Users often know the image they want but cannot put the exact details into words, creating a persistent mismatch with what diffusion models produce from text prompts. The paper argues this gap can be closed without retraining any model or asking users to write detailed descriptions. Instead, the framework turns simple binary likes and dislikes on generated images into precise feature adjustments. It does so by breaking the problem into ordered layers, handling one feature at a time, and measuring how feature values differ between the sets users approve and reject. If this works, image generation becomes more responsive to unspoken visual intent while staying fully outside the trained model.

Core claim

The paper claims that a Hierarchical Relevance Feedback-Driven framework captures true visual intent by placing features into a three-tier hierarchy that enforces coarse-to-fine convergence, splitting the alignment process into separate single-feature inference tasks, and using statistical inference to quantify distribution divergence between liked and disliked image sets, all while operating strictly in external text space without training or dependence on any specific foundation model.

What carries the argument

The Hierarchical Relevance Feedback-Driven (HRFD) framework, which structures features into three tiers for sequential convergence and replaces foundation-model semantic inference with statistical measurement of feature distribution differences between liked and disliked sets.

If this is right

  • Binary click feedback on images becomes sufficient to steer generations toward specific visual features without requiring users to articulate preferences in text.
  • Foundation models avoid overload because each feature is evaluated independently rather than jointly at the semantic level.
  • Preference alignment remains model-agnostic and training-free, allowing the same interface to work across different diffusion backbones.
  • Cognitive load drops because users respond only to concrete images rather than writing or refining complex prompts.
  • The process produces transparent preference measurements by directly comparing feature statistics instead of relying on opaque model inferences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hierarchy and statistical divergence method could be applied to refine preferences in video or 3D generation pipelines.
  • Iterative sessions could be shortened further if the three-tier structure is learned from past user sessions rather than fixed in advance.
  • The approach suggests a general pattern for turning binary feedback into continuous parameter updates in any generative system that exposes controllable features.
  • Combining the external statistical layer with occasional textual clarifications might handle edge cases where feature distributions overlap heavily.

Load-bearing premise

Statistical inference on the distribution divergence of features between liked and disliked image sets can reliably recover the user's exact preferred feature values even when signals conflict across dimensions.

What would settle it

A test in which users hold fixed but conflicting multi-dimensional preferences, supply binary feedback on successive image batches, and the final outputs are scored against a held-out set of images the same users judge as matching their intent; failure of the outputs to match would disprove the claim.

Figures

Figures reproduced from arXiv: 2603.14936 by Hongbin Liu, Junqi Zhang, Junyan Yuan, Mingqian Li, Wenxi Wang.

Figure 1
Figure 1. Figure 1: The overall framework of the proposed RFD method. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
read the original abstract

Users often possess a clear visual intent but struggle to articulate it precisely in language. This intention-expression gap makes aligning generated images with latent visual preferences a fundamental challenge in text-to-image diffusion models. Existing methods either require model training, sacrificing flexibility, or rely on textual feedback, imposing a heavy cognitive burden. Although recent training-free methods use click-based binary preference feedback to reduce user effort, they force Foundation Models (FMs) to infer preferences at the semantic level. When faced with multi-dimensional preferences, FMs suffer from inference overload and fail to identify exact preferred feature values under conflicting user signals. Consequently, a flexible framework for multi-dimensional feature alignment remains absent. To address this, we propose a Hierarchical Relevance Feedback-Driven (HRFD) framework. Recognizing that multiple features struggle to converge simultaneously, HRFD organizes them into a three-tier hierarchy and adapts relevance feedback to enforce coarse-to-fine convergence, minimizing cognitive load. To bypass FM inference overload, HRFD decouples the process into independent single-feature preference inference tasks. Furthermore, to overcome FMs' failure in identifying preferred values, HRFD employs statistical inference to quantify the distribution divergence of features between "liked" and "disliked" image sets, achieving robust and transparent preference measurement. Crucially, HRFD operates entirely within the external text space, remaining strictly training-free and model-agnostic. Extensive experiments demonstrate that HRFD effectively captures the user's true visual intent, significantly outperforming baseline approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the Hierarchical Relevance Feedback-Driven (HRFD) framework to address the intention-expression gap in text-to-image diffusion models. It organizes multi-dimensional features into a three-tier hierarchy to enable coarse-to-fine convergence with reduced cognitive load, decouples the preference inference into independent single-feature tasks to avoid foundation model overload, and uses statistical inference to quantify distribution divergence between liked and disliked image sets for transparent preference measurement. The framework is strictly training-free and model-agnostic, operating in the external text space. The paper claims that extensive experiments show HRFD significantly outperforms baseline approaches in capturing the user's true visual intent.

Significance. If the central claims hold, this work could have significant impact on interactive text-to-image generation by providing a flexible, low-burden method for multi-dimensional preference alignment without requiring model retraining. The hierarchical organization and statistical divergence approach offer a novel way to handle complex preferences transparently. The training-free aspect enhances its practicality and generalizability across different foundation models. However, the significance is tempered by the need to validate the independence assumption for feature preferences.

major comments (3)
  1. Abstract (HRFD components paragraph): The decoupling into independent single-feature preference inference tasks is claimed to bypass FM inference overload while recovering exact preferred feature values. However, this assumes that user preferences for features are separable and that per-feature statistical divergence can recover the joint optimum. No argument or test is provided to show that this holds when features have interdependencies, such as color interacting with composition in visual intent. This is load-bearing for the claim of robust multi-dimensional alignment.
  2. Abstract (experiments claim): The abstract asserts that 'extensive experiments demonstrate that HRFD effectively captures the user's true visual intent, significantly outperforming baseline approaches,' but provides no details on the baselines used, evaluation metrics, statistical significance testing, number of users or trials, or rules for data exclusion. This undermines the ability to assess the strength of the outperformance claim.
  3. HRFD framework description: The use of statistical inference to quantify distribution divergence between liked and disliked sets is presented as overcoming FMs' failure in identifying preferred values under conflicting signals. However, without a specific formulation or proof that this method identifies 'exact' preferred feature values (as opposed to average tendencies), it is unclear how it handles multimodal or conflicting distributions within the liked set.
minor comments (2)
  1. Abstract: The term 'Foundation Models (FMs)' is introduced without prior definition or reference to specific models used in the experiments.
  2. Abstract: The three-tier hierarchy is mentioned but not detailed in terms of what the tiers correspond to (e.g., global vs local features).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below, indicating where revisions will be made to address the concerns raised.

read point-by-point responses
  1. Referee: Abstract (HRFD components paragraph): The decoupling into independent single-feature preference inference tasks is claimed to bypass FM inference overload while recovering exact preferred feature values. However, this assumes that user preferences for features are separable and that per-feature statistical divergence can recover the joint optimum. No argument or test is provided to show that this holds when features have interdependencies, such as color interacting with composition in visual intent. This is load-bearing for the claim of robust multi-dimensional alignment.

    Authors: We acknowledge that the separability assumption requires explicit justification, particularly for interdependent features. The hierarchical organization is intended to mitigate this by resolving coarse features first, thereby constraining the search space for finer interdependent attributes. However, we agree that the current text does not provide a dedicated argument or empirical test for cases with strong interactions. In the revision, we will add a subsection in the framework description discussing this assumption, referencing related work on multi-attribute decision making, and include an ablation study examining performance under controlled feature interdependencies. revision: yes

  2. Referee: Abstract (experiments claim): The abstract asserts that 'extensive experiments demonstrate that HRFD effectively captures the user's true visual intent, significantly outperforming baseline approaches,' but provides no details on the baselines used, evaluation metrics, statistical significance testing, number of users or trials, or rules for data exclusion. This undermines the ability to assess the strength of the outperformance claim.

    Authors: The abstract is necessarily concise, with complete experimental protocols, baselines (such as direct FM-based inference and prior click-feedback methods), metrics (preference alignment score and user satisfaction ratings), statistical tests, participant counts, and exclusion criteria detailed in Section 4. To improve accessibility, we will revise the abstract to incorporate a brief clause summarizing the key experimental setup and significance results while remaining within length constraints. revision: yes

  3. Referee: HRFD framework description: The use of statistical inference to quantify distribution divergence between liked and disliked sets is presented as overcoming FMs' failure in identifying preferred values under conflicting signals. However, without a specific formulation or proof that this method identifies 'exact' preferred feature values (as opposed to average tendencies), it is unclear how it handles multimodal or conflicting distributions within the liked set.

    Authors: Section 3.3 presents the divergence computation using empirical distribution comparison (specifically, a symmetric divergence measure between binned feature histograms of liked and disliked sets), with the preferred value selected as the bin maximizing the divergence. This is not claimed to be a formal proof but follows from standard statistical mode-seeking under binary labels. For multimodal liked distributions, the current implementation selects the dominant mode via density comparison; we will expand the section with explicit equations, an illustrative example of multimodal handling, and a brief discussion of limitations when distributions are highly conflicting. revision: partial

Circularity Check

0 steps flagged

No circularity: HRFD claims rest on independent methodological choices

full rationale

The paper describes HRFD as organizing features into a three-tier hierarchy, decoupling into independent single-feature inference tasks, and applying statistical divergence measurement between liked and disliked image sets. No equations, derivations, or self-citations are shown that reduce any of these steps to tautological fits, renamed inputs, or load-bearing prior results by the same authors. The statistical inference is presented as an external measurement rather than a constructed prediction, and the framework is explicitly training-free and model-agnostic. The central claims therefore remain self-contained with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Review based solely on abstract; full paper may contain additional parameters or assumptions not visible here. The framework rests on the stated recognition that multiple features do not converge simultaneously and that statistical measures can quantify preference without model retraining.

axioms (2)
  • domain assumption Multiple features struggle to converge simultaneously in preference alignment
    Explicitly recognized in abstract as motivation for hierarchical organization.
  • domain assumption Foundation models suffer from inference overload on multi-dimensional conflicting signals
    Stated as the reason for decoupling into single-feature tasks.
invented entities (1)
  • Hierarchical Relevance Feedback-Driven (HRFD) framework no independent evidence
    purpose: To enforce coarse-to-fine convergence and enable transparent preference measurement
    Newly proposed method in the paper; no independent evidence provided beyond the abstract claim of effectiveness.

pith-pipeline@v0.9.0 · 5811 in / 1428 out tokens · 33729 ms · 2026-05-21T11:02:25.153508+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 4 internal anchors

  1. [1]

    Jingkun An, Yinghao Zhu, Zongjian Li, Enshen Zhou, Haoran Feng, Xijie Huang, Bohua Chen, Yemin Shi, and Chengwei Pan. 2025. Agfsync: Leveraging AI- generated feedback for preference optimization in text-to-image generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 1746–1754

  2. [2]

    2023.Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al . 2023.Improving image generation with better captions. Technical Report. OpenAI. https://cdn.openai. com/papers/dall-e-3.pdf

  3. [3]

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. 2024. Training Diffusion Models with Reinforcement Learning. InInternational Con- ference on Learning Representations, B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (Eds.), Vol. 2024. 4965–4987

  4. [4]

    Mattia Broilo and Francesco G. B. De Natale. 2010. A Stochastic Approach to Image Retrieval Using Relevance Feedback and Particle Swarm Optimization. IEEE Transactions on Multimedia12, 4 (2010), 267–277

  5. [5]

    Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda, and ...

  6. [6]

    arXiv:2309.15807

    Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack. arXiv:2309.15807

  7. [7]

    Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z Wang. 2008. Image retrieval: Ideas, influences, and trends of the new age.ACM Computing Surveys (CSUR)40, 2 (2008), 1–60

  8. [8]

    Shivank Garg, Ayush Singh, and Gaurav Kumar Nayak. 2026. SIDiffAgent: Self-Improving Diffusion Agent. arXiv:2602.02051

  9. [9]

    Yuhan Guo, Hanning Shao, Can Liu, Kai Xu, and Xiaoru Yuan. 2025. PrompTHis: Visualizing the Process and Influence of Prompt Editing During Text-to-Image Creation.IEEE Transactions on Visualization and Computer Graphics31, 9 (2025), 4547–4559

  10. [10]

    Jonathan Ho, Tim Salimans, Alexey A Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. 2022. Video diffusion models. InAdvances in Neural Information Processing Systems (NeurIPS), Vol. 35. 3633–3646

  11. [11]

    P. Hong, Q. Tian, and T. S. Huang. 2000. Incorporate support vector machines to content-based image retrieval with relevant feedback. InProceedings of the IEEE International Conference on Image Processing (ICIP), Vol. 3. IEEE, 750–753

  12. [12]

    Liyao Jiang, Ruichen Chen, Chao Gao, and Di Niu. 2026. RAISE: Requirement- Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment. arXiv:2603.00483

  13. [13]

    Kanimozhi and K

    T. Kanimozhi and K. Latha. 2015. An integrated approach to region based image retrieval using firefly algorithm and support vector machine.Neurocomputing 151 (2015), 1099–1111

  14. [14]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav San- thanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2023. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714

  15. [15]

    Sunwoo Kim, Minkyu Kim, and Dongmin Park. 2025. Test-time Alignment of Diffusion Models without Reward Over-optimization. InThe Thirteenth Interna- tional Conference on Learning Representations. https://openreview.net/forum? id=vi3DjUhFVm

  16. [16]

    Shu Kong, Xiaohui Shen, Zhe Lin, Radomir Mech, and Charless Fowlkes. 2016. Photo aesthetics ranking network with attributes and content adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), Vol. 9905. Springer, 662–679

  17. [17]

    Zhipeng Li, Yi-Chi Liao, and Christian Holz. 2026. Preference-Guided Prompt Optimization for Text-to-Image Generation. arXiv:2602.13131

  18. [18]

    Iain Mackie, Shubham Chatterjee, and Jeffrey Dalton. 2023. Generative relevance feedback with large language models. InProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2026–2031

  19. [19]

    Mahmood, M

    A. Mahmood, M. Imran, A. Irtaza, Q. Abbas, and H. Dhahri. 2022. Hybrid evolutionary algorithm based relevance feedback approach for image retrieval. Computers, Materials & Continua70, 1 (2022), 963–979

  20. [20]

    Chong Mou et al. 2024. T2I-Adapter: Learning adapters to dig out more con- trollable ability for text-to-image diffusion models. InProceedings of the AAAI Conference on Artificial Intelligence. 4296–4304

  21. [21]

    Keith E. Muller. 2012. Statistical power analysis for the behavioral sciences. Technometrics(2012)

  22. [22]

    Ofir Nabati, Guy Tennenholtz, ChihWei Hsu, Moonkyung Ryu, Deepak Ra- machandran, Yinlam Chow, Xiang Li, and Craig Boutilier. 2025. Preference Adaptive and Sequential Text-to-Image Generation. InForty-second International Conference on Machine Learning. https://openreview.net/forum?id=LCr6CIAEye

  23. [23]

    Savvas Petridis, Benjamin D Wedin, James Wexler, Mahima Pushkarna, Aaron Donsbach, Nitesh Goyal, Carrie J Cai, and Michael Terry. 2024. Constitution- Maker: Interactively Critiquing Large Language Models by Converting Feedback into Principles. InProceedings of the 29th International Conference on Intelli- gent User Interfaces(Greenville, SC, USA)(IUI ’24)...

  24. [24]

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. 2023. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. arXiv:2307.01952

  25. [25]

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen

  26. [26]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125

  27. [27]

    Joseph J Rocchio. 1971. Relevance feedback in information retrieval. InThe SMART Retrieval System: Experiments in Automatic Document Processing, G. Salton (Ed.). Prentice-Hall, 313–323

  28. [28]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR). 10684–10695

  29. [29]

    Yong Rui, Thomas S Huang, Sharad Mehrotra, and Michael Ortega. 1998. Rele- vance feedback: A power tool for interactive content-based image retrieval.IEEE Transactions on Circuits and Systems for Video Technology8, 5 (1998), 644–655

  30. [30]

    Eyal Segalis, Dani Valevski, Danny Lumen, Yossi Matias, and Yaniv Leviathan

  31. [31]

    arXiv:2310.16656

    A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation. arXiv:2310.16656

  32. [32]

    Sharma, S

    U. Sharma, S. Rudinac, O. S. Khan, and B. Þ. Jónsson. 2025. Can relevance feedback, conversational search and foundation models work together for interactive video search and exploration?. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 3779–3788

  33. [33]

    Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli

  34. [34]

    In Proceedings of the 32nd International Conference on Machine Learning (ICML), Vol

    Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Vol. 37. 2256–2265

  35. [35]

    Jiao Sun, Deqing Fu, Yushi Hu, Su Wang, Royi Rassin, Da-Cheng Juan, Dana Alon, Charles Herrmann, Sjoerd Van Steenkiste, Ranjay Krishna, and Cyrus Rashtchian

  36. [36]

    InProceedings of the Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL)

    DreamSync: Aligning text-to-image generation with image understanding feedback. InProceedings of the Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL). 5920–5945

  37. [37]

    Zhiwei Tang, Jiangweizhi Peng, Jiasheng Tang, Mingyi Hong, Fan Wang, and Tsung-Hui Chang. 2025. Inference-Time Alignment of Diffusion Models with Direct Noise Optimization. InForty-second International Conference on Machine Learning. https://openreview.net/forum?id=JpbqiD7n9r

  38. [38]

    Dacheng Tao, Xiaoou Tang, Xuelong Li, and Xindong Wu. 2006. Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval.IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)28, 7 (2006), 1088–1099

  39. [39]

    Simon Tong and Daphne Koller. 2001. Support vector machine active learning with applications to text classification.Journal of Machine Learning Research2, Nov (2001), 45–66

  40. [40]

    Kavana Venkatesh, Yusuf Dalva, Ismini Lourentzou, and Pinar Yanardag. 2025. RAVEL: Rare Concept Generation and Editing via Graph-driven Relational Guid- ance. arXiv:2412.09614

  41. [41]

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik

  42. [42]

    InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Diffusion Model Alignment Using Direct Preference Optimization. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 8228–8238

  43. [43]

    Jianhui Wang, Yangfan He, Yan Zhong, Xinyuan Song, Jiayi Su, Yuheng Feng, Ruoyu Wang, Hongyang He, Wenyu Zhu, Xinhang Yuan, Miao Zhang, Keqin Li, Jiaqi Chen, Tianyu Shi, and Xueqian Wang. 2026. Twin Co-Adaptive Dialogue for Progressive Image Generation. arXiv:2504.14868

  44. [44]

    Leijie Wang, Kathryn Yurechko, and Amy X. Zhang. 2025. Promptimizer: User- Led Prompt Optimization for Personal Content Classification. arXiv:2510.09009

  45. [45]

    Zhijie Wang, Yuheng Huang, Da Song, Lei Ma, and Tianyi Zhang. 2024. PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement. InProceedings of the CHI Conference on Human Factors in Computing Systems (CHI ’24). ACM, 1–21

  46. [46]

    Zhen Xing et al . 2024. SimDA: Simple diffusion adapter for efficient video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 7827–7839

  47. [47]

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. 2023. ImageReward: Learning and evaluating human preferences for text-to-image generation. InAdvances in Neural Information Processing Systems (NeurIPS), Vol. 36. 15903–15935

  48. [48]

    L. Yang, H. Qian, Z. Zhang, J. Liu, and B. Cui. 2024. Structure-guided adversar- ial training of diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 7256–7266. Wang et al

  49. [49]

    Po-Hung Yeh, Kuang-Huei Lee, and Jun cheng Chen. 2025. Training-Free Dif- fusion Model Alignment with Sampling Demons. InThe Thirteenth Interna- tional Conference on Learning Representations. https://openreview.net/forum? id=tfemquulED

  50. [50]

    Zhang, X

    J. Zhang, X. Zhou, W. Wang, B. Shi, and J. Pei. 2005. Using high dimensional indexes to support relevance feedback based interactive images retrieval. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB). 1211–1214

  51. [51]

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision (ICCV). 3836–3847

  52. [52]

    Xinchen Zhang, Ling Yang, Guohao Li, YaQi Cai, xie jiake, Yong Tang, Yu- jiu Yang, Mengdi Wang, and Bin CUI. 2025. IterComp: Iterative Composition- Aware Feedback Learning from Model Gallery for Text-to-Image Generation. InThe Thirteenth International Conference on Learning Representations. https: //openreview.net/forum?id=4w99NAikOE

  53. [53]

    Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng

  54. [54]

    MagicVideo: Efficient Video Generation With Latent Diffusion Models. (2023). arXiv:2211.11018

  55. [55]

    Xiang Sean Zhou and Thomas S Huang. 2003. Relevance feedback in image retrieval: A comprehensive review.Multimedia Systems8, 6 (2003), 536–544

  56. [56]

    L. Zou, L. Xia, Z. Ding, J. Song, W. Liu, and D. Yin. 2019. Reinforcement learning to optimize long-term user engagement in recommender systems. InProceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD). ACM, 2810–2818