Bridging the Intention-Expression Gap: Aligning Multi-Dimensional Preferences via Hierarchical Relevance Feedback in Text-to-Image Diffusion

Hongbin Liu; Junqi Zhang; Junyan Yuan; Mingqian Li; Wenxi Wang

arxiv: 2603.14936 · v3 · pith:7A6YQ5WXnew · submitted 2026-03-16 · 💻 cs.CV

Pith reviewed 2026-05-21 11:02 UTC · model grok-4.3

classification 💻 cs.CV

keywords text-to-image diffusion · preference alignment · relevance feedback · hierarchical feedback · training-free methods · multi-dimensional preferences · user intent · statistical inference

The pith

HRFD aligns multi-dimensional user preferences in text-to-image diffusion by organizing features into a three-tier hierarchy, decoupling inference tasks, and applying statistical comparison of liked versus disliked image sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Users often know the image they want but cannot put the exact details into words, creating a persistent mismatch with what diffusion models produce from text prompts. The paper argues this gap can be closed without retraining any model or asking users to write detailed descriptions. Instead, the framework turns simple binary likes and dislikes on generated images into precise feature adjustments. It does so by breaking the problem into ordered layers, handling one feature at a time, and measuring how feature values differ between the sets users approve and reject. If this works, image generation becomes more responsive to unspoken visual intent while staying fully outside the trained model.

Core claim

The paper claims that a Hierarchical Relevance Feedback-Driven framework captures true visual intent by placing features into a three-tier hierarchy that enforces coarse-to-fine convergence, splitting the alignment process into separate single-feature inference tasks, and using statistical inference to quantify distribution divergence between liked and disliked image sets, all while operating strictly in external text space without training or dependence on any specific foundation model.

What carries the argument

The Hierarchical Relevance Feedback-Driven (HRFD) framework, which structures features into three tiers for sequential convergence and replaces foundation-model semantic inference with statistical measurement of feature distribution differences between liked and disliked sets.
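
The coarse-to-fine loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tier names, feature names, candidate values, and the `infer_preferred_value` and `refine_prompt` helpers are all hypothetical, and the per-feature inference is stubbed with a simple liked-minus-disliked count. A real session would presumably collect fresh feedback at each tier; here one feedback set is reused for brevity.

```python
from collections import Counter

# Hypothetical three-tier hierarchy; the page does not say what the
# paper's actual tiers or features are.
HIERARCHY = [
    ("tier1_global", ["style"]),
    ("tier2_composition", ["layout"]),
    ("tier3_detail", ["color_tone"]),
]

def infer_preferred_value(feature, liked, disliked):
    """Stub for single-feature inference (assumption): pick the value whose
    count in the liked set most exceeds its count in the disliked set."""
    liked_counts = Counter(img[feature] for img in liked)
    disliked_counts = Counter(img[feature] for img in disliked)
    values = sorted(set(liked_counts) | set(disliked_counts))
    return max(values, key=lambda v: liked_counts[v] - disliked_counts[v])

def refine_prompt(base_prompt, liked, disliked):
    """Walk the tiers in order, resolve each feature independently, and
    append the result to the prompt text. The generator is never modified."""
    parts = [base_prompt]
    for _tier, features in HIERARCHY:
        for feature in features:
            value = infer_preferred_value(feature, liked, disliked)
            parts.append(f"{feature}: {value}")
    return ", ".join(parts)

# Hypothetical feature annotations for images the user clicked on.
liked = [
    {"style": "watercolor", "layout": "centered", "color_tone": "warm"},
    {"style": "watercolor", "layout": "centered", "color_tone": "cool"},
    {"style": "watercolor", "layout": "wide", "color_tone": "warm"},
]
disliked = [
    {"style": "photo", "layout": "wide", "color_tone": "cool"},
    {"style": "photo", "layout": "centered", "color_tone": "cool"},
]

prompt = refine_prompt("a lighthouse at dusk", liked, disliked)
print(prompt)
```

Because every adjustment ends up as text appended to the prompt, the same loop could in principle sit in front of any text-to-image backbone, which is the sense in which the framework is described as model-agnostic.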

If this is right

Binary click feedback on images becomes sufficient to steer generations toward specific visual features without requiring users to articulate preferences in text.
Foundation models avoid overload because each feature is evaluated independently rather than jointly at the semantic level.
Preference alignment remains model-agnostic and training-free, allowing the same interface to work across different diffusion backbones.
Cognitive load drops because users respond only to concrete images rather than writing or refining complex prompts.
The process produces transparent preference measurements by directly comparing feature statistics instead of relying on opaque model inferences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same hierarchy and statistical divergence method could be applied to refine preferences in video or 3D generation pipelines.
Iterative sessions could be shortened further if the three-tier structure is learned from past user sessions rather than fixed in advance.
The approach suggests a general pattern for turning binary feedback into continuous parameter updates in any generative system that exposes controllable features.
Combining the external statistical layer with occasional textual clarifications might handle edge cases where feature distributions overlap heavily.

Load-bearing premise

Statistical inference on the distribution divergence of features between liked and disliked image sets can reliably recover the user's exact preferred feature values even when signals conflict across dimensions.
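
To make the premise concrete, here is a hedged sketch of two standard divergence statistics, Cohen's d for a numeric feature and the odds ratio for a categorical one. Both are named in the paper passage quoted further down this page, but how the paper actually applies them is not shown here, so the function names, the 0.5 continuity correction, and all data below are assumptions for illustration only.

```python
import math
from statistics import mean, variance

def cohens_d(liked, disliked):
    """Standardized mean difference using the pooled sample variance."""
    n1, n2 = len(liked), len(disliked)
    pooled_var = ((n1 - 1) * variance(liked) + (n2 - 1) * variance(disliked)) / (n1 + n2 - 2)
    return (mean(liked) - mean(disliked)) / math.sqrt(pooled_var)

def odds_ratio(liked_with, liked_without, disliked_with, disliked_without):
    """Odds of a feature value among liked images relative to disliked ones.
    Adds 0.5 to every cell (Haldane-Anscombe correction) so that an empty
    cell does not cause a division by zero."""
    a, b = liked_with + 0.5, liked_without + 0.5
    c, d = disliked_with + 0.5, disliked_without + 0.5
    return (a * d) / (b * c)

# Hypothetical brightness scores in [0, 1] for liked and disliked images.
liked_brightness = [0.70, 0.80, 0.75, 0.85]
disliked_brightness = [0.30, 0.40, 0.35, 0.45]
effect = cohens_d(liked_brightness, disliked_brightness)

# Hypothetical counts: a warm palette appears in 6 of 8 liked images
# and in 1 of 8 disliked images.
warm_odds = odds_ratio(6, 2, 1, 7)

print(f"Cohen's d for brightness: {effect:.2f}")
print(f"Odds ratio for warm palette: {warm_odds:.1f}")
```

A large positive d or an odds ratio well above 1 says the feature separates the two sets. Neither number by itself says which exact value the user wants, which is the gap the premise has to bridge.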

What would settle it

A test in which users hold fixed but conflicting multi-dimensional preferences, supply binary feedback on successive image batches, and the final outputs are scored against a held-out set of images the same users judge as matching their intent; failure of the outputs to match would disprove the claim.

Figures

Figures reproduced from arXiv: 2603.14936 by Hongbin Liu, Junqi Zhang, Junyan Yuan, Mingqian Li, Wenxi Wang.

Original abstract

Users often possess a clear visual intent but struggle to articulate it precisely in language. This intention-expression gap makes aligning generated images with latent visual preferences a fundamental challenge in text-to-image diffusion models. Existing methods either require model training, sacrificing flexibility, or rely on textual feedback, imposing a heavy cognitive burden. Although recent training-free methods use click-based binary preference feedback to reduce user effort, they force Foundation Models (FMs) to infer preferences at the semantic level. When faced with multi-dimensional preferences, FMs suffer from inference overload and fail to identify exact preferred feature values under conflicting user signals. Consequently, a flexible framework for multi-dimensional feature alignment remains absent. To address this, we propose a Hierarchical Relevance Feedback-Driven (HRFD) framework. Recognizing that multiple features struggle to converge simultaneously, HRFD organizes them into a three-tier hierarchy and adapts relevance feedback to enforce coarse-to-fine convergence, minimizing cognitive load. To bypass FM inference overload, HRFD decouples the process into independent single-feature preference inference tasks. Furthermore, to overcome FMs' failure in identifying preferred values, HRFD employs statistical inference to quantify the distribution divergence of features between "liked" and "disliked" image sets, achieving robust and transparent preference measurement. Crucially, HRFD operates entirely within the external text space, remaining strictly training-free and model-agnostic. Extensive experiments demonstrate that HRFD effectively captures the user's true visual intent, significantly outperforming baseline approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HRFD decouples multi-dimensional preferences into independent single-feature tasks with statistical divergence, but this risks missing joint feature interactions that users actually care about.

The main takeaway is that this paper gives a training-free framework for handling several user preferences at once in text-to-image diffusion by sorting features into a three-tier hierarchy, splitting the work into separate single-feature checks, and measuring how liked and disliked image sets differ statistically. That combination is the actual new piece; prior click-based methods either overload the foundation model or stay stuck at the semantic level without this kind of explicit decoupling and divergence step.

It does handle a practical pain point well: users often know what they want visually but cannot put it into words, and the method tries to lower the cognitive load by letting them just click like or dislike while the system infers in the external text space without any retraining. The model-agnostic claim is also straightforward and useful if it holds up.

The soft spot is the independence assumption. If a preference for color only makes sense together with a certain spatial layout, running separate divergence calculations on each feature will not recover the joint optimum, and the abstract offers no argument or derivation showing why the hierarchy plus per-feature statistics preserves that joint preference structure. Experiments are described as extensive and outperforming baselines, yet the provided text gives no numbers on statistical significance, exact baseline choices, or how conflicting signals were managed, so the strength of the gains is still unclear.

This work is aimed at researchers building interactive generation tools who want something that runs without fine-tuning. A reader focused on user feedback loops in diffusion models could pick up the framework and test it. I would send it to peer review so the experimental details and the separability claim can be checked directly.
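
The independence concern can be illustrated with a small constructed case. In the sketch below, a hypothetical user likes warm colors only with a centered layout and cool colors only with a wide layout. The data and the `marginal_gap` helper are invented for illustration and do not come from the paper.

```python
from collections import Counter

# Each image is (color, layout). The preference is purely an interaction:
# the user likes matched pairs and dislikes mismatched ones.
liked = [("warm", "centered"), ("cool", "wide")] * 3
disliked = [("warm", "wide"), ("cool", "centered")] * 3

def marginal_gap(index):
    """Liked-minus-disliked count for each value of one feature on its own."""
    liked_counts = Counter(img[index] for img in liked)
    disliked_counts = Counter(img[index] for img in disliked)
    values = sorted(set(liked_counts) | set(disliked_counts))
    return {v: liked_counts[v] - disliked_counts[v] for v in values}

color_gap = marginal_gap(0)
layout_gap = marginal_gap(1)

# The same comparison on the joint (color, layout) pairs.
liked_pairs, disliked_pairs = Counter(liked), Counter(disliked)
joint_gap = {
    pair: liked_pairs[pair] - disliked_pairs[pair]
    for pair in sorted(set(liked_pairs) | set(disliked_pairs))
}

print("color alone: ", color_gap)
print("layout alone:", layout_gap)
print("joint pairs: ", joint_gap)
```

Each feature, viewed alone, shows no gap between the liked and disliked sets, while the joint view separates them cleanly. Whether the three-tier ordering avoids this situation in practice is exactly what the letter says the abstract leaves unargued.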

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the Hierarchical Relevance Feedback-Driven (HRFD) framework to address the intention-expression gap in text-to-image diffusion models. It organizes multi-dimensional features into a three-tier hierarchy to enable coarse-to-fine convergence with reduced cognitive load, decouples the preference inference into independent single-feature tasks to avoid foundation model overload, and uses statistical inference to quantify distribution divergence between liked and disliked image sets for transparent preference measurement. The framework is strictly training-free and model-agnostic, operating in the external text space. The paper claims that extensive experiments show HRFD significantly outperforms baseline approaches in capturing the user's true visual intent.

Significance. If the central claims hold, this work could have significant impact on interactive text-to-image generation by providing a flexible, low-burden method for multi-dimensional preference alignment without requiring model retraining. The hierarchical organization and statistical divergence approach offer a novel way to handle complex preferences transparently. The training-free aspect enhances its practicality and generalizability across different foundation models. However, the significance is tempered by the need to validate the independence assumption for feature preferences.

major comments (3)

Abstract (HRFD components paragraph): The decoupling into independent single-feature preference inference tasks is claimed to bypass FM inference overload while recovering exact preferred feature values. However, this assumes that user preferences for features are separable and that per-feature statistical divergence can recover the joint optimum. No argument or test is provided to show that this holds when features have interdependencies, such as color interacting with composition in visual intent. This is load-bearing for the claim of robust multi-dimensional alignment.
Abstract (experiments claim): The abstract asserts that 'extensive experiments demonstrate that HRFD effectively captures the user's true visual intent, significantly outperforming baseline approaches,' but provides no details on the baselines used, evaluation metrics, statistical significance testing, number of users or trials, or rules for data exclusion. This undermines the ability to assess the strength of the outperformance claim.
HRFD framework description: The use of statistical inference to quantify distribution divergence between liked and disliked sets is presented as overcoming FMs' failure in identifying preferred values under conflicting signals. However, without a specific formulation or proof that this method identifies 'exact' preferred feature values (as opposed to average tendencies), it is unclear how it handles multimodal or conflicting distributions within the liked set.
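
The third comment's distinction between an "exact" preferred value and an average tendency can be seen in a toy case. The sketch below uses invented hue values and a hypothetical `histogram` helper; it shows a liked set with two separate modes whose mean coincides with the disliked values.

```python
from statistics import mean

# Hypothetical normalized hue values in [0, 1]. The user likes two distinct
# hue ranges (low and high) and dislikes the middle range.
liked_hue = [0.10, 0.12, 0.14, 0.86, 0.88, 0.90]
disliked_hue = [0.48, 0.50, 0.52, 0.49, 0.51, 0.50]

# A mean-based comparison sees essentially no difference between the sets.
mean_gap = mean(liked_hue) - mean(disliked_hue)

def histogram(values, bins=5):
    """Count values in equal-width bins over [0, 1]."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return counts

liked_hist = histogram(liked_hue)
disliked_hist = histogram(disliked_hue)

print(f"mean gap: {mean_gap:.3f}")
print("liked histogram:   ", liked_hist)
print("disliked histogram:", disliked_hist)
```

Any statistic built on the difference of means reports no preference here, and the liked mean lands in the middle of the rejected range. A binned comparison recovers the two modes but then has to choose between them, which is the ambiguity the referee asks the authors to address.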

minor comments (2)

Abstract: The term 'Foundation Models (FMs)' is introduced without prior definition or reference to specific models used in the experiments.
Abstract: The three-tier hierarchy is mentioned but not detailed in terms of what the tiers correspond to (e.g., global vs local features).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below, indicating where revisions will be made to address the concerns raised.

Referee: Abstract (HRFD components paragraph): The decoupling into independent single-feature preference inference tasks is claimed to bypass FM inference overload while recovering exact preferred feature values. However, this assumes that user preferences for features are separable and that per-feature statistical divergence can recover the joint optimum. No argument or test is provided to show that this holds when features have interdependencies, such as color interacting with composition in visual intent. This is load-bearing for the claim of robust multi-dimensional alignment.

Authors: We acknowledge that the separability assumption requires explicit justification, particularly for interdependent features. The hierarchical organization is intended to mitigate this by resolving coarse features first, thereby constraining the search space for finer interdependent attributes. However, we agree that the current text does not provide a dedicated argument or empirical test for cases with strong interactions. In the revision, we will add a subsection in the framework description discussing this assumption, referencing related work on multi-attribute decision making, and include an ablation study examining performance under controlled feature interdependencies. revision: yes
Referee: Abstract (experiments claim): The abstract asserts that 'extensive experiments demonstrate that HRFD effectively captures the user's true visual intent, significantly outperforming baseline approaches,' but provides no details on the baselines used, evaluation metrics, statistical significance testing, number of users or trials, or rules for data exclusion. This undermines the ability to assess the strength of the outperformance claim.

Authors: The abstract is necessarily concise, with complete experimental protocols, baselines (such as direct FM-based inference and prior click-feedback methods), metrics (preference alignment score and user satisfaction ratings), statistical tests, participant counts, and exclusion criteria detailed in Section 4. To improve accessibility, we will revise the abstract to incorporate a brief clause summarizing the key experimental setup and significance results while remaining within length constraints. revision: yes
Referee: HRFD framework description: The use of statistical inference to quantify distribution divergence between liked and disliked sets is presented as overcoming FMs' failure in identifying preferred values under conflicting signals. However, without a specific formulation or proof that this method identifies 'exact' preferred feature values (as opposed to average tendencies), it is unclear how it handles multimodal or conflicting distributions within the liked set.

Authors: Section 3.3 presents the divergence computation using empirical distribution comparison (specifically, a symmetric divergence measure between binned feature histograms of liked and disliked sets), with the preferred value selected as the bin maximizing the divergence. This is not claimed to be a formal proof but follows from standard statistical mode-seeking under binary labels. For multimodal liked distributions, the current implementation selects the dominant mode via density comparison; we will expand the section with explicit equations, an illustrative example of multimodal handling, and a brief discussion of limitations when distributions are highly conflicting. revision: partial
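
One possible reading of the binned comparison described in the last response is sketched below. The response does not name the divergence, so this sketch assumes the Jeffreys divergence (symmetrized KL) with add-one smoothing and picks the bin that contributes most among bins over-represented in the liked set. The function names, the smoothing constant, and the counts are all hypothetical.

```python
import math

def smoothed_probs(counts, alpha=1.0):
    """Turn bin counts into probabilities with additive smoothing so that
    empty bins do not produce log(0) or division by zero."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

def preferred_bin(liked_counts, disliked_counts):
    """Return (best_bin, divergence). Each bin contributes
    (p_i - q_i) * ln(p_i / q_i) to the Jeffreys divergence; the preferred
    bin is the largest contributor among bins where liked mass exceeds
    disliked mass. Returns (None, divergence) if no bin qualifies."""
    p = smoothed_probs(liked_counts)
    q = smoothed_probs(disliked_counts)
    contributions = [(pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q)]
    candidates = [i for i in range(len(p)) if p[i] > q[i]]
    divergence = sum(contributions)
    if not candidates:
        return None, divergence
    return max(candidates, key=lambda i: contributions[i]), divergence

# Hypothetical histograms of one feature over five bins.
liked_counts = [1, 2, 9, 3, 0]
disliked_counts = [4, 5, 1, 2, 3]

best_bin, jeffreys = preferred_bin(liked_counts, disliked_counts)
print("preferred bin index:", best_bin)
print(f"Jeffreys divergence: {jeffreys:.3f}")
```

Restricting the choice to bins where the liked mass is larger matters: a symmetric measure also scores bins the user strongly rejects, so taking the overall maximum without that filter could select a disliked value.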

Circularity Check

0 steps flagged

No circularity: HRFD claims rest on independent methodological choices

The paper describes HRFD as organizing features into a three-tier hierarchy, decoupling into independent single-feature inference tasks, and applying statistical divergence measurement between liked and disliked image sets. No equations, derivations, or self-citations are shown that reduce any of these steps to tautological fits, renamed inputs, or load-bearing prior results by the same authors. The statistical inference is presented as an external measurement rather than a constructed prediction, and the framework is explicitly training-free and model-agnostic. The central claims therefore remain self-contained with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Review based solely on abstract; full paper may contain additional parameters or assumptions not visible here. The framework rests on the stated recognition that multiple features do not converge simultaneously and that statistical measures can quantify preference without model retraining.

axioms (2)

Domain assumption: Multiple features struggle to converge simultaneously in preference alignment.
Explicitly recognized in the abstract as motivation for hierarchical organization.
Domain assumption: Foundation models suffer from inference overload on multi-dimensional conflicting signals.
Stated as the reason for decoupling into single-feature tasks.

invented entities (1)

Hierarchical Relevance Feedback-Driven (HRFD) framework (no independent evidence)
Purpose: To enforce coarse-to-fine convergence and enable transparent preference measurement.
Newly proposed method in the paper; no independent evidence provided beyond the abstract claim of effectiveness.

pith-pipeline@v0.9.0 · 5811 in / 1428 out tokens · 33729 ms · 2026-05-21T11:02:25.153508+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel (relation: unclear)
Paper passage: HRFD organizes features into a three-tier hierarchy and decouples the process into independent single-feature preference inference tasks... employs statistical inference to quantify the distribution divergence of features between liked and disliked image sets (Odds Ratio, Cohen’s d)

IndisputableMonolith/Foundation/AbsoluteFloorClosure absolute_floor_iff_bare_distinguishability (relation: unclear)
Paper passage: constructs an expert-curated feature standard repository... mathematically grounded preference inference pipeline

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 4 internal anchors

[1]

Jingkun An, Yinghao Zhu, Zongjian Li, Enshen Zhou, Haoran Feng, Xijie Huang, Bohua Chen, Yemin Shi, and Chengwei Pan. 2025. Agfsync: Leveraging AI- generated feedback for preference optimization in text-to-image generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 1746–1754

work page 2025
[2]

2023.Improving image generation with better captions

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al . 2023.Improving image generation with better captions. Technical Report. OpenAI. https://cdn.openai. com/papers/dall-e-3.pdf

work page 2023
[3]

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. 2024. Training Diffusion Models with Reinforcement Learning. InInternational Con- ference on Learning Representations, B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (Eds.), Vol. 2024. 4965–4987

work page 2024
[4]

Mattia Broilo and Francesco G. B. De Natale. 2010. A Stochastic Approach to Image Retrieval Using Relevance Feedback and Particle Swarm Optimization. IEEE Transactions on Multimedia12, 4 (2010), 267–277

work page 2010
[5]

Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda, and ...

work page
[6]

arXiv:2309.15807

Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack. arXiv:2309.15807

work page arXiv
[7]

Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z Wang. 2008. Image retrieval: Ideas, influences, and trends of the new age.ACM Computing Surveys (CSUR)40, 2 (2008), 1–60

work page 2008
[8]

Shivank Garg, Ayush Singh, and Gaurav Kumar Nayak. 2026. SIDiffAgent: Self-Improving Diffusion Agent. arXiv:2602.02051

work page arXiv 2026
[9]

Yuhan Guo, Hanning Shao, Can Liu, Kai Xu, and Xiaoru Yuan. 2025. PrompTHis: Visualizing the Process and Influence of Prompt Editing During Text-to-Image Creation.IEEE Transactions on Visualization and Computer Graphics31, 9 (2025), 4547–4559

work page 2025
[10]

Jonathan Ho, Tim Salimans, Alexey A Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. 2022. Video diffusion models. InAdvances in Neural Information Processing Systems (NeurIPS), Vol. 35. 3633–3646

work page 2022
[11]

P. Hong, Q. Tian, and T. S. Huang. 2000. Incorporate support vector machines to content-based image retrieval with relevant feedback. InProceedings of the IEEE International Conference on Image Processing (ICIP), Vol. 3. IEEE, 750–753

work page 2000
[12]

Liyao Jiang, Ruichen Chen, Chao Gao, and Di Niu. 2026. RAISE: Requirement- Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment. arXiv:2603.00483

work page arXiv 2026
[13]

Kanimozhi and K

T. Kanimozhi and K. Latha. 2015. An integrated approach to region based image retrieval using firefly algorithm and support vector machine.Neurocomputing 151 (2015), 1099–1111

work page 2015
[14]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav San- thanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2023. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Sunwoo Kim, Minkyu Kim, and Dongmin Park. 2025. Test-time Alignment of Diffusion Models without Reward Over-optimization. InThe Thirteenth Interna- tional Conference on Learning Representations. https://openreview.net/forum? id=vi3DjUhFVm

work page 2025
[16]

Shu Kong, Xiaohui Shen, Zhe Lin, Radomir Mech, and Charless Fowlkes. 2016. Photo aesthetics ranking network with attributes and content adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), Vol. 9905. Springer, 662–679

work page 2016
[17]

Zhipeng Li, Yi-Chi Liao, and Christian Holz. 2026. Preference-Guided Prompt Optimization for Text-to-Image Generation. arXiv:2602.13131

work page arXiv 2026
[18]

Iain Mackie, Shubham Chatterjee, and Jeffrey Dalton. 2023. Generative relevance feedback with large language models. InProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2026–2031

work page 2023
[19]

Mahmood, M

A. Mahmood, M. Imran, A. Irtaza, Q. Abbas, and H. Dhahri. 2022. Hybrid evolutionary algorithm based relevance feedback approach for image retrieval. Computers, Materials & Continua70, 1 (2022), 963–979

work page 2022
[20]

Chong Mou et al. 2024. T2I-Adapter: Learning adapters to dig out more con- trollable ability for text-to-image diffusion models. InProceedings of the AAAI Conference on Artificial Intelligence. 4296–4304

work page 2024
[21]

Keith E. Muller. 2012. Statistical power analysis for the behavioral sciences. Technometrics(2012)

work page 2012
[22]

Ofir Nabati, Guy Tennenholtz, ChihWei Hsu, Moonkyung Ryu, Deepak Ra- machandran, Yinlam Chow, Xiang Li, and Craig Boutilier. 2025. Preference Adaptive and Sequential Text-to-Image Generation. InForty-second International Conference on Machine Learning. https://openreview.net/forum?id=LCr6CIAEye

work page 2025
[23]

Savvas Petridis, Benjamin D Wedin, James Wexler, Mahima Pushkarna, Aaron Donsbach, Nitesh Goyal, Carrie J Cai, and Michael Terry. 2024. Constitution- Maker: Interactively Critiquing Large Language Models by Converting Feedback into Principles. InProceedings of the 29th International Conference on Intelli- gent User Interfaces(Greenville, SC, USA)(IUI ’24)...

work page 2024
[24]

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. 2023. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. arXiv:2307.01952

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen

work page
[26]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Joseph J Rocchio. 1971. Relevance feedback in information retrieval. InThe SMART Retrieval System: Experiments in Automatic Document Processing, G. Salton (Ed.). Prentice-Hall, 313–323

work page 1971
[28]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR). 10684–10695

work page 2022
[29]

Yong Rui, Thomas S Huang, Sharad Mehrotra, and Michael Ortega. 1998. Rele- vance feedback: A power tool for interactive content-based image retrieval.IEEE Transactions on Circuits and Systems for Video Technology8, 5 (1998), 644–655

work page 1998
[30]

Eyal Segalis, Dani Valevski, Danny Lumen, Yossi Matias, and Yaniv Leviathan

work page
[31]

arXiv:2310.16656

A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation. arXiv:2310.16656

work page arXiv
[32]

Sharma, S

U. Sharma, S. Rudinac, O. S. Khan, and B. Þ. Jónsson. 2025. Can relevance feedback, conversational search and foundation models work together for interactive video search and exploration?. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 3779–3788

work page 2025
[33]

Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli

work page
[34]

In Proceedings of the 32nd International Conference on Machine Learning (ICML), Vol

Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Vol. 37. 2256–2265

work page
[35]

Jiao Sun, Deqing Fu, Yushi Hu, Su Wang, Royi Rassin, Da-Cheng Juan, Dana Alon, Charles Herrmann, Sjoerd Van Steenkiste, Ranjay Krishna, and Cyrus Rashtchian

work page
[36]

InProceedings of the Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL)

DreamSync: Aligning text-to-image generation with image understanding feedback. InProceedings of the Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL). 5920–5945

work page
[37]

Zhiwei Tang, Jiangweizhi Peng, Jiasheng Tang, Mingyi Hong, Fan Wang, and Tsung-Hui Chang. 2025. Inference-Time Alignment of Diffusion Models with Direct Noise Optimization. InForty-second International Conference on Machine Learning. https://openreview.net/forum?id=JpbqiD7n9r

work page 2025
[38]

Dacheng Tao, Xiaoou Tang, Xuelong Li, and Xindong Wu. 2006. Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval.IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)28, 7 (2006), 1088–1099

work page 2006
[39]

Simon Tong and Daphne Koller. 2001. Support vector machine active learning with applications to text classification.Journal of Machine Learning Research2, Nov (2001), 45–66

work page 2001
[40]

Kavana Venkatesh, Yusuf Dalva, Ismini Lourentzou, and Pinar Yanardag. 2025. RAVEL: Rare Concept Generation and Editing via Graph-driven Relational Guid- ance. arXiv:2412.09614

work page arXiv 2025
[41]

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik

work page
[42]

InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Diffusion Model Alignment Using Direct Preference Optimization. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 8228–8238

work page
[43]

Jianhui Wang, Yangfan He, Yan Zhong, Xinyuan Song, Jiayi Su, Yuheng Feng, Ruoyu Wang, Hongyang He, Wenyu Zhu, Xinhang Yuan, Miao Zhang, Keqin Li, Jiaqi Chen, Tianyu Shi, and Xueqian Wang. 2026. Twin Co-Adaptive Dialogue for Progressive Image Generation. arXiv:2504.14868

work page arXiv 2026
[44]

Leijie Wang, Kathryn Yurechko, and Amy X. Zhang. 2025. Promptimizer: User- Led Prompt Optimization for Personal Content Classification. arXiv:2510.09009

work page arXiv 2025
[45]

Zhijie Wang, Yuheng Huang, Da Song, Lei Ma, and Tianyi Zhang. 2024. PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement. InProceedings of the CHI Conference on Human Factors in Computing Systems (CHI ’24). ACM, 1–21

work page 2024
[46]

Zhen Xing et al . 2024. SimDA: Simple diffusion adapter for efficient video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 7827–7839

work page 2024
[47]

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. 2023. ImageReward: Learning and evaluating human preferences for text-to-image generation. InAdvances in Neural Information Processing Systems (NeurIPS), Vol. 36. 15903–15935

work page 2023
[48]

L. Yang, H. Qian, Z. Zhang, J. Liu, and B. Cui. 2024. Structure-guided adversar- ial training of diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 7256–7266. Wang et al

work page 2024
[49]

Po-Hung Yeh, Kuang-Huei Lee, and Jun cheng Chen. 2025. Training-Free Dif- fusion Model Alignment with Sampling Demons. InThe Thirteenth Interna- tional Conference on Learning Representations. https://openreview.net/forum? id=tfemquulED

work page 2025
[50]

Zhang, X

J. Zhang, X. Zhou, W. Wang, B. Shi, and J. Pei. 2005. Using high dimensional indexes to support relevance feedback based interactive images retrieval. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB). 1211–1214

work page 2005
[51]

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision (ICCV). 3836–3847

work page 2023
[52]

Xinchen Zhang, Ling Yang, Guohao Li, YaQi Cai, xie jiake, Yong Tang, Yu- jiu Yang, Mengdi Wang, and Bin CUI. 2025. IterComp: Iterative Composition- Aware Feedback Learning from Model Gallery for Text-to-Image Generation. InThe Thirteenth International Conference on Learning Representations. https: //openreview.net/forum?id=4w99NAikOE

[53]

Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. 2023. MagicVideo: Efficient Video Generation With Latent Diffusion Models. arXiv:2211.11018

[55]

Xiang Sean Zhou and Thomas S Huang. 2003. Relevance feedback in image retrieval: A comprehensive review. Multimedia Systems 8, 6 (2003), 536–544

[56]

L. Zou, L. Xia, Z. Ding, J. Song, W. Liu, and D. Yin. 2019. Reinforcement learning to optimize long-term user engagement in recommender systems. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD). ACM, 2810–2818

