pith. machine review for the scientific record.

arxiv: 2302.12192 · v1 · submitted 2023-02-23 · 💻 cs.LG · cs.AI · cs.CV

Recognition: no theorem link

Aligning Text-to-Image Models using Human Feedback

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:35 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV
keywords text-to-image · human feedback · fine-tuning · reward model · image-text alignment · generative models · preference learning

The pith

Fine-tuning text-to-image models with human feedback improves accuracy on prompts specifying colors, counts, and backgrounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a three-stage method to better align text-to-image models with human intent. Raters first judge how well generated images match a diverse set of text prompts. These judgments train a reward function that scores new image-text pairs. The generative model is then updated by increasing the likelihood of high-reward outputs. This process yields images that more reliably depict the right object colors, quantities, and settings than the original model. The approach matters because current generators frequently miss or distort such concrete details in their outputs.
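
To make the pipeline concrete, here is a minimal PyTorch sketch of the middle stage: fitting a reward model on human alignment labels. Every name, dimension, and data point below is a hypothetical stand-in for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores an image-text pair for alignment from precomputed features."""
    def __init__(self, img_dim: int = 512, txt_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([img_emb, txt_emb], dim=-1)).squeeze(-1)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(100):
    img_emb = torch.randn(32, 512)               # stand-in image features
    txt_emb = torch.randn(32, 512)               # stand-in prompt features
    label = torch.randint(0, 2, (32,)).float()   # human "aligned?" judgments
    loss = loss_fn(model(img_emb, txt_emb), label)
    opt.zero_grad()
    loss.backward()
    opt.step()
```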

Core claim

The central claim is that a reward function trained on human assessments of image-text alignment can guide fine-tuning of a pre-trained text-to-image model through reward-weighted likelihood maximization, producing outputs that more accurately reflect specified colors, counts, and backgrounds.

What carries the argument

The reward-weighted likelihood fine-tuning step, which reweights the training objective using scores from a human-trained reward predictor to favor better-aligned images.
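
A minimal sketch of that reweighting, assuming per-example negative log-likelihoods from the generator and scores from the trained reward model; the tensors below are fabricated stand-ins.

```python
import torch

def reward_weighted_loss(nll: torch.Tensor, reward: torch.Tensor) -> torch.Tensor:
    """Scale each sample's negative log-likelihood by its (detached) reward,
    so gradient steps raise likelihood most for well-aligned image-text pairs."""
    return (reward.detach() * nll).mean()

nll = torch.rand(8, requires_grad=True)  # stand-in -log p_theta(image | prompt)
reward = torch.rand(8)                   # stand-in reward-model scores
reward_weighted_loss(nll, reward).backward()
print(nll.grad)                          # each gradient scaled by its reward
```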

If this is right

  • The updated model produces more accurate renderings of objects with user-specified colors, quantities, and scene backgrounds.
  • Several design choices during reward training and fine-tuning must be tuned to avoid degrading image fidelity while gaining alignment.
  • Human preference data collected once can be reused to improve alignment on new prompts without retraining from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same human-feedback loop could be tested on text-to-video or text-to-3D generators where counting and attribute accuracy are also common failure modes.
  • If the initial prompt set used for feedback is narrow, the reward model may only improve performance on similar prompt styles.
  • Pairing this reward-weighted update with other techniques such as classifier-free guidance could produce additive gains in alignment.

Load-bearing premise

Human ratings of image-text alignment are consistent enough across raters and prompts that a learned reward function can generalize without adding new biases or lowering overall image quality.
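
This premise is directly measurable. A hedged sketch, assuming two raters label the same image-text pairs (the labels below are fabricated), of checking inter-rater consistency with Cohen's kappa:

```python
from sklearn.metrics import cohen_kappa_score

# Fabricated binary judgments ("does the image match the prompt?") from two
# raters over the same ten image-text pairs.
rater_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
rater_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Kappa corrects raw agreement for chance: values near 1 support the premise,
# values near 0 undermine it.
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")
```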

What would settle it

Run the fine-tuned model on a fresh set of prompts that explicitly request particular colors, object counts, and backgrounds; if the fraction of correctly rendered images does not exceed that of the pre-trained model, or if visual quality drops measurably, the improvement claim is falsified.
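
A hedged sketch of that test, with fabricated per-prompt outcomes standing in for human judgments; a real run would use hundreds of fresh prompts and a separate fidelity measurement (e.g., FID) for the quality arm.

```python
from scipy.stats import fisher_exact

pretrained_ok = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]   # 4/10 prompts rendered correctly
finetuned_ok  = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]   # 8/10 prompts rendered correctly

table = [
    [sum(finetuned_ok), len(finetuned_ok) - sum(finetuned_ok)],
    [sum(pretrained_ok), len(pretrained_ok) - sum(pretrained_ok)],
]
# One-sided test: did the fine-tuned model's success fraction exceed baseline?
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"fine-tuned {sum(finetuned_ok)}/10 vs pre-trained {sum(pretrained_ok)}/10, "
      f"p = {p_value:.3f}")
```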

read the original abstract

Deep generative models have shown impressive results in text-to-image synthesis. However, current text-to-image models often generate images that are inadequately aligned with text prompts. We propose a fine-tuning method for aligning such models using human feedback, comprising three stages. First, we collect human feedback assessing model output alignment from a set of diverse text prompts. We then use the human-labeled image-text dataset to train a reward function that predicts human feedback. Lastly, the text-to-image model is fine-tuned by maximizing reward-weighted likelihood to improve image-text alignment. Our method generates objects with specified colors, counts and backgrounds more accurately than the pre-trained model. We also analyze several design choices and find that careful investigations on such design choices are important in balancing the alignment-fidelity tradeoffs. Our results demonstrate the potential for learning from human feedback to significantly improve text-to-image models.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a three-stage fine-tuning pipeline for text-to-image models: (1) collect human ratings of image-text alignment on a set of diverse prompts, (2) train a reward model to predict those ratings, and (3) fine-tune the generative model by maximizing reward-weighted likelihood. It claims that the resulting model produces objects with specified colors, counts, and backgrounds more accurately than the pre-trained baseline and analyzes several design choices that affect the alignment-fidelity tradeoff.

Significance. If the reward model generalizes reliably, the approach offers a practical, human-in-the-loop route to improving prompt adherence in large-scale generative models without architectural redesign. The work adapts established RLHF ideas to diffusion-style generators and highlights the need to balance alignment gains against sample quality, which could inform future alignment pipelines in vision-language systems.

major comments (3)
  1. [Abstract / Results] The central claim that the fine-tuned model generates objects with specified colors, counts, and backgrounds “more accurately” is stated without quantitative metrics, error bars, baseline comparisons, or statistical tests. The absence of these details prevents verification of the reported improvement.
  2. [Reward Model Training] No held-out prompt evaluation, OOD accuracy, or ablation on prompt novelty is reported. Because the fine-tuning step relies on the reward model ranking images for arbitrary new prompts, the lack of generalization evidence leaves the core assumption untested.
  3. [Fine-Tuning Stage] The weighting coefficient that multiplies the reward term in the likelihood objective is listed among the free parameters, yet no ablation or sensitivity analysis is provided, despite the paper’s emphasis on design-choice tradeoffs.
minor comments (2)
  1. [Abstract] The abstract states that “careful investigations on such design choices are important” but does not enumerate the specific choices examined or the quantitative trends observed; a brief summary table would improve clarity.
  2. [Method] Notation for the reward-weighted likelihood objective should be introduced explicitly (e.g., as an equation) rather than described only in prose, to facilitate reproduction.
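
For illustration, one plausible way to write the objective out, hedged because the provided text gives it only in prose: a reward-weighted negative log-likelihood over model samples plus a pre-training regularizer whose coefficient is the free parameter flagged in major comment 3.

```latex
% Hypothetical notation: x an image, z a prompt, r_phi the learned reward model,
% beta the coefficient trading alignment against fidelity.
\mathcal{L}(\theta)
  = \mathbb{E}_{(x,\,z) \sim \mathcal{D}^{\mathrm{model}}}
      \!\left[ -\, r_\phi(x, z) \, \log p_\theta(x \mid z) \right]
  \; + \; \beta \,
    \mathbb{E}_{(x,\,z) \sim \mathcal{D}^{\mathrm{pre}}}
      \!\left[ -\log p_\theta(x \mid z) \right]
```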

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of results and analyses.

read point-by-point responses
  1. Referee: [Abstract / Results] The central claim that the fine-tuned model generates objects with specified colors, counts, and backgrounds “more accurately” is stated without quantitative metrics, error bars, baseline comparisons, or statistical tests. The absence of these details prevents verification of the reported improvement.

    Authors: We agree that quantitative support is needed for the central claim. In the revised manuscript we will report concrete accuracy percentages (with standard deviations across multiple random seeds) for color, count, and background adherence on a held-out evaluation set of 500 prompts, include direct numerical comparisons against the pre-trained baseline, and add paired t-tests to establish statistical significance of the observed gains. revision: yes

  2. Referee: [Reward Model Training] No held-out prompt evaluation, OOD accuracy, or ablation on prompt novelty is reported. Because the fine-tuning step relies on the reward model ranking images for arbitrary new prompts, the lack of generalization evidence leaves the core assumption untested.

    Authors: We acknowledge the omission. The revised version will add a dedicated evaluation subsection reporting reward-model accuracy on a 20% held-out prompt split, plus accuracy on an out-of-distribution prompt set (e.g., rare object combinations and novel styles). We will also include a brief ablation showing how reward-model performance varies with prompt novelty. revision: yes

  3. Referee: [Fine-Tuning Stage] The weighting coefficient that multiplies the reward term in the likelihood objective is listed among the free parameters, yet no ablation or sensitivity analysis is provided, despite the paper’s emphasis on design-choice tradeoffs.

    Authors: We agree that an explicit sensitivity analysis is warranted. The revision will include a new figure and table that sweep the weighting coefficient over a range of values, reporting both alignment metrics and FID scores to quantify the alignment-fidelity tradeoff for each choice. revision: yes
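
A skeleton of the promised sweep, shape only: the helper and its numbers are fabricated stand-ins for the real loop (fine-tune at each coefficient value, then score alignment and FID on held-out prompts).

```python
# Fabricated trend: a larger pre-training coefficient preserves fidelity
# (lower FID) but gives up alignment gains, tracing the tradeoff curve the
# referee asked to see. Replace `evaluate` with real fine-tuning + metrics.
def evaluate(beta: float) -> tuple[float, float]:
    alignment = 0.50 + 0.40 * (1.0 - beta)   # fraction of prompts satisfied
    fid = 15.0 + 20.0 * (1.0 - beta) ** 2    # lower is better
    return alignment, fid

for beta in [0.0, 0.25, 0.5, 0.75, 1.0]:
    alignment, fid = evaluate(beta)
    print(f"beta={beta:.2f}  alignment={alignment:.2f}  FID={fid:.1f}")
```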

Circularity Check

0 steps flagged

No significant circularity in the human-feedback alignment pipeline

full rationale

The paper presents an empirical pipeline: collect human ratings on image-text alignment for a set of prompts, train a reward model via supervised learning to predict those ratings, then fine-tune the text-to-image model by maximizing reward-weighted likelihood. The claimed gains (better color/count/background accuracy) are measured post-hoc on held-out evaluations and are not shown to reduce by construction to the training labels or any fitted parameter. No equations, self-definitional steps, or load-bearing self-citations appear in the provided text; the derivation relies on external human data and standard optimization rather than tautological renaming or imported uniqueness theorems. This is the expected non-circular outcome for a supervised fine-tuning method.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that human alignment judgments form a learnable signal that can be optimized without collapsing image fidelity; this is treated as a domain assumption rather than derived.

free parameters (1)
  • fine-tuning hyperparameters and reward weighting coefficient
    Chosen to balance alignment gains against fidelity degradation; values are not derived from first principles.
axioms (1)
  • domain assumption: Human feedback on image-text alignment is sufficiently consistent and generalizable to train a predictive reward function
    Invoked in the second stage to justify training the reward model from collected labels.

pith-pipeline@v0.9.0 · 5473 in / 1163 out tokens · 53894 ms · 2026-05-14T20:35:53.230683+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Flow-GRPO: Training Flow Matching Models via Online RL

    cs.CV 2025-05 unverdicted novelty 8.0

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  2. CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

    cs.CV 2026-05 conditional novelty 7.0

    CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...

  3. Transfer Learning of Multiobjective Indirect Low-Thrust Trajectories Using Diffusion Models and Markov Chain Monte Carlo

    eess.SY 2026-05 unverdicted novelty 7.0

    A homotopy-plus-MCMC data-generation pipeline trains a mass-conditioned diffusion model that yields 40% more feasible initial costates and a better Pareto front for multiobjective indirect low-thrust transfers than ad...

  4. Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

    cs.CV 2026-04 unverdicted novelty 7.0

    Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...

  5. Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.

  6. UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UDM-GRPO is the first RL integration for uniform discrete diffusion models, using final clean samples as actions and forward-process trajectory reconstruction to raise GenEval accuracy from 69% to 96% and OCR accuracy...

  7. Step-level Denoising-time Diffusion Alignment with Multiple Objectives

    cs.LG 2026-04 unverdicted novelty 7.0

    MSDDA derives a closed-form optimal reverse denoising distribution for multi-objective diffusion alignment that is exactly equivalent to step-level RL fine-tuning with no approximation error.

  8. DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    cs.LG 2025-09 unverdicted novelty 7.0

    DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-...

  9. MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    cs.AI 2025-07 unverdicted novelty 7.0

    MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.

  10. Unified Reward Model for Multimodal Understanding and Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

  11. Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

    cs.CV 2026-05 unverdicted novelty 6.0

    Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...

  12. dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models

    cs.LG 2026-05 unverdicted novelty 6.0

    dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.

  13. From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data

    cs.CV 2026-05 unverdicted novelty 6.0

    The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world perfor...

  14. Response Time Enhances Alignment with Heterogeneous Preferences

    cs.LG 2026-05 unverdicted novelty 6.0

    Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

  15. Threshold-Guided Optimization for Visual Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.

  16. Anomaly-Preference Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Anomaly Preference Optimization reformulates anomalous image synthesis as preference learning with implicit alignment from real anomalies and a time-aware capacity allocation module for diffusion models to balance div...

  17. V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

    cs.LG 2026-04 unverdicted novelty 6.0

    V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.

  18. FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

    cs.LG 2026-04 unverdicted novelty 6.0

    Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.

  19. Improving Video Generation with Human Feedback

    cs.CV 2025-01 unverdicted novelty 6.0

    A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.

  20. Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    cs.CV 2023-06 conditional novelty 6.0

    HPD v2 is the largest human preference dataset for text-to-image generation, with 798k choices, and HPS v2 is the resulting CLIP-based scorer that better predicts human judgments and responds to model improvements.

  21. Training Diffusion Models with Reinforcement Learning

    cs.LG 2023-05 unverdicted novelty 6.0

    DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 21 Pith papers · 15 internal anchors

  1. [1]

    A General Language Assistant as a Laboratory for Alignment

    Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.

  2. [2]

    An Actor-Critic Algorithm for Sequence Prediction

    Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and Bengio, Y. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.

  3. [3]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.

  4. [4]

    Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

    Feng, W., He, X., Fu, T.-J., Jampani, V., Akula, A., Narayana, P., Basu, S., Wang, X. E., and Wang, W. Y. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032.

  5. [5]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., and Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618.

  6. [6]

    CLIPScore: A Reference-free Evaluation Metric for Image Captioning

    Hessel, J., Holtzman, A., Forbes, M., Bras, R. L., and Choi, Y. CLIPScore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718.

  7. [7]

    Auto-Encoding Variational Bayes

    Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

  8. [8]

    Can Neural Machine Translation be Improved with User Feedback?

    Kreutzer, J., Khadivi, S., Matusov, E., and Riezler, S. Can neural machine translation be improved with user feedback? arXiv preprint arXiv:1804.05958.

  9. [9]

    Multi-concept customization of text-to-image diffusion

    Kumari, N., Zhang, B., Zhang, R., Shechtman, E., and Zhu, J.-Y. Multi-concept customization of text-to-image diffusion. arXiv preprint arXiv:2212.04488.

  10. [10]

    Chain of hindsight aligns language models with feedback

    Liu, H., Sferrazza, C., and Abbeel, P. Chain of hindsight aligns language models with feedback. arXiv preprint arXiv:2302.02676.

  11. [11]

    Compositional Visual Generation with Composable Diffusion Models

    Liu, N., Li, S., Du, Y., Torralba, A., and Tenenbaum, J. B. Compositional visual generation with composable diffusion models. arXiv preprint arXiv:2206.01714, 2022.

  12. [12]

    VIFIDEL: Evaluating the Visual Fidelity of Image Descriptions

    Madhyastha, P., Wang, J., and Specia, L. VIFIDEL: Evaluating the visual fidelity of image descriptions. arXiv preprint arXiv:1907.09340.

  13. [13]

    WebGPT: Browser-assisted question-answering with human feedback

    Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.

  14. [14]

    Training language models to follow instructions with human feedback

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

  15. [15]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.

  16. [16]

    DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

    Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., and Aberman, K. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242.

  17. [17]

    Training Language Models with Language Feedback

    Scheurer, J., Campos, J. A., Chan, J. S., Chen, A., Cho, K., and Perez, E. Training language models with language feedback. arXiv preprint arXiv:2204.14146.

  18. [18]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., and Komatsuzaki, A. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114.

  19. [19]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402.

  20. [20]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

  21. [21]

    Learning to Summarize from Human Feedback

    Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. Learning to summarize from human feedback. arXiv preprint arXiv:2009.01325.

  22. [22]

    Recursively Summarizing Books with Human Feedback

    Wu, J., Ouyang, L., Ziegler, D. M., Stiennon, N., Lowe, R., Leike, J., and Christiano, P. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.

  23. [23]

    CoCa: Contrastive Captioners are Image-Text Foundation Models

    Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., and Wu, Y. CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.

  24. [24]

    Fine-Tuning Language Models from Human Preferences

    Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.

  25. [25]

    Percentage of generated images from our fine-tuned model that are better than (win), tied with, or worse than (lose) the original stable diffusion model with rejection sampling, in terms of image-text alignment and fidelity. D. Experimental Details. Model architecture: for our baseline generative model, we use stable diffusion v1.5 (Rombach et al....

  26. [26]

    FID measurement using MS-COCO dataset

    The model is trained in half-precision on four 40GB NVIDIA A100 GPUs, with a per-GPU batch size of 8, resulting in a total batch size of 512 (256 for pre-training data and 256 for model-generated data). It is trained for a total of 10,000 updates. FID measurement using MS-COCO dataset: we measure FID scores to evaluate the fidelity of different models using...