The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

Adriana Romero-Soriano; Felix Friedrich; Michal Drozdzal; Nicolas Beltran-Velez; Reyhane Askari-Hemmat; Xiaochuang Han; Zhang Xiaofeng

arxiv: 2606.19162 · v1 · pith:626O2D7Nnew · submitted 2026-06-17 · 💻 cs.LG · cs.CV

Nicolas Beltran-Velez, Felix Friedrich, Zhang Xiaofeng, Reyhane Askari-Hemmat, Xiaochuang Han, Adriana Romero-Soriano, Michal Drozdzal

Pith reviewed 2026-06-26 20:51 UTC · model grok-4.3

classification 💻 cs.LG cs.CV

keywords flow matching · discriminator guided RL · generative models · reinforcement learning · image generation · FID · density ratio estimation

The pith

Discriminator-Guided RL uses a data-versus-model classifier logit as reward to steer flow matching models toward the true data distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Score- and flow-matching models rely on regression losses that measure velocity or score error under training marginals, yet these losses align poorly with the visual realism and object coherence that matter at sampling time. The paper claims this mismatch forces practitioners to reach for preference RL even when the goal is simply to recover properties already present in the training data. Discriminator-Guided RL trains a binary discriminator inside a fixed pretrained representation space to separate real data from base-model samples, then uses the discriminator logit directly as the reward in KL-regularized RL. Because the logit of an optimal discriminator equals the log density ratio between data and base-model samples, the reward steers the model distribution toward the data distribution. The resulting models show large drops in guidance-free FID and semantic feature distance across four different flow backbones, plus improved scores on human-preference rewards that the models were never trained on.

Core claim

Discriminator-Guided RL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL; the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution.
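
The likelihood-ratio reading of the logit follows from a standard argument about binary cross-entropy. The sketch below rests on assumptions the abstract does not spell out: balanced classes, a discriminator flexible enough to reach the pointwise optimum, and densities $p_{\text{data}}$ and $p_{\text{base}}$ taken in the pretrained feature space. The symbols $D$, $f$, and $\sigma$ are illustrative notation, not necessarily the paper's.

```latex
% Discriminator D(x) = \sigma(f(x)), where f is the logit and \sigma the sigmoid.
\begin{align*}
\mathcal{L}(D) &= -\,\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
                  - \mathbb{E}_{x \sim p_{\text{base}}}\big[\log\big(1 - D(x)\big)\big], \\
D^{*}(x) &= \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{\text{base}}(x)}
  \quad \text{(pointwise minimizer of } \mathcal{L}\text{)}, \\
f^{*}(x) &= \log\frac{D^{*}(x)}{1 - D^{*}(x)}
          = \log\frac{p_{\text{data}}(x)}{p_{\text{base}}(x)}.
\end{align*}
```

A trained discriminator only approximates $f^{*}$, which is consistent with the abstract's wording that the logit "estimates" the ratio.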

What carries the argument

Discriminator-Guided RL (DRL), which converts the logit of a binary classifier trained on data versus base-model samples into an RL reward signal.

If this is right

Guidance-free FID falls substantially, for example from 9.38 to 2.62 on SiT.
Semantic-space FD improves, for example from 88.2 to 19.3 on DINOv3 features for SiT.
Improvements hold across SiT, JiT, REPA, and RAE backbones.
Human-preference rewards rise even though the method trains on no human labels.
Subsequent preference-based post-training reaches a better Pareto frontier between preference reward and image fidelity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Representation spaces from existing vision models can serve as a proxy for perceptual quality when constructing rewards.
The same logit-reward construction could be tested on other generative training objectives that exhibit a training-inference mismatch.
DRL may reduce the volume of human preference data needed when it is later combined with preference RL.

Load-bearing premise

The logit of the discriminator estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution.

What would settle it

Measure whether the discriminator logit on held-out samples correlates with the true log density ratio between data and model distributions, or whether ablating the pretrained representation space removes the reported FID and FD gains.
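
A toy version of the first check can be sketched in a setting where the true ratio is known in closed form. Everything here is an illustrative assumption rather than the paper's protocol: one-dimensional Gaussians stand in for data and model, a two-parameter logistic regression stands in for the discriminator, and the helper names (`true_log_ratio`, `fit_logistic`, `run`) are hypothetical.

```python
import math
import random

# Toy setup (illustrative assumptions): "data" ~ N(1, 1), "model" ~ N(0, 1).
MU_DATA, MU_MODEL, SIGMA = 1.0, 0.0, 1.0

def true_log_ratio(x):
    """Analytic log(p_data(x) / p_model(x)) for two equal-variance Gaussians."""
    return ((x - MU_MODEL) ** 2 - (x - MU_DATA) ** 2) / (2.0 * SIGMA ** 2)

def fit_logistic(xs, ys, steps=500, lr=0.5):
    """Fit logit(x) = w * x + b by full-batch gradient descent on cross-entropy."""
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted P(label = data)
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def run(n_per_class=1000, n_heldout=500, seed=0):
    rng = random.Random(seed)
    # Balanced training set: label 1 for data samples, 0 for model samples.
    xs = [rng.gauss(MU_DATA, SIGMA) for _ in range(n_per_class)]
    xs += [rng.gauss(MU_MODEL, SIGMA) for _ in range(n_per_class)]
    ys = [1.0] * n_per_class + [0.0] * n_per_class
    w, b = fit_logistic(xs, ys)
    # Held-out model samples: compare the learned logit with the known log ratio.
    held = [rng.gauss(MU_MODEL, SIGMA) for _ in range(n_heldout)]
    mae = sum(abs((w * x + b) - true_log_ratio(x)) for x in held) / n_heldout
    return w, b, mae

w, b, mae = run()
print(f"fitted logit: {w:.3f} * x + {b:.3f}   (analytic: 1.000 * x - 0.500)")
print(f"mean |logit - true log ratio| on held-out model samples: {mae:.3f}")
```

In this linear toy a correlation coefficient alone would be uninformative, because any positively sloped linear logit correlates perfectly with the linear true ratio; the absolute error is the more telling quantity. A real test on image models would need a setting where the density ratio is known or can be estimated independently.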

Original abstract

Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure $\ell_2$ regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., $9.38 \to 2.62$ on SiT) and semantic-space FD (e.g., $88.2 \to 19.3$ on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

DRL shows practical FID gains on flow matching models via a fixed discriminator reward, but the optimality claim does not survive the policy shift during RL.

The core move is training a discriminator once to tell data apart from base-model samples inside a frozen pretrained representation, then feeding its logit straight into KL-regularized RL as the reward. This produces the reported drops in guidance-free FID (9.38 to 2.62 on SiT) and semantic FD across SiT, JiT, REPA, and RAE, plus better human-preference scores without ever training on them.

What is actually new is the specific construction: discriminator logit from a pretrained space used as the fixed reward inside the RL loop for flow-matching correction. The experiments demonstrate that this yields consistent metric lifts on several backbones and improves the Pareto front when preference tuning is applied afterward.

The soft spot is the justification that the logit equals the optimal log-likelihood ratio for targeting the data distribution. That equality holds only against the distribution the discriminator saw during its training. Once RL moves the policy away from the base model, the static logit no longer matches log(p_data / p_current). The abstract gives no sign that the discriminator is retrained or adapted during RL, so the central optimality argument does not go through. The reported numbers are also given without error bars, ablations, or protocol details, which leaves the size and reliability of the gains hard to judge.
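
The policy-shift concern can be made concrete on a three-outcome toy distribution. The sketch assumes the RL objective has the common form E_pi[r] - beta * KL(pi || p_base), with the base model as the KL anchor. The abstract does not state the objective, so this form, the temperature beta, the helper `kl_regularized_optimum`, and all numbers are illustrative assumptions. Under that assumed objective the maximizer has the closed form pi*(x) proportional to p_base(x) * exp(r(x) / beta), which the code evaluates directly.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def kl_regularized_optimum(p_base, reward, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || p_base):
    pi*(x) is proportional to p_base(x) * exp(r(x) / beta)."""
    return normalize([pb * math.exp(r / beta) for pb, r in zip(p_base, reward)])

# Hypothetical three-outcome distributions.
p_data = [0.5, 0.3, 0.2]
p_base = [0.2, 0.3, 0.5]

# Static reward: the log ratio against the base model, fixed before RL starts.
reward = [math.log(pd / pb) for pd, pb in zip(p_data, p_base)]

pi_beta1 = kl_regularized_optimum(p_base, reward, beta=1.0)
pi_beta2 = kl_regularized_optimum(p_base, reward, beta=2.0)

# Once the policy has moved to pi_beta2, compare the static reward with the
# log ratio against the *current* policy, log(p_data / pi).
current_log_ratio = [math.log(pd / pi) for pd, pi in zip(p_data, pi_beta2)]
gap = [c - r for c, r in zip(current_log_ratio, reward)]

print("optimum at beta=1:", [round(p, 4) for p in pi_beta1])
print("optimum at beta=2:", [round(p, 4) for p in pi_beta2])
print("log(p_data/pi) minus static reward at pi_beta2:", [round(g, 4) for g in gap])
```

Two things show up in this toy. The static reward does stop matching log(p_data / pi) once the policy leaves the base model, which is the mismatch described above. At the same time, under this particular base-anchored objective with beta = 1, the closed-form optimum coincides with p_data, while other values of beta give a tempered interpolation between base and data. Whether the paper's argument takes this route cannot be determined from the abstract.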

This is for groups already running flow-matching or diffusion pipelines who want a data-only signal to reduce reliance on human labels. It deserves a serious referee because the empirical pattern is worth checking and the method is simple enough to test, but the review will need to press on whether the theoretical framing survives the distribution shift.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Discriminator-Guided RL (DRL) to address a structural mismatch in score- and flow-matching models, where matching losses are poorly aligned with inference-time sample quality. DRL trains a discriminator in a pretrained representation space to separate real data from base-model samples, then uses the discriminator logit as a fixed reward inside KL-regularized RL. The central claim is that this logit provides the optimal reward for targeting the data distribution. Experiments report large gains in guidance-free FID (e.g., 9.38→2.62 on SiT) and semantic FD across SiT, JiT, REPA, and RAE backbones, plus improved human-preference scores and a better Pareto frontier under subsequent preference post-training.

Significance. If the empirical gains and the optimality justification hold, the work shows that a data-derived discriminator reward can recover visual and semantic properties that matching losses fail to capture, while also improving alignment without direct human-preference training. The multi-backbone consistency and the reported improvement in the preference-fidelity trade-off would be notable contributions to post-training of generative models.

major comments (2)

[Abstract] Abstract (paragraph on DRL): The claim that 'the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution' is load-bearing for the assertion that DRL directly targets the data distribution without human preferences. The discriminator is trained only on base-model samples and remains fixed during RL; once the policy updates, the current model distribution diverges from the base model, so the static logit no longer equals log(p_data / p_current). The manuscript must either derive why the fixed reward remains optimal or provide an alternative justification that does not rely on this equality.
[Abstract] Abstract (experimental claims): The reported FID and FD reductions (e.g., 9.38→2.62 on SiT, 88.2→19.3 on DINOv3) are central to the contribution, yet the abstract provides no error bars, number of runs, or ablation details on discriminator training or RL hyperparameters. Full experimental tables and protocols are required to assess whether the gains are robust and attributable to the proposed reward rather than other factors.

minor comments (2)

The abstract refers to 'pretrained representation space' without naming the specific space or backbone used for the discriminator; this should be stated explicitly in the method section.
Notation for the discriminator logit and the KL-regularized objective should be introduced with equation numbers in the main text rather than only in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below with the strongest honest defense, indicating planned revisions where appropriate.

Referee: [Abstract] Abstract (paragraph on DRL): The claim that 'the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution' is load-bearing for the assertion that DRL directly targets the data distribution without human preferences. The discriminator is trained only on base-model samples and remains fixed during RL; once the policy updates, the current model distribution diverges from the base model, so the static logit no longer equals log(p_data / p_current). The manuscript must either derive why the fixed reward remains optimal or provide an alternative justification that does not rely on this equality.

Authors: We accept the referee's point: the optimal logit equals log(p_data / p_base), and that equality does not carry over to p_current once the policy has been updated. The KL regularization in the RL objective keeps the policy close to the base model, which preserves the reward's usefulness. As a justification that does not depend on the exact equality, the discriminator operates in a fixed pretrained representation space that captures perceptually relevant directions, so the reward is a stable, data-derived objective that RL can optimize directly on model samples. We will revise the abstract to remove the load-bearing phrasing and add a derivation in the main text covering both the KL-proximity argument and the representation-space stability. revision: partial
Referee: [Abstract] Abstract (experimental claims): The reported FID and FD reductions (e.g., 9.38→2.62 on SiT, 88.2→19.3 on DINOv3) are central to the contribution, yet the abstract provides no error bars, number of runs, or ablation details on discriminator training or RL hyperparameters. Full experimental tables and protocols are required to assess whether the gains are robust and attributable to the proposed reward rather than other factors.

Authors: We agree that the abstract should convey experimental robustness. The full manuscript reports results over multiple independent runs with standard deviations and includes ablations on discriminator training and RL hyperparameters in the appendix. We will revise the abstract to note the number of runs and reference the full protocols and tables. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper trains a separate discriminator on held-out data versus fixed base-model samples, then uses its logit, read as an estimate of log(p_data / p_base), as a static reward inside KL-regularized RL. This construction does not define the reward in terms of any fitted parameter of the evolving policy, nor does it rename a fitted quantity as a prediction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the central step. Empirical gains (FID, FD) are measured against external benchmarks and do not reduce to the input data by construction. The optimality justification is a standard likelihood-ratio argument applied once to the base distribution; it is not internally circular even if its validity after policy shift is debatable on other grounds.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on the domain assumption that a discriminator logit in pretrained space supplies an optimal reward; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

Domain assumption: the pretrained representation space restricts the discriminator to perceptually meaningful directions.
The abstract states that this property lets the discriminator focus on visual and semantic quality.

pith-pipeline@v0.9.1-grok · 5891 in / 1259 out tokens · 42124 ms · 2026-06-26T20:51:54.717590+00:00 · methodology


[21] [26]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Rethinking the inception architecture for computer vision , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

[22] [27]

Xu, Jiazheng and Liu, Xiao and Wu, Yuchen and Tong, Yuxuan and Li, Qinkai and Ding, Ming and Tang, Jie and Dong, Yuxiao , booktitle=

[23] [28]

Kirstain, Yuval and Polyak, Adam and Singer, Uriel and Matiana, Shahbuland and Penna, Joe and Levy, Omer , booktitle=

[24] [29]

Schuhmann, Christoph and Beaumont, Romain and Vencu, Richard and Gordon, Cade and Wightman, Ross and Cherti, Mehdi and Coombes, Theo and Katta, Aarush and Mullis, Clayton and Wortsman, Mitchell and Schramowski, Patrick and Kundurthy, Srivatsa and Crowson, Katherine and Schmidt, Ludwig and Kaczmarczyk, Robert and Jitsev, Jenia , booktitle=

[25] [30]

2024 , howpublished =

Aesthetic Predictor. 2024 , howpublished =

2024

[26] [32]

Proceedings of the thirteenth international conference on artificial intelligence and statistics , pages=

Efficient reductions for imitation learning , author=. Proceedings of the thirteenth international conference on artificial intelligence and statistics , pages=. 2010 , organization=

2010

[27] [33]

Proceedings of the fourteenth international conference on artificial intelligence and statistics , pages=

A reduction of imitation learning and structured prediction to no-regret online learning , author=. Proceedings of the fourteenth international conference on artificial intelligence and statistics , pages=. 2011 , organization=

2011

[28] [34]

Journal of the American Statistical Association , volume=

Convexity, classification, and risk bounds , author=. Journal of the American Statistical Association , volume=. 2006 , publisher=

2006

[29] [35]

The Journal of Machine Learning Research , volume=

Composite binary losses , author=. The Journal of Machine Learning Research , volume=. 2010 , publisher=

2010

[30] [36]

Journal of the American Statistical Association , volume=

Strictly proper scoring rules, prediction, and estimation , author=. Journal of the American Statistical Association , volume=. 2007 , publisher=

2007

[31] [37]

arXiv preprint arXiv:2401.11237 , year=

Closing the Gap between TD Learning and Supervised Learning--A Generalisation Point of View , author=. arXiv preprint arXiv:2401.11237 , year=

arXiv

[32] [42]

The Twelfth International Conference on Learning Representations , year=

Training Diffusion Models with Reinforcement Learning , author=. The Twelfth International Conference on Learning Representations , year=

[33] [43]

2025 , eprint=

Data-regularized Reinforcement Learning for Diffusion Models at Scale , author=. 2025 , eprint=

2025

[34] [45]

Fan, Ying and Watkins, Olivia and Du, Yuqing and Liu, Hao and Ryu, Moonkyung and Boutilier, Craig and Abbeel, Pieter and Ghavamzadeh, Mohammad and Lee, Kangwook and Lee, Kimin , journal=

[35] [46]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Diffusion model alignment using direct preference optimization , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

[36] [47]

arXiv preprint arXiv:2310.03739 , year=

Aligning Text-to-Image Diffusion Models with Reward Backpropagation , author=. arXiv preprint arXiv:2310.03739 , year=

arXiv

[37] [48]

arXiv preprint arXiv:2309.17400 , year=

Directly fine-tuning diffusion models on differentiable rewards , author=. arXiv preprint arXiv:2309.17400 , year=

Pith/arXiv arXiv

[38] [49]

Advances in Neural Information Processing Systems , volume=

Generating images with perceptual similarity metrics based on deep networks , author=. Advances in Neural Information Processing Systems , volume=

[39] [50]

European Conference on Computer Vision (ECCV) , pages=

Perceptual losses for real-time style transfer and super-resolution , author=. European Conference on Computer Vision (ECCV) , pages=

[40] [51]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Adding conditional control to text-to-image diffusion models , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

[41] [53]

European Conference on Computer Vision , pages=

Adversarial diffusion distillation , author=. European Conference on Computer Vision , pages=. 2024 , organization=

2024

[42] [54]

Xu, Yanwu and Zhao, Yang and Xiao, Zhisheng and Hou, Tingbo , booktitle=

[43] [55]

Advances in neural information processing systems , year=

Improved distribution matching distillation for fast image synthesis , author=. Advances in neural information processing systems , year=

[44] [56]

Tackling the generative learning trilemma with denoising diffusion

Xiao, Zhisheng and Kreis, Karsten and Vahdat, Arash , journal=. Tackling the generative learning trilemma with denoising diffusion

[45] [57]

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , year=

The unreasonable effectiveness of deep features as a perceptual metric , author=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , year=

[46] [58]

Advances in Neural Information Processing Systems , year=

Generative adversarial nets , author=. Advances in Neural Information Processing Systems , year=

[47] [59]

Advances in Neural Information Processing Systems , year=

f-GAN: Training generative neural samplers using variational divergence minimization , author=. Advances in Neural Information Processing Systems , year=

[48] [60]

IEEE Transactions on Information Theory , year=

Estimating divergence functionals and the likelihood ratio by convex risk minimization , author=. IEEE Transactions on Information Theory , year=

[49] [61]

Density Ratio Estimation in Machine Learning , author=

[50] [62]

Advances in Neural Information Processing Systems , year=

Generative adversarial imitation learning , author=. Advances in Neural Information Processing Systems , year=

[51] [63]

International Conference on Learning Representations (ICLR) , year=

Learning robust rewards with adversarial inverse reinforcement learning , author=. International Conference on Learning Representations (ICLR) , year=

[52] [65]

Advances in neural information processing systems , year=

Denoising diffusion probabilistic models , author=. Advances in neural information processing systems , year=

[53] [67]

, author=

Estimation of non-normalized statistical models by score matching. , author=. Journal of Machine Learning Research , year=

[54] [68]

Neural computation , year=

A connection between score matching and denoising autoencoders , author=. Neural computation , year=

[55] [69]

Journal of the American Statistical Association , year=

Tweedie’s formula and selection bias , author=. Journal of the American Statistical Association , year=

[56] [70]

Theory of Probability & Its Applications , year=

On transforming a certain class of stochastic processes by absolutely continuous substitution of measures , author=. Theory of Probability & Its Applications , year=

[57] [71]

2010 , month=

Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy , author=. 2010 , month=

2010

[58] [72]

Machine learning , year=

Simple statistical gradient-following algorithms for connectionist reinforcement learning , author=. Machine learning , year=

[59] [75]

Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, YK and Wu, Yang and others , journal=

[60] [77]

Advances in neural information processing systems , year=

Elucidating the design space of diffusion-based generative models , author=. Advances in neural information processing systems , year=

[61] [78]

Advances in Neural Information Processing Systems , year=

Guiding a diffusion model with a bad version of itself , author=. Advances in Neural Information Processing Systems , year=

[62] [79]

V. I. Arnold , title =. 1978 , address =

1978

[63] [80]

, title =

Arnold, Vladimir I. , title =

[64] [81]

TMLR , year =

Flow map matching with stochastic interpolants: A mathematical framework for consistency models , author =. TMLR , year =

[65] [82]

2017 , organization=

Laskey, Michael and Lee, Jonathan and Fox, Roy and Dragan, Anca and Goldberg, Ken , booktitle=. 2017 , organization=

2017

[66] [83]

Advances in Neural Information Processing Systems , year=

Toward the fundamental limits of imitation learning , author=. Advances in Neural Information Processing Systems , year=

[67] [84]

International Conference on Machine Learning , pages=

Of moments and matching: A game-theoretic framework for closing the imitation gap , author=. International Conference on Machine Learning , pages=. 2021 , organization=

2021

[68] [85]

Proceedings of the twenty-first international conference on Machine learning , year=

Apprenticeship learning via inverse reinforcement learning , author=. Proceedings of the twenty-first international conference on Machine learning , year=

[69] [86]

Maximum Entropy Inverse Reinforcement Learning , author=. Proc. AAAI , pages=

[70] [87]

International Conference on Machine Learning (ICML) , pages=

Algorithms for Inverse Reinforcement Learning , author=. International Conference on Machine Learning (ICML) , pages=

[71] [89]

Online Reward-Weighted Fine-Tuning of Flow Matching with

Fan, Jiajun and Shen, Shuaike and Cheng, Chaoran and Chen, Yuxin and Liang, Chumeng and Liu, Ge , booktitle=. Online Reward-Weighted Fine-Tuning of Flow Matching with

[72] [90]

Proceedings of the 24th international conference on Machine learning , pages=

Reinforcement learning by reward-weighted regression for operational space control , author=. Proceedings of the 24th international conference on Machine learning , pages=

[73] [92]

The International Conference on Learning Representations (ICLR) , year=

A distributional approach to controlled text generation , author=. The International Conference on Learning Representations (ICLR) , year=

[74] [94]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[75] [95]

Apprenticeship learning via inverse reinforcement learning

Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, 2004

2004

[76] [96]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations (ICLR), 2024

2024

[77] [97]

Building normalizing flows with stochastic interpolants

Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In Proceedings of the International Conference on Learning Representations (ICLR), 2023

2023

[78] [98]

V. I. Arnold. Ordinary Differential Equations. MIT Press, Cambridge, MA, 1978

1978

[79] [99]

Training diffusion models with reinforcement learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024. https://openreview.net/forum?id=YCWjhGrJFD

2024

[80] [100]

Flow map matching with stochastic interpolants: A mathematical framework for consistency models

Nicholas M Boffi, Michael S Albergo, and Eric Vanden-Eijnden. Flow map matching with stochastic interpolants: A mathematical framework for consistency models. TMLR, 2025

2025