RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

Cheng Tan; Huan Wang; Luyuan Zhang; Siyong Jian; Siyuan Li; Xin Jin; Ying Li; Zedong Wang

arxiv: 2605.21195 · v1 · pith:3ESQ4MXTnew · submitted 2026-05-20 · 💻 cs.CV

Siyong Jian, Siyuan Li, Luyuan Zhang, Zedong Wang, Xin Jin, Ying Li, Cheng Tan, Huan Wang

Pith reviewed 2026-05-21 05:44 UTC · model grok-4.3

classification 💻 cs.CV

keywords: text-to-image generation · discrete autoregressive models · post-training · decoder co-evolution · latent covariate shift · ranking-based optimization · FID improvement · CLIP score

The pith

Policy-only optimization in discrete text-to-image models creates a token-distribution mismatch with the frozen decoder that improves rewards while harming image quality, but co-evolving both resolves the trade-off.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Discrete autoregressive text-to-image models pair a VQ tokenizer with an AR policy and currently optimize only the policy while keeping the decoder frozen. This policy-only approach induces latent covariate shift as the evolving token distribution diverges from the decoder's original training data, so reward scores rise while decoded image quality falls. RankE instead alternates optimization between the two modules so each maximizes a ranking-based alignment objective while staying regularized by a stability anchor. On LlamaGen-XL the method simultaneously raises CLIP score and lowers FID; the same pattern holds on Janus-Pro. The result converts reward optimization into measurable pixel-space gains rather than trading one for the other.

Core claim

Policy-only post-training induces latent covariate shift between the policy's token distribution and the decoder's original training distribution, producing higher reward metrics but lower decoded image quality; RankE eliminates this mismatch by co-evolving both components through alternating optimization in which each module maximizes a ranking-based alignment objective while being regularized by a stability-preserving anchor suited to its parameter space.
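
To make "ranking-based alignment objective" more concrete, here is a minimal sketch of one common ranking loss: a pairwise logistic (Bradley-Terry style) loss over reward-ranked samples. This page does not give the paper's actual formula, so the loss form, the function name `pairwise_ranking_loss`, and its inputs are assumptions chosen for illustration only.

```python
import math

def pairwise_ranking_loss(model_scores, rewards):
    """Average logistic loss over all sample pairs ordered by reward.

    model_scores: the model's own score per sample (e.g. a log-likelihood).
    rewards: external reward per sample; used only to decide which
             sample in a pair should rank higher.
    NOTE: hypothetical illustration, not the objective used by RankE.
    """
    total, pairs = 0.0, 0
    n = len(model_scores)
    for i in range(n):
        for j in range(n):
            if rewards[i] > rewards[j]:  # sample i should outrank sample j
                margin = model_scores[i] - model_scores[j]
                if margin > -30.0:
                    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
                    total += math.log1p(math.exp(-margin))
                else:
                    # for very negative margins the loss is ~ -margin
                    total += -margin
                pairs += 1
    return total / pairs if pairs else 0.0
```

The loss is small when the model scores the higher-reward sample above the lower-reward one and grows as that ordering is violated, which is the general behavior a ranking objective is meant to have.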

What carries the argument

Alternating optimization of policy and decoder, each maximizing a ranking-based alignment objective while regularized by a stability-preserving anchor.
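
The alternating structure can be pictured with a deliberately tiny toy, sketched below under strong assumptions. Two scalars stand in for the policy and the decoder, each updated in turn while the other is held fixed, and each carrying an L2 anchor toward its starting value. The objectives, the anchor form, and every name (`alternate`, `anchor`, `update_decoder`) are hypothetical; nothing here models tokens, images, rewards, or RankE's actual losses.

```python
def alternate(steps, lr=0.1, anchor=0.1, update_decoder=True):
    """Toy block-coordinate ascent on two scalars.

    p: stand-in for the policy, pulled toward a reward optimum at 1.0.
    d: stand-in for the decoder, which does best when it matches p.
    Both start at 0.0 and carry an L2 anchor toward that start.
    NOTE: hypothetical toy, not RankE's objectives or anchors.
    """
    p0 = d0 = 0.0
    p, d = p0, d0
    for _ in range(steps):
        # Policy step (decoder fixed): ascend -(p - 1)^2 - anchor * (p - p0)^2.
        grad_p = -2.0 * (p - 1.0) - 2.0 * anchor * (p - p0)
        p += lr * grad_p
        if update_decoder:
            # Decoder step (policy fixed): ascend -(d - p)^2 - anchor * (d - d0)^2.
            grad_d = -2.0 * (d - p) - 2.0 * anchor * (d - d0)
            d += lr * grad_d
    return p, d, abs(p - d)  # last value: policy/decoder mismatch

frozen = alternate(200, update_decoder=False)
co_evolved = alternate(200)
print("mismatch, frozen decoder:", round(frozen[2], 3))
print("mismatch, co-evolved:    ", round(co_evolved[2], 3))
```

In this toy the frozen decoder stays at its starting point while the policy moves away from it, so the mismatch grows; letting the decoder take its own anchored steps keeps the two close. That mirrors the shape of the argument only, not its empirical content.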

If this is right

Standard RL improves CLIP but degrades FID, while RankE improves both simultaneously.
On LlamaGen-XL (775M) the method reaches FID 15.21 and CLIP 33.76 on MS-COCO 30K.
Consistent gains appear on Janus-Pro (1B) as well.
Reward optimization is converted directly into pixel-space quality improvements instead of a trade-off.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same co-evolution pattern could be tested on discrete autoregressive models for video or audio to check whether token-distribution mismatch is a general problem.
If stability anchors prove unnecessary in later work, full joint gradient updates between policy and decoder might become feasible.
The ranking objective used here might be replaced by other preference signals without changing the core need for decoder updates.

Load-bearing premise

The observed drop in decoded image quality after policy-only optimization is caused by divergence between the new token distribution and the decoder's original training distribution, and alternating co-evolution with anchors can correct the mismatch without introducing fresh instabilities or overfitting.

What would settle it

Sample a large set of tokens from the fully optimized policy, train a fresh decoder on those tokens alone, and measure whether the resulting FID matches or exceeds the co-evolved decoder's FID on the same prompts.

Figures

Figures reproduced from arXiv: 2605.21195 by Cheng Tan, Huan Wang, Luyuan Zhang, Siyong Jian, Siyuan Li, Xin Jin, Ying Li, Zedong Wang.

**Figure 1.** Latent Covariate Shift intensifies under RL and is mitigated by RankE. Left: KL divergence between each model’s VQ token distribution and that of 5,000 real MS-COCO images; the dashed line marks Real, a natural-variation lower bound computed from two independently sampled real-image sets encoded by the same frozen tokenizer. The shift grows progressively across pre-training, SFT, and RL; standard RL widens …

**Figure 2.** Comparison of existing AR post-training and RankE framework.

**Figure 3.** Overview of the RankE co-evolution framework.

**Figure 4.** Evolution of generation metrics over training steps. Comparison of RankE against a Standard RL baseline with a frozen decoder. While Standard RL achieves marginal improvements in alignment, it suffers from stagnant or degrading visual fidelity (b) as the frozen decoder cannot adapt to policy-induced latent drift. In contrast, the co-evolution mechanism of RankE effectively translates reward optimization in…

**Figure 5.** Visualization of T2I generation. RankE yields precise attributes and details according to …

**Figure 6.** Latent Covariate Shift and token entropy during training.
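
The Figure 1 caption describes the shift as a KL divergence between VQ token distributions. A minimal sketch of such a measurement is shown below, assuming tokens are available as flat lists of integer codebook indices in the range 0 to `vocab_size - 1`. The additive smoothing, the direction of the KL, and the function name `token_kl` are assumptions; the paper's exact estimator is not given on this page.

```python
import math
from collections import Counter

def token_kl(p_tokens, q_tokens, vocab_size, eps=1e-6):
    """KL(P || Q) between two smoothed token-frequency histograms.

    p_tokens, q_tokens: flat lists of integer codebook indices.
    vocab_size: number of codebook entries.
    eps: additive smoothing so empty bins never produce log(0).
    NOTE: smoothing scheme and KL direction are assumptions.
    """
    p_counts, q_counts = Counter(p_tokens), Counter(q_tokens)
    p_total = len(p_tokens) + eps * vocab_size
    q_total = len(q_tokens) + eps * vocab_size
    kl = 0.0
    for k in range(vocab_size):
        p = (p_counts[k] + eps) / p_total
        q = (q_counts[k] + eps) / q_total
        kl += p * math.log(p / q)
    return kl
```

Following the caption, a natural-variation reference point would come from calling the same function on two independently sampled sets of real-image tokens, so that model-versus-real values can be read against real-versus-real values.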

Original abstract

Discrete autoregressive (AR) text-to-image (T2I) models pair a VQ tokenizer with an AR policy, and current post-training pipelines optimize only the policy while keeping the VQ decoder frozen. Recent diffusion T2I work, exemplified by REPA-E, has shown that the VAE itself constitutes a key alignment bottleneck, yet no analogous investigation exists for discrete AR models. We show that policy-only optimization induces Latent Covariate Shift: as the policy evolves, the resulting token distribution diverges from the ground-truth distribution on which the decoder was trained, such that reward scores improve while decoded image quality degrades. To address this mismatch, we propose RankE, the first end-to-end post-training framework for discrete T2I generation. Rather than optimizing the policy against a fixed decoder, RankE co-evolves both components through alternating optimization: each module maximizes a ranking-based alignment objective while being regularized by a stability-preserving anchor suited to its parameter space. This co-evolution breaks the fidelity--alignment trade-off that plagues frozen-decoder approaches: on LlamaGen-XL (775M), standard RL improves CLIP but degrades FID, whereas RankE improves both simultaneously (FID 15.21, CLIP 33.76 on MS-COCO 30K). Consistent gains on Janus-Pro (1B) confirm that decoder co-evolution reliably converts reward optimization into pixel-space quality improvements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RankE proposes decoder co-evolution to fix the alignment-fidelity trade-off in discrete AR T2I post-training, with reported gains on both CLIP and FID, but the causal role of latent covariate shift lacks direct measurements or isolating ablations.

RankE is presented as the first end-to-end post-training setup for discrete autoregressive text-to-image models. Instead of freezing the VQ decoder and updating only the policy, it alternates optimization between the two, using a ranking-based alignment objective plus stability anchors tailored to each module. On LlamaGen-XL the method reaches FID 15.21 and CLIP 33.76 on MS-COCO 30K, improving both, whereas standard RL lifts CLIP only at the expense of FID; similar patterns hold on Janus-Pro. That contrast is the clearest empirical contribution here.

Referee Report

3 major / 2 minor

Summary. The paper claims that policy-only post-training of discrete autoregressive text-to-image models induces latent covariate shift between the evolving token distribution and the frozen VQ decoder, improving alignment (CLIP) at the expense of pixel quality (FID). RankE addresses this via alternating optimization that co-evolves the policy and decoder using ranking-based alignment objectives regularized by stability anchors. On LlamaGen-XL (775M) it reports simultaneous gains (FID 15.21, CLIP 33.76 on MS-COCO 30K) versus standard RL, with consistent results on Janus-Pro (1B).

Significance. If the reported metric improvements are robust and the covariate-shift mechanism is validated, the work offers a practical route to breaking the fidelity-alignment trade-off in discrete T2I post-training. The alternating co-evolution strategy with ranking objectives and stability anchors is a concrete contribution that could generalize beyond the tested models. However, the significance is limited by the absence of direct mechanistic evidence or isolating ablations, leaving the justification for decoder co-evolution as the necessary remedy open to alternative explanations.

major comments (3)

[Abstract and §3] The central claim that policy-only RL induces latent covariate shift (and that this is the cause of FID degradation) is presented without quantitative characterization of the shift, such as token-histogram divergence, per-layer activation statistics, or reconstruction error on policy-generated samples. No ablation isolates decoder co-evolution from other effects of the alternating schedule.
[Experiments, results on LlamaGen-XL and Janus-Pro] The reported FID and CLIP numbers lack error bars, number of random seeds, or statistical significance tests. It is therefore unclear whether the simultaneous improvement over standard RL is reliable or sensitive to hyper-parameter choices.
[§4, ablation studies] The manuscript provides no ablation that removes the stability anchors or the ranking objective individually while keeping the alternating schedule, making it impossible to attribute gains specifically to decoder co-evolution rather than increased optimization capacity or regularization.

minor comments (2)

[§3.2] The precise mathematical form of the stability-preserving anchor for the decoder (versus the policy) is only sketched; an explicit equation in the main text would improve reproducibility.
[Figure 5] The qualitative examples would benefit from side-by-side comparison with the standard RL baseline at matched CLIP score to illustrate the claimed quality difference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, proposing specific revisions to strengthen the manuscript while maintaining the integrity of our claims.

Referee: [Abstract and §3] The central claim that policy-only RL induces latent covariate shift (and that this is the cause of FID degradation) is presented without quantitative characterization of the shift, such as token-histogram divergence, per-layer activation statistics, or reconstruction error on policy-generated samples. No ablation isolates decoder co-evolution from other effects of the alternating schedule.

Authors: We agree that direct quantitative evidence of the covariate shift would strengthen the central claim. In the revised manuscript, we will add token-histogram KL divergence and reconstruction error metrics on policy-generated samples versus training data in §3. For isolating decoder co-evolution, we will include a new ablation comparing the full alternating RankE against a variant that performs alternating updates but freezes the decoder after initial steps. While perfect isolation is inherently limited by the coupled optimization, this ablation will clarify the contribution of decoder evolution beyond the alternating schedule alone. revision: yes
Referee: The reported FID and CLIP numbers lack error bars, number of random seeds, or statistical significance tests. It is therefore unclear whether the simultaneous improvement over standard RL is reliable or sensitive to hyper-parameter choices.

Authors: We acknowledge this limitation in statistical reporting. In the revision, we will rerun the main results on LlamaGen-XL and Janus-Pro using at least three random seeds, reporting means and standard deviations for FID and CLIP. We will also add a brief discussion of hyperparameter sensitivity based on our existing tuning logs to address reliability concerns. revision: yes
Referee: The manuscript provides no ablation that removes the stability anchors or the ranking objective individually while keeping the alternating schedule, making it impossible to attribute gains specifically to decoder co-evolution rather than increased optimization capacity or regularization.

Authors: We thank the referee for highlighting this gap. In the updated §4, we will add two targeted ablations while preserving the alternating schedule: (1) disabling stability anchors, and (2) replacing the ranking objective with standard RL loss. These will help attribute performance gains more precisely to decoder co-evolution versus other regularization or optimization effects. revision: yes
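
The multi-seed reporting promised in the second response amounts to a mean and a sample standard deviation per metric. A minimal sketch using Python's standard library follows; the helper name `summarize` is hypothetical and the input values are placeholders, not results from the paper.

```python
from statistics import mean, stdev

def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed metric values."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds for a standard deviation")
    return mean(scores), stdev(scores)

# Placeholder values only; these are NOT results from the paper.
m, s = summarize([1.0, 2.0, 3.0])
print(f"{m:.2f} ± {s:.2f}")  # prints "2.00 ± 1.00"
```

With only three seeds the standard deviation is itself a noisy estimate, so it is better read as a rough spread than as a significance test.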

Circularity Check

0 steps flagged

No significant circularity; empirical observation and method proposal remain self-contained

The paper motivates RankE from an empirical observation that policy-only RL on discrete AR T2I models improves reward metrics while degrading FID, attributing this to an induced divergence between the evolving token distribution and the decoder's original training support. This observation is presented as a measured phenomenon rather than derived from equations that presuppose the conclusion. The proposed alternating optimization with ranking objectives and stability anchors is introduced as a direct response to the observed mismatch, with performance gains demonstrated via direct comparison to frozen-decoder RL baselines on LlamaGen-XL and Janus-Pro. No load-bearing step reduces a prediction to a fitted parameter by construction, invokes a self-citation as an unverified uniqueness theorem, or renames a known result under new coordinates. The derivation chain is therefore grounded in external experimental contrasts rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of latent covariate shift as the cause of quality degradation and on the effectiveness of alternating optimization with per-module stability anchors; no explicit free parameters or invented entities are named in the abstract.

axioms (2)

domain assumption: Policy-only optimization induces a divergence between generated token distribution and the decoder's original training distribution.
Invoked to explain why reward improves while decoded quality degrades.
ad hoc to paper: Alternating optimization with ranking objectives and stability anchors can break the fidelity-alignment trade-off without introducing instability.
Core premise of the RankE method.

pith-pipeline@v0.9.0 · 5812 in / 1377 out tokens · 22687 ms · 2026-05-21T05:44:23.025566+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "policy-only optimization induces Latent Covariate Shift... RankE co-evolves both components through alternating optimization"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "Generalized EM interpretation... stability-preserving anchor"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 13 internal anchors

[1]

Scheduled sampling for sequence prediction with recurrent neural networks

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. InNeurIPS, 2015. 2

work page 2015
[2]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation.arXiv preprint arXiv:1308.3432, 2013. 2, 3, 15

work page internal anchor Pith review Pith/arXiv arXiv 2013
[3]

Improving image generation with better captions.Computer Science, 2023

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions.Computer Science, 2023. 21

work page 2023
[4]

Training diffusion models with reinforcement learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. InICLR, 2024. 3

work page 2024
[5]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

MaskGIT: Masked generative image transformer

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. MaskGIT: Masked generative image transformer. InCVPR, 2022. 15

work page 2022
[7]

Muse: Text-to-image generation via masked generative transformers

Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. InICML, 2023. 15

work page 2023
[8]

Softvq-vae: Efficient 1-dimensional continuous tokenizer

Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, and Emad Barsoum. Softvq-vae: Efficient 1-dimensional continuous tokenizer. InCVPR, 2025. 15

work page 2025
[9]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025. 21

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025. 1, 3, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Directly fine-tuning diffusion models on differentiable rewards

Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable rewards. InICLR, 2024. 2, 3, 5, 7

work page 2024
[12]

Reward model ensembles help mitigate overoptimization

Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward model ensembles help mitigate overoptimization. InICLR, 2024. 4

work page 2024
[13]

Maximum likelihood from incomplete data via the EM algorithm.Journal of the Royal Statistical Society: Series B, 1977

Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm.Journal of the Royal Statistical Society: Series B, 1977. 17

work page 1977
[14]

CogView: Mastering text-to-image generation via transformers

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. CogView: Mastering text-to-image generation via transformers. In NeurIPS, 2021. 15

work page 2021
[15]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InCVPR, 2021. 1, 6, 15

work page 2021
[16]

Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mo- hammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. InNeurIPS, 2024. 3, 7

work page 2024
[17]

Scaling laws for reward model overoptimization

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. InICML,

work page
[18]

GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Alexander Schwing. GenEval: An object-focused framework for evaluating text-to-image alignment.arXiv preprint arXiv:2310.11513, 2023. 6, 21

work page arXiv 2023
[19]

Generative adversarial nets

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. InNeurIPS, 2014. 5, 17

work page 2014
[20]

Bootstrap your own latent: A new approach to self-supervised learning

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. InNeurIPS, 2020. 6

work page 2020
[21]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. InNeurIPS, 2017. 6 11

work page 2017
[22]

Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks.ICML, 2023

Minyoung Huh, Brian Cheung, Pulkit Agrawal, and Phillip Isola. Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks.ICML, 2023. 2, 16

work page 2023
[23]

Image-to-image translation with conditional adversarial networks

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. InCVPR, 2017. 20

work page 2017
[24]

Categorical reparameterization with Gumbel-softmax

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. InICLR,

work page
[25]

T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot.arXiv preprint arXiv:2505.00703, 2025

Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, and Hongsheng Li. T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot.arXiv preprint arXiv:2505.00703, 2025. 1, 3

work page arXiv 2025
[26]

Fast decoding in sequence models using discrete latent variables

Łukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Parmar, Samy Bengio, Jakob Unkber, and Noam Shazeer. Fast decoding in sequence models using discrete latent variables. InICML, 2018. 2, 16

work page 2018
[27]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020. 1, 15

work page internal anchor Pith review Pith/arXiv arXiv 2001
[28]

Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 2017

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 2017. 6

work page 2017
[29]

Rl with kl penalties is better viewed as bayesian inference

Tomasz Korbak, Hady Elsahar, Germán Kruszewski, and Marc Dymetman. Rl with kl penalties is better viewed as bayesian inference. InEMNLP, 2022. 3, 4

work page 2022
[30]

Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. InCVPR, 2025. 1, 2, 3, 7, 16

work page 2025
[31]

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review.arXiv preprint arXiv:1805.00909, 2018. 3, 4, 16

work page internal anchor Pith review Pith/arXiv arXiv 2018
[32]

Mergevq: A unified framework for visual generation and representation with disentangled token merging and quantization

Siyuan Li, Luyuan Zhang, Zedong Wang, Juanxi Tian, Cheng Tan, Zicheng Liu, Chang Yu, Qingsong Xie, Haonan Lu, Haoqian Wang, and Zhen Lei. Mergevq: A unified framework for visual generation and representation with disentangled token merging and quantization. InCVPR, 2025. 15

work page 2025
[33]

Va-π: Variational policy alignment for pixel-aware autoregressive generation

Xinyao Liao, Qiyuan He, Kai Xu, Xiaoye Qu, Yicong Li, Wei Wei, and Angela Yao. Va-π: Variational policy alignment for pixel-aware autoregressive generation. InCVPR, 2026. 1, 3, 7

work page 2026
[34]

Microsoft COCO: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. InECCV, 2014. 6, 18, 21

work page 2014
[35]

Flow matching for generative modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. InICLR, 2023. 7

work page 2023
[36]

Flow-grpo: Training flow matching models via online rl

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. InNeurIPS, 2025. 7

work page 2025
[37]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019. 20

work page 2019
[38]

Open-magvit2: An open-source project toward democratizing auto-regressive visual gener- ation

Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open- source project toward democratizing auto-regressive visual generation.arXiv preprint arXiv:2409.04410,

work page arXiv
[39]

A view of the em algorithm that justifies incremental, sparse, and other variants

Radford M Neal and Geoffrey E Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. InLearning in graphical models, pages 355–368. Springer, 1998. 5, 17

work page 1998
[40]

Training language models to follow instructions with human feedback.NeurIPS, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.NeurIPS, 2022. 10

work page 2022
[41]

Freeman, and Yu-Xiong Wang

Ziqi Pang, Tianyuan Zhang, Fujun Luan, Yunze Man, Hao Tan, Kai Zhang, William T. Freeman, and Yu-Xiong Wang. Randar: Decoder-only autoregressive visual generation in random orders. InCVPR, 2024. 15

work page 2024
[42]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023. 7

work page 2023
[43]

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

Xue Bin Peng, Aravind Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning.arXiv preprint arXiv:1910.00177, 2019. 5, 17 12

work page internal anchor Pith review Pith/arXiv arXiv 1910
[44]

Reinforcement learning by reward-weighted regression for operational space control

Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. InICML, 2007. 5, 17

work page 2007
[45]

Sdxl: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In ICLR, 2024. 7

work page 2024
[46]

Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Bou tilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu

Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation.arXiv preprint arXiv:2310.03739, 2023. 2, 3, 5

work page arXiv 2023
[47]

Qwen2.5 technical report.arXiv preprint, 2024

Qwen Team. Qwen2.5 technical report.arXiv preprint, 2024. 21, 22

work page 2024
[48]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InICML, 2021. 6, 20

work page 2021
[49]

Zero-shot text-to-image generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. InICML, 2021. 15

work page 2021
[50]

Sequence level training with recurrent neural networks

Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. InICLR, 2016. 2

work page 2016
[51]

Generating diverse high-fidelity images with VQ-V AE-2

Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-V AE-2. InNeurIPS, 2019. 1

work page 2019
[52]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022. 7

work page 2022
[53]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. InarXiv preprint arXiv:1707.06347, 2017. 4, 16

work page internal anchor Pith review Pith/arXiv arXiv 2017
[54] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Y Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[55] Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, and Limin Wang. Scalable image tokenization with index backpropagation quantization. In CVPR, 2025.

[56] Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, et al. Journeydb: A benchmark for generative image understanding. NeurIPS, 2023.

[57] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525,

[58] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised learning results. In NeurIPS, 2017.

[59] Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. NeurIPS, 2024.

[60] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017.

[61] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In CVPR, 2024.

[62] Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl. arXiv preprint arXiv:2504.11455, 2025.

[63] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024.

[64] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.

[65] C. F. Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103,

[66] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.

[67] Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, and Xihui Liu. Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation. In CVPR, 2025.

[68] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. In NeurIPS, 2023.

[69] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. TMLR, 2022.

[70] Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. In NeurIPS, 2024.

[71] Guohui Zhang, Hu Yu, Xiaoxiao Ma, Jinghao Zhang, Yaning Pan, Mingde Yao, Jie Xiao, Linjiang Huang, and Feng Zhao. Group critical-token policy optimization for autoregressive image generation. In ICLR,

[72] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

[73] Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024.

Appendix for RankE

Roadmap. The appendix is organized into three parts, progressing from ...

[1] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NeurIPS, 2015.

[2] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

[3] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science, 2023.

[4] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. In ICLR, 2024.

[5] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.

[6] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. MaskGIT: Masked generative image transformer. In CVPR, 2022.

[7] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. In ICML, 2023.

[8] Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, and Emad Barsoum. Softvq-vae: Efficient 1-dimensional continuous tokenizer. In CVPR, 2025.

[9] Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025.

[10] Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025.

[11] Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable rewards. In ICLR, 2024.

[12] Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward model ensembles help mitigate overoptimization. In ICLR, 2024.

[13] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 1977.

[14] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. CogView: Mastering text-to-image generation via transformers. In NeurIPS, 2021.

[15] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.

[16] Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. In NeurIPS, 2024.

[17] Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In ICML,

[18] Dhruba Ghosh, Hannaneh Hajishirzi, and Alexander Schwing. GenEval: An object-focused framework for evaluating text-to-image alignment. arXiv preprint arXiv:2310.11513, 2023.

[19] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.

[20] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020.

[21] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017.

[22] Minyoung Huh, Brian Cheung, Pulkit Agrawal, and Phillip Isola. Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks. ICML, 2023.

[23] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.

[24] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In ICLR,

[25] Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, and Hongsheng Li. T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot. arXiv preprint arXiv:2505.00703, 2025.

[26] Łukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Parmar, Samy Bengio, Jakob Uszkoreit, and Noam Shazeer. Fast decoding in sequence models using discrete latent variables. In ICML, 2018.

[27] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

[28] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 2017.

[29] Tomasz Korbak, Hady Elsahar, Germán Kruszewski, and Marc Dymetman. RL with KL penalties is better viewed as Bayesian inference. In EMNLP, 2022.

[30] Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. In CVPR, 2025.

[31] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.

[32] Siyuan Li, Luyuan Zhang, Zedong Wang, Juanxi Tian, Cheng Tan, Zicheng Liu, Chang Yu, Qingsong Xie, Haonan Lu, Haoqian Wang, and Zhen Lei. Mergevq: A unified framework for visual generation and representation with disentangled token merging and quantization. In CVPR, 2025.

[33] Xinyao Liao, Qiyuan He, Kai Xu, Xiaoye Qu, Yicong Li, Wei Wei, and Angela Yao. Va-π: Variational policy alignment for pixel-aware autoregressive generation. In CVPR, 2026.

[34] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

[35] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In ICLR, 2023.

[36] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. In NeurIPS, 2025.

[37] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.

[38] Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation. arXiv preprint arXiv:2409.04410,

[39] Radford M Neal and Geoffrey E Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Springer, 1998.

[40] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. NeurIPS, 2022.

[41] Ziqi Pang, Tianyuan Zhang, Fujun Luan, Yunze Man, Hao Tan, Kai Zhang, William T. Freeman, and Yu-Xiong Wang. Randar: Decoder-only autoregressive visual generation in random orders. In CVPR, 2024.

[42] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.

[43] Xue Bin Peng, Aravind Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

[44] Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In ICML, 2007.

[45] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In ICLR, 2024.

[46] Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation. arXiv preprint arXiv:2310.03739, 2023.
