
arxiv: 2605.21195 · v1 · pith:3ESQ4MXTnew · submitted 2026-05-20 · 💻 cs.CV

RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

Pith reviewed 2026-05-21 05:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-image generation · discrete autoregressive models · post-training · decoder co-evolution · latent covariate shift · ranking-based optimization · FID improvement · CLIP score

The pith

Policy-only optimization in discrete text-to-image models creates a token-distribution mismatch with the frozen decoder that improves rewards while harming image quality, but co-evolving both resolves the trade-off.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Discrete autoregressive text-to-image models pair a VQ tokenizer with an AR policy and currently optimize only the policy while keeping the decoder frozen. This policy-only approach induces latent covariate shift as the evolving token distribution diverges from the decoder's original training data, so reward scores rise while decoded image quality falls. RankE instead alternates optimization between the two modules so each maximizes a ranking-based alignment objective while staying regularized by a stability anchor. On LlamaGen-XL the method simultaneously raises CLIP score and lowers FID; the same pattern holds on Janus-Pro. The result converts reward optimization into measurable pixel-space gains rather than trading one for the other.

Core claim

Policy-only post-training induces latent covariate shift between the policy's token distribution and the decoder's original training distribution, producing higher reward metrics but lower decoded image quality; RankE eliminates this mismatch by co-evolving both components through alternating optimization in which each module maximizes a ranking-based alignment objective while being regularized by a stability-preserving anchor suited to its parameter space.
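
As a rough illustration of what a ranking-based alignment objective can look like, the sketch below implements a generic pairwise logistic ranking loss. The paper's exact objective is not stated in this summary, so the loss form, the function name `pairwise_ranking_loss`, and the `scores` and `rewards` inputs are all assumptions made for illustration only.

```python
import math


def pairwise_ranking_loss(scores, rewards):
    """Mean logistic ranking loss over all reward-ordered pairs.

    Assumption: `scores` are hypothetical model-side scores (one per sample)
    and `rewards` are the matching reward values. For every pair (i, j) with
    rewards[i] > rewards[j], the term softplus(-(scores[i] - scores[j])) is
    small when the higher-reward sample also has the higher score.
    """
    losses = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if rewards[i] > rewards[j]:
                margin = scores[i] - scores[j]
                # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
                losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    # No strictly ordered pairs (for example, all rewards tied) gives zero loss.
    return sum(losses) / len(losses) if losses else 0.0
```

Minimizing this loss pushes the score ordering toward the reward ordering; only the relative order of rewards matters, not their scale.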

What carries the argument

Alternating optimization of policy and decoder, each maximizing a ranking-based alignment objective while regularized by a stability-preserving anchor.
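
The alternating schedule can be sketched on a toy problem. Everything here is an assumption for illustration: a scalar `theta` stands in for the policy, a scalar `phi` for the decoder, a quadratic `alignment` function replaces the ranking objective, and quadratic penalties toward the starting values replace the paper's stability anchors. It is not the authors' implementation.

```python
def alignment(theta, phi, target=2.0):
    """Toy stand-in for an alignment objective (higher is better)."""
    return -(theta + phi - target) ** 2


def alternating_ascent(theta0=0.0, phi0=0.0, lam_policy=0.5, lam_decoder=0.5,
                       lr=0.1, rounds=200, target=2.0, update_decoder=True):
    """Alternate gradient-ascent steps on the two blocks.

    Each block maximizes alignment minus a quadratic anchor penalty that
    pulls it back toward its own starting value (hypothetical anchor form).
    """
    theta, phi = theta0, phi0
    for _ in range(rounds):
        # Policy phase: decoder held fixed, policy anchored to theta0.
        grad_theta = (-2.0 * (theta + phi - target)
                      - 2.0 * lam_policy * (theta - theta0))
        theta += lr * grad_theta
        if update_decoder:
            # Decoder phase: policy held fixed, decoder anchored to phi0.
            grad_phi = (-2.0 * (theta + phi - target)
                        - 2.0 * lam_decoder * (phi - phi0))
            phi += lr * grad_phi
    return theta, phi


co_theta, co_phi = alternating_ascent()
frozen_theta, frozen_phi = alternating_ascent(update_decoder=False)
```

In this toy, the frozen-decoder run (`update_decoder=False`) ends with the policy further from its anchor and with lower `alignment` than the co-evolved run. That is a property of the toy objective only and says nothing about real-model behavior.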

If this is right

  • Standard RL improves CLIP but degrades FID, while RankE improves both simultaneously.
  • On LlamaGen-XL (775M) the method reaches FID 15.21 and CLIP 33.76 on MS-COCO 30K.
  • Consistent gains appear on Janus-Pro (1B) as well.
  • Reward optimization is converted directly into pixel-space quality improvements instead of a trade-off.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same co-evolution pattern could be tested on discrete autoregressive models for video or audio to check whether token-distribution mismatch is a general problem.
  • If stability anchors prove unnecessary in later work, full joint gradient updates between policy and decoder might become feasible.
  • The ranking objective used here might be replaced by other preference signals without changing the core need for decoder updates.

Load-bearing premise

The observed drop in decoded image quality after policy-only optimization is caused by divergence between the new token distribution and the decoder's original training distribution, and alternating co-evolution with anchors can correct the mismatch without introducing fresh instabilities or overfitting.

What would settle it

Sample a large set of tokens from the fully optimized policy, train a fresh decoder on those tokens alone, and measure whether the resulting FID matches or exceeds the co-evolved decoder's FID on the same prompts.

Figures

Figures reproduced from arXiv: 2605.21195 by Cheng Tan, Huan Wang, Luyuan Zhang, Siyong Jian, Siyuan Li, Xin Jin, Ying Li, Zedong Wang.

Figure 1: Latent Covariate Shift intensifies under RL and is mitigated by RankE. Left: KL divergence between each model’s VQ token distribution and that of 5,000 real MS-COCO images; the dashed line marks Real, a natural-variation lower bound computed from two independently sampled real-image sets encoded by the same frozen tokenizer. The shift grows progressively across pre-training, SFT, and RL; standard RL widens …
Figure 2: Comparison of existing AR post-training and RankE framework.
Figure 3: Overview of the RankE co-evolution framework.
Figure 4: Evolution of generation metrics over training steps. Comparison of RankE against a Standard RL baseline with a frozen decoder. While Standard RL achieves marginal improvements in alignment, it suffers from stagnant or degrading visual fidelity (b) as the frozen decoder cannot adapt to policy-induced latent drift. In contrast, the co-evolution mechanism of RankE effectively translates reward optimization in…
Figure 5: Visualization of T2I generation. RankE yields precise attributes and details according to …
Figure 6: Latent Covariate Shift and token entropy during training.
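
The left panel of Figure 1 compares token distributions by KL divergence. A minimal sketch of such a measurement follows, assuming token streams are flat lists of codebook indices. The direction KL(model || real), the additive smoothing constant `eps`, the helper names, and the toy token lists are all assumptions; the caption does not specify these details.

```python
import math
from collections import Counter


def token_histogram(tokens, vocab_size, eps=1e-8):
    """Return a smoothed probability vector over codebook indices."""
    counts = Counter(tokens)
    total = len(tokens) + eps * vocab_size
    return [(counts.get(i, 0) + eps) / total for i in range(vocab_size)]


def kl_divergence(p, q):
    """KL(p || q) in nats for two strictly positive probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical toy token streams over a 4-entry codebook.
real_a = [0, 1, 2, 3, 0, 1, 2, 3]      # first real-image sample set
real_b = [0, 1, 2, 3, 1, 0, 3, 2]      # second, independent real-image set
generated = [0, 0, 0, 0, 0, 1, 1, 2]   # tokens sampled from a model

p_real = token_histogram(real_a, 4)
baseline = kl_divergence(token_histogram(real_b, 4), p_real)  # real-vs-real floor
shift = kl_divergence(token_histogram(generated, 4), p_real)  # model-vs-real
```

Comparing `shift` against the real-vs-real `baseline` mirrors the role of the dashed "Real" line in the caption: divergence above that floor is what cannot be explained by sampling variation alone.
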
original abstract

Discrete autoregressive (AR) text-to-image (T2I) models pair a VQ tokenizer with an AR policy, and current post-training pipelines optimize only the policy while keeping the VQ decoder frozen. Recent diffusion T2I work, exemplified by REPA-E, has shown that the VAE itself constitutes a key alignment bottleneck, yet no analogous investigation exists for discrete AR models. We show that policy-only optimization induces Latent Covariate Shift: as the policy evolves, the resulting token distribution diverges from the ground-truth distribution on which the decoder was trained, such that reward scores improve while decoded image quality degrades. To address this mismatch, we propose RankE, the first end-to-end post-training framework for discrete T2I generation. Rather than optimizing the policy against a fixed decoder, RankE co-evolves both components through alternating optimization: each module maximizes a ranking-based alignment objective while being regularized by a stability-preserving anchor suited to its parameter space. This co-evolution breaks the fidelity-alignment trade-off that plagues frozen-decoder approaches: on LlamaGen-XL (775M), standard RL improves CLIP but degrades FID, whereas RankE improves both simultaneously (FID 15.21, CLIP 33.76 on MS-COCO 30K). Consistent gains on Janus-Pro (1B) confirm that decoder co-evolution reliably converts reward optimization into pixel-space quality improvements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that policy-only post-training of discrete autoregressive text-to-image models induces latent covariate shift between the evolving token distribution and the frozen VQ decoder, improving alignment (CLIP) at the expense of pixel quality (FID). RankE addresses this via alternating optimization that co-evolves the policy and decoder using ranking-based alignment objectives regularized by stability anchors. On LlamaGen-XL (775M) it reports simultaneous gains (FID 15.21, CLIP 33.76 on MS-COCO 30K) versus standard RL, with consistent results on Janus-Pro (1B).

Significance. If the reported metric improvements are robust and the covariate-shift mechanism is validated, the work offers a practical route to breaking the fidelity-alignment trade-off in discrete T2I post-training. The alternating co-evolution strategy with ranking objectives and stability anchors is a concrete contribution that could generalize beyond the tested models. However, the significance is limited by the absence of direct mechanistic evidence or isolating ablations, leaving the justification for decoder co-evolution as the necessary remedy open to alternative explanations.

major comments (3)
  1. [Abstract and §3] The central claim that policy-only RL induces latent covariate shift (and that this is the cause of FID degradation) is presented without quantitative characterization of the shift, such as token-histogram divergence, per-layer activation statistics, or reconstruction error on policy-generated samples. No ablation isolates decoder co-evolution from other effects of the alternating schedule.
  2. [Experiments, results on LlamaGen-XL and Janus-Pro] The reported FID and CLIP numbers lack error bars, number of random seeds, or statistical significance tests. It is therefore unclear whether the simultaneous improvement over standard RL is reliable or sensitive to hyper-parameter choices.
  3. [§4, ablation studies] The manuscript provides no ablation that removes the stability anchors or the ranking objective individually while keeping the alternating schedule, making it impossible to attribute gains specifically to decoder co-evolution rather than increased optimization capacity or regularization.
minor comments (2)
  1. [§3.2] The precise mathematical form of the stability-preserving anchor for the decoder (versus the policy) is only sketched; an explicit equation in the main text would improve reproducibility.
  2. [Figure 5] Figure 5 (qualitative examples) would benefit from side-by-side comparison with the standard RL baseline at matched CLIP score to illustrate the claimed quality difference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, proposing specific revisions to strengthen the manuscript while maintaining the integrity of our claims.

point-by-point responses
  1. Referee: [Abstract and §3] The central claim that policy-only RL induces latent covariate shift (and that this is the cause of FID degradation) is presented without quantitative characterization of the shift, such as token-histogram divergence, per-layer activation statistics, or reconstruction error on policy-generated samples. No ablation isolates decoder co-evolution from other effects of the alternating schedule.

    Authors: We agree that direct quantitative evidence of the covariate shift would strengthen the central claim. In the revised manuscript, we will add token-histogram KL divergence and reconstruction error metrics on policy-generated samples versus training data in §3. For isolating decoder co-evolution, we will include a new ablation comparing the full alternating RankE against a variant that performs alternating updates but freezes the decoder after initial steps. While perfect isolation is inherently limited by the coupled optimization, this ablation will clarify the contribution of decoder evolution beyond the alternating schedule alone. revision: yes

  2. Referee: The reported FID and CLIP numbers lack error bars, number of random seeds, or statistical significance tests. It is therefore unclear whether the simultaneous improvement over standard RL is reliable or sensitive to hyper-parameter choices.

    Authors: We acknowledge this limitation in statistical reporting. In the revision, we will rerun the main results on LlamaGen-XL and Janus-Pro using at least three random seeds, reporting means and standard deviations for FID and CLIP. We will also add a brief discussion of hyperparameter sensitivity based on our existing tuning logs to address reliability concerns. revision: yes

  3. Referee: The manuscript provides no ablation that removes the stability anchors or the ranking objective individually while keeping the alternating schedule, making it impossible to attribute gains specifically to decoder co-evolution rather than increased optimization capacity or regularization.

    Authors: We thank the referee for highlighting this gap. In the updated §4, we will add two targeted ablations while preserving the alternating schedule: (1) disabling stability anchors, and (2) replacing the ranking objective with standard RL loss. These will help attribute performance gains more precisely to decoder co-evolution versus other regularization or optimization effects. revision: yes
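
Response 2 commits to reporting means and standard deviations over at least three seeds. One possible way to compute and format such a summary is sketched below; the function names `summarize_runs` and `format_metric` are hypothetical, and no values from the paper are used.

```python
import statistics


def summarize_runs(values):
    """Return (mean, sample standard deviation) for per-seed metric values."""
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return statistics.mean(values), statistics.stdev(values)


def format_metric(name, values):
    """Format a metric as 'name: mean ± std (n=seeds)' with two decimals."""
    mean, std = summarize_runs(values)
    return f"{name}: {mean:.2f} ± {std:.2f} (n={len(values)})"
```

`statistics.stdev` uses the sample (n-1) estimator, which is the usual choice when only a handful of seeds is available.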

Circularity Check

0 steps flagged

No significant circularity; empirical observation and method proposal remain self-contained

full rationale

The paper motivates RankE from an empirical observation that policy-only RL on discrete AR T2I models improves reward metrics while degrading FID, attributing this to an induced divergence between the evolving token distribution and the decoder's original training support. This observation is presented as a measured phenomenon rather than derived from equations that presuppose the conclusion. The proposed alternating optimization with ranking objectives and stability anchors is introduced as a direct response to the observed mismatch, with performance gains demonstrated via direct comparison to frozen-decoder RL baselines on LlamaGen-XL and Janus-Pro. No load-bearing step reduces a prediction to a fitted parameter by construction, invokes a self-citation as an unverified uniqueness theorem, or renames a known result under new coordinates. The derivation chain is therefore grounded in external experimental contrasts rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of latent covariate shift as the cause of quality degradation and on the effectiveness of alternating optimization with per-module stability anchors; no explicit free parameters or invented entities are named in the abstract.

axioms (2)
  • domain assumption: Policy-only optimization induces a divergence between generated token distribution and the decoder's original training distribution.
    Invoked to explain why reward improves while decoded quality degrades.
  • ad hoc to paper: Alternating optimization with ranking objectives and stability anchors can break the fidelity-alignment trade-off without introducing instability.
    Core premise of the RankE method.



