pith. sign in

arxiv: 2606.32039 · v1 · pith:T6ZXFIM2new · submitted 2026-06-30 · 💻 cs.CV

GEAR: Guided End-to-End AutoRegression for Image Synthesis

Pith reviewed 2026-07-01 05:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords autoregressive image synthesisvector quantizationend-to-end trainingrepresentation alignmenttokenizer guidanceImageNet generation
0
0 comments X

The pith

GEAR jointly trains a vector-quantized tokenizer and autoregressive generator end-to-end by using dual readouts of codebook assignments so the tokenizer produces indices the generator can model more easily.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Visual generative models are normally trained in two separate stages: a tokenizer is first optimized for reconstruction and then frozen before an autoregressive generator is trained on its discrete indices. This separation means the tokenizer has no information about which indices the generator finds easy or hard to predict. GEAR removes the separation by training both components together, passing gradients back to the tokenizer through a differentiable soft readout while the autoregressive model continues to use the standard hard one-hot indices for next-token prediction. The representation-alignment loss therefore steers the tokenizer toward index distributions that the autoregressive model can predict more readily, reversing the usual direction of alignment. The resulting joint system reaches lower gFID scores on ImageNet substantially faster than two-stage baselines while producing more coherent spatial features.

Core claim

GEAR resolves the non-differentiability of vector-quantized indices by maintaining two readouts of the same codebook assignment: a hard one-hot branch that trains the autoregressive model with standard next-token prediction and a differentiable soft branch that carries a representation-alignment loss back to the tokenizer alone. This arrangement lets the autoregressive model guide the tokenizer toward index statistics it can model more easily, shifting the alignment burden from the tokenizer to the generator and producing tokenizer features that are less DINOv2-like while the autoregressive features become more so.

What carries the argument

Dual read-out of the codebook assignment: a hard one-hot branch for autoregressive next-token training and a differentiable soft branch that carries the representation-alignment loss to the tokenizer.

If this is right

  • ImageNet gFID convergence accelerates by up to 10x relative to strong two-stage autoregressive baselines.
  • Patch-level and spatially coherent features improve markedly under the joint objective.
  • The approach works across multiple quantizers including VQVAE, LFQ and IBQ.
  • The same joint-training recipe extends directly to text-to-image generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reversal of alignment direction (tokenizer becoming less semantic while the autoregressive model becomes more semantic) suggests that latent design choices for autoregressive models may need to be reconsidered relative to diffusion models.
  • If the speedup persists when the representation-alignment loss is replaced by other auxiliary objectives, the core benefit may stem mainly from the differentiability mechanism rather than the specific alignment target.
  • Extending the dual-readout pattern to continuous latent autoregressive models could test whether the same gradient-routing idea applies outside discrete codebooks.

Load-bearing premise

The dual read-out successfully passes gradients to the tokenizer without causing codebook collapse or training instability.

What would settle it

A controlled experiment in which the joint GEAR training run exhibits the same gFID convergence curve as the two-stage LlamaGen-REPA baseline would falsify the central speedup claim.

read the original abstract

Visual generative models are typically trained in two stages. A tokenizer is first trained for reconstruction and then frozen, after which a generator is trained on its discrete indices or continuous latents. This decoupling leaves the tokenizer unaware of what the generator finds easy to model. We present GEAR (Guided End-to-end AutoRegression), which trains a vector-quantized (VQ) tokenizer and an autoregressive (AR) generator jointly and end-to-end, guided by representation alignment. The key obstacle is that the VQ index fed to the AR model is non-differentiable, so gradients cannot reach the tokenizer, and a straight-through estimator collapses. GEAR resolves this with a dual read-out of the codebook assignment. A hard, one-hot branch trains the AR with next-token prediction, while a differentiable soft branch carries a representation-alignment loss that flows back to guide only the tokenizer. The AR model thereby steers its tokenizer toward an index distribution it can predict more easily. This shifts the alignment burden from the tokenizer to the AR: the tokenizer's own features become less DINOv2-like while the AR's become more so, the opposite of diffusion-side recipes that make the latent itself semantic. GEAR speeds up ImageNet gFID convergence by up to 10x relative to the strong LlamaGen-REPA baseline, learns markedly better patch-level and spatially-coherent features, and generalizes across quantizers (VQVAE, LFQ, IBQ) and to text-to-image generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces GEAR, a method for jointly training a vector-quantized tokenizer and an autoregressive generator end-to-end for image synthesis. It uses a dual read-out of codebook assignments—a hard one-hot branch for next-token AR training and a differentiable soft branch for a representation-alignment loss that backpropagates only to the tokenizer—claiming this allows the AR model to guide the tokenizer toward more predictable indices. The abstract reports up to 10x faster ImageNet gFID convergence versus LlamaGen-REPA, improved patch-level and spatial features, a shift in DINOv2 alignment (tokenizer less semantic, AR more so), and generalization across VQVAE/LFQ/IBQ quantizers plus text-to-image tasks.

Significance. If the dual read-out mechanism proves stable and the reported speedups hold under controlled conditions, GEAR would represent a meaningful advance in autoregressive generative modeling by closing the tokenizer-generator decoupling gap. The claimed generalization across multiple quantizers and extension to text-to-image would strengthen its practical value; the opposite alignment shift relative to diffusion recipes is an interesting empirical observation worth confirming.

major comments (3)
  1. [Abstract] Abstract: The central 10x gFID speedup claim rests on the dual read-out successfully routing gradients through the soft branch to guide the tokenizer without collapse or degradation of the hard-branch AR training. However, the abstract provides no equations defining the soft assignment (e.g., distance metric to codebook entries, temperature, or softmax formulation), no ablation on collapse prevention, and no training curves or stability metrics, leaving the load-bearing assumption unverified.
  2. [Abstract] Abstract: The reported shift ('tokenizer's own features become less DINOv2-like while the AR's become more so') is presented as a direct consequence of the method, yet no quantitative evidence (e.g., cosine similarities, feature maps, or tables comparing DINOv2 alignment before/after) is referenced, undermining the interpretation that alignment burden has been shifted.
  3. [Abstract] Abstract: Generalization claims across VQVAE, LFQ, and IBQ quantizers, as well as to text-to-image, require that the dual read-out be adapted without introducing new instabilities; the abstract states the outcome but supplies no implementation details, loss-weighting schedules, or per-quantizer ablations that would confirm the mechanism transfers.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief parenthetical on the representation-alignment loss (e.g., which layers or features are aligned) to orient readers before the method section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed feedback on the abstract. We address each major comment below and will revise the manuscript to incorporate additional details and references where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central 10x gFID speedup claim rests on the dual read-out successfully routing gradients through the soft branch to guide the tokenizer without collapse or degradation of the hard-branch AR training. However, the abstract provides no equations defining the soft assignment (e.g., distance metric to codebook entries, temperature, or softmax formulation), no ablation on collapse prevention, and no training curves or stability metrics, leaving the load-bearing assumption unverified.

    Authors: The manuscript defines the soft assignment via temperature-scaled softmax over Euclidean distances in Section 3.2 (Equation 3). Ablations addressing collapse prevention appear in Section 4.3, and training curves with stability metrics are in Figure 4. We will revise the abstract to include a concise reference to the formulation and these supporting results. revision: yes

  2. Referee: [Abstract] Abstract: The reported shift ('tokenizer's own features become less DINOv2-like while the AR's become more so') is presented as a direct consequence of the method, yet no quantitative evidence (e.g., cosine similarities, feature maps, or tables comparing DINOv2 alignment before/after) is referenced, undermining the interpretation that alignment burden has been shifted.

    Authors: Table 3 in the manuscript reports the quantitative DINOv2 cosine similarity comparisons for both tokenizer and AR features before and after training. We will update the abstract to reference this table. revision: yes

  3. Referee: [Abstract] Abstract: Generalization claims across VQVAE, LFQ, and IBQ quantizers, as well as to text-to-image, require that the dual read-out be adapted without introducing new instabilities; the abstract states the outcome but supplies no implementation details, loss-weighting schedules, or per-quantizer ablations that would confirm the mechanism transfers.

    Authors: Section 4.4 and the appendix provide the adaptation details for each quantizer, including loss-weighting schedules. Per-quantizer ablations and text-to-image results are in Tables 4–5 and Section 5.3; no additional instabilities were observed. We will revise the abstract to reference these sections and tables. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical joint-training procedure validated by experiments

full rationale

The paper presents GEAR as an empirical training recipe that jointly optimizes a VQ tokenizer and AR generator via a dual-readout mechanism (hard one-hot for next-token loss, soft branch for representation alignment). All performance claims (10x gFID speedup, better patch features, generalization across quantizers) are stated as experimental outcomes relative to baselines such as LlamaGen-REPA. No equations, fitted parameters, or self-citations are invoked in the provided text to derive the central results; the method does not reduce any claimed prediction to a quantity defined by its own inputs. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the contribution is a training procedure rather than new theoretical constructs or fitted constants.

pith-pipeline@v0.9.1-grok · 5832 in / 1227 out tokens · 40555 ms · 2026-07-01T05:16:31.706513+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

52 extracted references · 26 canonical work pages · 20 internal anchors

  1. [1]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean conference on computer vision, pages 213–229. Springer, 2020

  2. [2]

    GPIC: A Giant Permissive Image Corpus for Visual Generation

    Keshigeyan Chandrasegaran, Kyle Sargent, Suchir Agarwal, Michael Jang, Michael Poli, Juan Carlos Niebles, Justin Johnson, Jiajun Wu, and Li Fei-Fei. Gpic: A giant permissive image corpus for visual generation.arXiv preprint arXiv:2605.30341, 2026

  3. [3]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022

  4. [4]

    Masked autoencoders are effective tokenizers for diffusion models

    Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, and Bhiksha Raj. Masked autoencoders are effective tokenizers for diffusion models. InForty-second International Conference on Machine Learning, 2025

  5. [5]

    Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis. InInternational conference on learning representations, volume 2024, pages 57611–57640, 2024

  6. [6]

    End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

    Wenda Chu, Bingliang Zhang, Jiaqi Han, Yizhuo Li, Linjie Yang, Yisong Yue, and Qiushan Guo. End-to-end autoregressive image generation with 1d semantic tokenizer.arXiv preprint arXiv:2605.00503, 2026

  7. [7]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

  8. [8]

    SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, et al. Sensenova-u1: Unifying multimodal understanding and generation with neo-unify architecture.arXiv preprint arXiv:2605.12500, 2026

  9. [9]

    Vqrae: Representation quantization autoencoders for multimodal understanding, generation and reconstruction

    Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, et al. Vqrae: Representation quantization autoencoders for multimodal understanding, generation and reconstruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 30322–30334, 2026

  10. [10]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

  11. [11]

    Masked diffusion transformer is a strong image synthesizer

    Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Masked diffusion transformer is a strong image synthesizer. InProceedings of the IEEE/CVF international conference on computer vision, pages 23164–23173, 2023

  12. [12]

    Mdtv2: Masked diffusion transformer is a strong image synthesizer.arXiv preprint arXiv:2303.14389, 2023

    Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Mdtv2: Masked diffusion transformer is a strong image synthesizer.arXiv preprint arXiv:2303.14389, 2023

  13. [13]

    arXiv preprint arXiv:2507.22058 (2025)

    Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025

  14. [14]

    Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

  15. [15]

    Rich feature hierarchies for accurate object detection and semantic segmentation

    Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014

  16. [16]

    Vision as a dialect: Unifying visual understanding and generation via text-aligned representations.Advances in Neural Information Processing Systems, 38:158430–158459, 2026

    Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, and Lu Jiang. Vision as a dialect: Unifying visual understanding and generation via text-aligned representations.Advances in Neural Information Processing Systems, 38:158430–158459, 2026

  17. [17]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024. 17

  18. [18]

    The Platonic Representation Hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis.arXiv preprint arXiv:2405.07987, 2024

  19. [19]

    Similarity of neural network represen- tations revisited

    Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network represen- tations revisited. InInternational conference on machine learning, pages 3519–3529. PMlR, 2019

  20. [20]

    Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025

  21. [21]

    Back to basics: Let denoising generative models denoise

    Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 36115–36125, 2026

  22. [22]

    Autoregressive image generation without vector quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024

  23. [23]

    ifsq: Improving fsq for image generation with 1 line of code.arXiv preprint arXiv:2601.17124, 2026

    Bin Lin, Zongjian Li, Yuwei Niu, Kaixiong Gong, Yunyang Ge, Yunlong Lin, Mingzhe Zheng, JianWei Zhang, Miles Yang, Zhao Zhong, et al. ifsq: Improving fsq for image generation with 1 line of code.arXiv preprint arXiv:2601.17124, 2026

  24. [24]

    Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, et al. Tuna-2: Pixel embeddings beat vision encoders for multimodal understanding and generation.arXiv preprint arXiv:2604.24763, 2026

  25. [25]

    Open-magvit2: An open-source project toward democratizing auto-regressive visual generation.arXiv preprint arXiv:2409.04410, 2024

    Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation.arXiv preprint arXiv:2409.04410, 2024

  26. [26]

    Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. InEuropean Conference on Computer Vision, pages 23–40. Springer, 2024

  27. [27]

    Notes on continuous stochastic phenomena.Biometrika, 37(1/2):17–23, 1950

    Patrick AP Moran. Notes on continuous stochastic phenomena.Biometrika, 37(1/2):17–23, 1950

  28. [28]

    V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

    Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-jepa 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

  29. [29]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

  30. [30]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  31. [31]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  32. [32]

    Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

  33. [33]

    You only look once: Unified, real-time object detection

    Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016

  34. [34]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  35. [35]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  36. [36]

    Scalable image tokenization with index backpropagation quantization

    Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, and Limin Wang. Scalable image tokenization with index backpropagation quantization. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16037–16046, 2025. 18

  37. [37]

    Representation alignment for just image transformers is not easier than you think.arXiv preprint arXiv:2603.14366, 2026

    Jaeyo Shin, Jiwook Kim, and Hyunjung Shim. Representation alignment for just image transformers is not easier than you think.arXiv preprint arXiv:2603.14366, 2026

  38. [38]

    DINOv3

    Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025

  39. [39]

    What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794,

    Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure?arXiv preprint arXiv:2512.10794, 2025

  40. [40]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

  41. [41]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

  42. [42]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

  43. [43]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

  44. [44]

    Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

  45. [45]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  46. [46]

    Reconstruction vs

    Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025

  47. [47]

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023

  48. [48]

    Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think.arXiv preprint arXiv:2410.06940, 2024

  49. [49]

    Pixeldit: Pixel diffusion transformers for image generation

    Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, and Jiebo Luo. Pixeldit: Pixel diffusion transformers for image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14273–14282, 2026

  50. [50]

    What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

    Zhengrong Yue, Taihang Hu, Mengting Chen, Haiyu Zhang, Zihao Pan, Tao Liu, Zikang Wang, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, et al. What matters for diffusion-friendly latent manifold? prior-aligned autoencoders for latent diffusion.arXiv preprint arXiv:2605.07915, 2026

  51. [51]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  52. [52]

    Diffusion Transformers with Representation Autoencoders

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders.arXiv preprint arXiv:2510.11690, 2025. 19 A Ablation Training Configurations All ablation studies fine-tune the GEAR-L model end-to-end on top of the warm-up tokenizer (except the initialization study, which also trains the tokenizer from scrat...