GEAR: Guided End-to-End AutoRegression for Image Synthesis

Bin Lin; Chenguo Lin; Jianwei Zhang; Liefeng Bo; Li Yuan; Miles Yang; Sixiang Chen; Yunlong Lin; Yunyang Ge; Zhao Zhong

arxiv: 2606.32039 · v1 · pith:T6ZXFIM2new · submitted 2026-06-30 · 💻 cs.CV

GEAR: Guided End-to-End AutoRegression for Image Synthesis

Bin Lin , Zheyuan Liu , Chenguo Lin , Sixiang Chen , Yunyang Ge , Yunlong Lin , Jianwei Zhang , Miles Yang

show 3 more authors

Zhao Zhong Liefeng Bo Li Yuan

This is my paper

Pith reviewed 2026-07-01 05:16 UTC · model grok-4.3

classification 💻 cs.CV

keywords autoregressive image synthesisvector quantizationend-to-end trainingrepresentation alignmenttokenizer guidanceImageNet generation

0 comments

The pith

GEAR jointly trains a vector-quantized tokenizer and autoregressive generator end-to-end by using dual readouts of codebook assignments so the tokenizer produces indices the generator can model more easily.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Visual generative models are normally trained in two separate stages: a tokenizer is first optimized for reconstruction and then frozen before an autoregressive generator is trained on its discrete indices. This separation means the tokenizer has no information about which indices the generator finds easy or hard to predict. GEAR removes the separation by training both components together, passing gradients back to the tokenizer through a differentiable soft readout while the autoregressive model continues to use the standard hard one-hot indices for next-token prediction. The representation-alignment loss therefore steers the tokenizer toward index distributions that the autoregressive model can predict more readily, reversing the usual direction of alignment. The resulting joint system reaches lower gFID scores on ImageNet substantially faster than two-stage baselines while producing more coherent spatial features.

Core claim

GEAR resolves the non-differentiability of vector-quantized indices by maintaining two readouts of the same codebook assignment: a hard one-hot branch that trains the autoregressive model with standard next-token prediction and a differentiable soft branch that carries a representation-alignment loss back to the tokenizer alone. This arrangement lets the autoregressive model guide the tokenizer toward index statistics it can model more easily, shifting the alignment burden from the tokenizer to the generator and producing tokenizer features that are less DINOv2-like while the autoregressive features become more so.

What carries the argument

Dual read-out of the codebook assignment: a hard one-hot branch for autoregressive next-token training and a differentiable soft branch that carries the representation-alignment loss to the tokenizer.

If this is right

ImageNet gFID convergence accelerates by up to 10x relative to strong two-stage autoregressive baselines.
Patch-level and spatially coherent features improve markedly under the joint objective.
The approach works across multiple quantizers including VQVAE, LFQ and IBQ.
The same joint-training recipe extends directly to text-to-image generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reversal of alignment direction (tokenizer becoming less semantic while the autoregressive model becomes more semantic) suggests that latent design choices for autoregressive models may need to be reconsidered relative to diffusion models.
If the speedup persists when the representation-alignment loss is replaced by other auxiliary objectives, the core benefit may stem mainly from the differentiability mechanism rather than the specific alignment target.
Extending the dual-readout pattern to continuous latent autoregressive models could test whether the same gradient-routing idea applies outside discrete codebooks.

Load-bearing premise

The dual read-out successfully passes gradients to the tokenizer without causing codebook collapse or training instability.

What would settle it

A controlled experiment in which the joint GEAR training run exhibits the same gFID convergence curve as the two-stage LlamaGen-REPA baseline would falsify the central speedup claim.

read the original abstract

Visual generative models are typically trained in two stages. A tokenizer is first trained for reconstruction and then frozen, after which a generator is trained on its discrete indices or continuous latents. This decoupling leaves the tokenizer unaware of what the generator finds easy to model. We present GEAR (Guided End-to-end AutoRegression), which trains a vector-quantized (VQ) tokenizer and an autoregressive (AR) generator jointly and end-to-end, guided by representation alignment. The key obstacle is that the VQ index fed to the AR model is non-differentiable, so gradients cannot reach the tokenizer, and a straight-through estimator collapses. GEAR resolves this with a dual read-out of the codebook assignment. A hard, one-hot branch trains the AR with next-token prediction, while a differentiable soft branch carries a representation-alignment loss that flows back to guide only the tokenizer. The AR model thereby steers its tokenizer toward an index distribution it can predict more easily. This shifts the alignment burden from the tokenizer to the AR: the tokenizer's own features become less DINOv2-like while the AR's become more so, the opposite of diffusion-side recipes that make the latent itself semantic. GEAR speeds up ImageNet gFID convergence by up to 10x relative to the strong LlamaGen-REPA baseline, learns markedly better patch-level and spatially-coherent features, and generalizes across quantizers (VQVAE, LFQ, IBQ) and to text-to-image generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GEAR's dual hard/soft read-out for joint VQ-AR training is a direct attempt to fix the non-differentiability issue, but the abstract gives no equations, ablations, or stability checks to back the 10x gFID claim.

read the letter

The paper's core move is training the tokenizer and autoregressive generator together instead of freezing the first stage. It uses a hard one-hot branch for the usual next-token loss on the AR and a separate differentiable soft branch that carries a representation-alignment loss back only to the tokenizer. This is meant to let the AR push the tokenizer toward codebook assignments it can model more easily.

That joint recipe with the dual read-out is the actual new piece. Prior two-stage VQ+AR work left the tokenizer blind to the generator's needs; here the alignment burden shifts so the tokenizer's features move away from DINOv2 while the AR's move toward it. The abstract also claims this produces up to 10x faster ImageNet gFID convergence versus LlamaGen-REPA, better patch-level coherence, and works across VQVAE, LFQ, IBQ plus text-to-image.

The soft spots sit right where the stress-test note points. The abstract supplies no equations for the soft assignment, no loss weights, no ablation on collapse or gradient stability, and no training curves. Without those, it is impossible to tell whether the soft branch actually routes useful gradients or whether the reported speedup comes from other unmentioned changes. The central claim rests entirely on experimental outcomes that are not shown here.

This is for people already working on discrete autoregressive image models who want to drop the two-stage split. A reader who needs a concrete new training trick and is willing to implement and test the dual read-out themselves could extract value, but only if the full paper supplies the missing controls.

It deserves a serious referee because the obstacle it targets is real and the proposed fix is simple enough to evaluate. I would send it to review and expect the referees to demand the soft-branch implementation details plus ablations on stability before any stronger conclusions.

Referee Report

3 major / 1 minor

Summary. The paper introduces GEAR, a method for jointly training a vector-quantized tokenizer and an autoregressive generator end-to-end for image synthesis. It uses a dual read-out of codebook assignments—a hard one-hot branch for next-token AR training and a differentiable soft branch for a representation-alignment loss that backpropagates only to the tokenizer—claiming this allows the AR model to guide the tokenizer toward more predictable indices. The abstract reports up to 10x faster ImageNet gFID convergence versus LlamaGen-REPA, improved patch-level and spatial features, a shift in DINOv2 alignment (tokenizer less semantic, AR more so), and generalization across VQVAE/LFQ/IBQ quantizers plus text-to-image tasks.

Significance. If the dual read-out mechanism proves stable and the reported speedups hold under controlled conditions, GEAR would represent a meaningful advance in autoregressive generative modeling by closing the tokenizer-generator decoupling gap. The claimed generalization across multiple quantizers and extension to text-to-image would strengthen its practical value; the opposite alignment shift relative to diffusion recipes is an interesting empirical observation worth confirming.

major comments (3)

[Abstract] Abstract: The central 10x gFID speedup claim rests on the dual read-out successfully routing gradients through the soft branch to guide the tokenizer without collapse or degradation of the hard-branch AR training. However, the abstract provides no equations defining the soft assignment (e.g., distance metric to codebook entries, temperature, or softmax formulation), no ablation on collapse prevention, and no training curves or stability metrics, leaving the load-bearing assumption unverified.
[Abstract] Abstract: The reported shift ('tokenizer's own features become less DINOv2-like while the AR's become more so') is presented as a direct consequence of the method, yet no quantitative evidence (e.g., cosine similarities, feature maps, or tables comparing DINOv2 alignment before/after) is referenced, undermining the interpretation that alignment burden has been shifted.
[Abstract] Abstract: Generalization claims across VQVAE, LFQ, and IBQ quantizers, as well as to text-to-image, require that the dual read-out be adapted without introducing new instabilities; the abstract states the outcome but supplies no implementation details, loss-weighting schedules, or per-quantizer ablations that would confirm the mechanism transfers.

minor comments (1)

[Abstract] The abstract would benefit from a brief parenthetical on the representation-alignment loss (e.g., which layers or features are aligned) to orient readers before the method section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed feedback on the abstract. We address each major comment below and will revise the manuscript to incorporate additional details and references where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract: The central 10x gFID speedup claim rests on the dual read-out successfully routing gradients through the soft branch to guide the tokenizer without collapse or degradation of the hard-branch AR training. However, the abstract provides no equations defining the soft assignment (e.g., distance metric to codebook entries, temperature, or softmax formulation), no ablation on collapse prevention, and no training curves or stability metrics, leaving the load-bearing assumption unverified.

Authors: The manuscript defines the soft assignment via temperature-scaled softmax over Euclidean distances in Section 3.2 (Equation 3). Ablations addressing collapse prevention appear in Section 4.3, and training curves with stability metrics are in Figure 4. We will revise the abstract to include a concise reference to the formulation and these supporting results. revision: yes
Referee: [Abstract] Abstract: The reported shift ('tokenizer's own features become less DINOv2-like while the AR's become more so') is presented as a direct consequence of the method, yet no quantitative evidence (e.g., cosine similarities, feature maps, or tables comparing DINOv2 alignment before/after) is referenced, undermining the interpretation that alignment burden has been shifted.

Authors: Table 3 in the manuscript reports the quantitative DINOv2 cosine similarity comparisons for both tokenizer and AR features before and after training. We will update the abstract to reference this table. revision: yes
Referee: [Abstract] Abstract: Generalization claims across VQVAE, LFQ, and IBQ quantizers, as well as to text-to-image, require that the dual read-out be adapted without introducing new instabilities; the abstract states the outcome but supplies no implementation details, loss-weighting schedules, or per-quantizer ablations that would confirm the mechanism transfers.

Authors: Section 4.4 and the appendix provide the adaptation details for each quantizer, including loss-weighting schedules. Per-quantizer ablations and text-to-image results are in Tables 4–5 and Section 5.3; no additional instabilities were observed. We will revise the abstract to reference these sections and tables. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical joint-training procedure validated by experiments

full rationale

The paper presents GEAR as an empirical training recipe that jointly optimizes a VQ tokenizer and AR generator via a dual-readout mechanism (hard one-hot for next-token loss, soft branch for representation alignment). All performance claims (10x gFID speedup, better patch features, generalization across quantizers) are stated as experimental outcomes relative to baselines such as LlamaGen-REPA. No equations, fitted parameters, or self-citations are invoked in the provided text to derive the central results; the method does not reduce any claimed prediction to a quantity defined by its own inputs. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the contribution is a training procedure rather than new theoretical constructs or fitted constants.

pith-pipeline@v0.9.1-grok · 5832 in / 1227 out tokens · 40555 ms · 2026-07-01T05:16:31.706513+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 26 canonical work pages · 20 internal anchors

[1]

End-to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean conference on computer vision, pages 213–229. Springer, 2020

2020
[2]

GPIC: A Giant Permissive Image Corpus for Visual Generation

Keshigeyan Chandrasegaran, Kyle Sargent, Suchir Agarwal, Michael Jang, Michael Poli, Juan Carlos Niebles, Justin Johnson, Jiajun Wu, and Li Fei-Fei. Gpic: A giant permissive image corpus for visual generation.arXiv preprint arXiv:2605.30341, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

Maskgit: Masked generative image transformer

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022

2022
[4]

Masked autoencoders are effective tokenizers for diffusion models

Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, and Bhiksha Raj. Masked autoencoders are effective tokenizers for diffusion models. InForty-second International Conference on Machine Learning, 2025

2025
[5]

Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis. InInternational conference on learning representations, volume 2024, pages 57611–57640, 2024

2024
[6]

End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

Wenda Chu, Bingliang Zhang, Jiaqi Han, Yizhuo Li, Linjie Yang, Yisong Yue, and Qiushan Guo. End-to-end autoregressive image generation with 1d semantic tokenizer.arXiv preprint arXiv:2605.00503, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, et al. Sensenova-u1: Unifying multimodal understanding and generation with neo-unify architecture.arXiv preprint arXiv:2605.12500, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[9]

Vqrae: Representation quantization autoencoders for multimodal understanding, generation and reconstruction

Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, et al. Vqrae: Representation quantization autoencoders for multimodal understanding, generation and reconstruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 30322–30334, 2026

2026
[10]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

2021
[11]

Masked diffusion transformer is a strong image synthesizer

Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Masked diffusion transformer is a strong image synthesizer. InProceedings of the IEEE/CVF international conference on computer vision, pages 23164–23173, 2023

2023
[12]

Mdtv2: Masked diffusion transformer is a strong image synthesizer.arXiv preprint arXiv:2303.14389, 2023

Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Mdtv2: Masked diffusion transformer is a strong image synthesizer.arXiv preprint arXiv:2303.14389, 2023

work page arXiv 2023
[13]

arXiv preprint arXiv:2507.22058 (2025)

Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025

work page arXiv 2025
[14]

Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

2023
[15]

Rich feature hierarchies for accurate object detection and semantic segmentation

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014

2014
[16]

Vision as a dialect: Unifying visual understanding and generation via text-aligned representations.Advances in Neural Information Processing Systems, 38:158430–158459, 2026

Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, and Lu Jiang. Vision as a dialect: Unifying visual understanding and generation via text-aligned representations.Advances in Neural Information Processing Systems, 38:158430–158459, 2026

2026
[17]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024. 17

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

The Platonic Representation Hypothesis

Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis.arXiv preprint arXiv:2405.07987, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Similarity of neural network represen- tations revisited

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network represen- tations revisited. InInternational conference on machine learning, pages 3519–3529. PMlR, 2019

2019
[20]

Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025

2025
[21]

Back to basics: Let denoising generative models denoise

Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 36115–36125, 2026

2026
[22]

Autoregressive image generation without vector quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024

2024
[23]

ifsq: Improving fsq for image generation with 1 line of code.arXiv preprint arXiv:2601.17124, 2026

Bin Lin, Zongjian Li, Yuwei Niu, Kaixiong Gong, Yunyang Ge, Yunlong Lin, Mingzhe Zheng, JianWei Zhang, Miles Yang, Zhao Zhong, et al. ifsq: Improving fsq for image generation with 1 line of code.arXiv preprint arXiv:2601.17124, 2026

work page arXiv 2026
[24]

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, et al. Tuna-2: Pixel embeddings beat vision encoders for multimodal understanding and generation.arXiv preprint arXiv:2604.24763, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

Open-magvit2: An open-source project toward democratizing auto-regressive visual generation.arXiv preprint arXiv:2409.04410, 2024

Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation.arXiv preprint arXiv:2409.04410, 2024

work page arXiv 2024
[26]

Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. InEuropean Conference on Computer Vision, pages 23–40. Springer, 2024

2024
[27]

Notes on continuous stochastic phenomena.Biometrika, 37(1/2):17–23, 1950

Patrick AP Moran. Notes on continuous stochastic phenomena.Biometrika, 37(1/2):17–23, 1950

1950
[28]

V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-jepa 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[29]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

2022
[31]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023
[32]

Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

2023
[33]

You only look once: Unified, real-time object detection

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016

2016
[34]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[35]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Scalable image tokenization with index backpropagation quantization

Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, and Limin Wang. Scalable image tokenization with index backpropagation quantization. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16037–16046, 2025. 18

2025
[37]

Representation alignment for just image transformers is not easier than you think.arXiv preprint arXiv:2603.14366, 2026

Jaeyo Shin, Jiwook Kim, and Hyunjung Shim. Representation alignment for just image transformers is not easier than you think.arXiv preprint arXiv:2603.14366, 2026

work page arXiv 2026
[38]

DINOv3

Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794,

Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure?arXiv preprint arXiv:2512.10794, 2025

work page arXiv 2025
[40]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

2024
[43]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

2017
[45]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Reconstruction vs

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025

2025
[47]

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think.arXiv preprint arXiv:2410.06940, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

Pixeldit: Pixel diffusion transformers for image generation

Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, and Jiebo Luo. Pixeldit: Pixel diffusion transformers for image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14273–14282, 2026

2026
[50]

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

Zhengrong Yue, Taihang Hu, Mengting Chen, Haiyu Zhang, Zihao Pan, Tao Liu, Zikang Wang, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, et al. What matters for diffusion-friendly latent manifold? prior-aligned autoencoders for latent diffusion.arXiv preprint arXiv:2605.07915, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[51]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

2018
[52]

Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders.arXiv preprint arXiv:2510.11690, 2025. 19 A Ablation Training Configurations All ablation studies fine-tune the GEAR-L model end-to-end on top of the warm-up tokenizer (except the initialization study, which also trains the tokenizer from scrat...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[1] [1]

End-to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean conference on computer vision, pages 213–229. Springer, 2020

2020

[2] [2]

GPIC: A Giant Permissive Image Corpus for Visual Generation

Keshigeyan Chandrasegaran, Kyle Sargent, Suchir Agarwal, Michael Jang, Michael Poli, Juan Carlos Niebles, Justin Johnson, Jiajun Wu, and Li Fei-Fei. Gpic: A giant permissive image corpus for visual generation.arXiv preprint arXiv:2605.30341, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

Maskgit: Masked generative image transformer

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022

2022

[4] [4]

Masked autoencoders are effective tokenizers for diffusion models

Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, and Bhiksha Raj. Masked autoencoders are effective tokenizers for diffusion models. InForty-second International Conference on Machine Learning, 2025

2025

[5] [5]

Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis. InInternational conference on learning representations, volume 2024, pages 57611–57640, 2024

2024

[6] [6]

End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

Wenda Chu, Bingliang Zhang, Jiaqi Han, Yizhuo Li, Linjie Yang, Yisong Yue, and Qiushan Guo. End-to-end autoregressive image generation with 1d semantic tokenizer.arXiv preprint arXiv:2605.00503, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[7] [7]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, et al. Sensenova-u1: Unifying multimodal understanding and generation with neo-unify architecture.arXiv preprint arXiv:2605.12500, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[9] [9]

Vqrae: Representation quantization autoencoders for multimodal understanding, generation and reconstruction

Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, et al. Vqrae: Representation quantization autoencoders for multimodal understanding, generation and reconstruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 30322–30334, 2026

2026

[10] [10]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

2021

[11] [11]

Masked diffusion transformer is a strong image synthesizer

Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Masked diffusion transformer is a strong image synthesizer. InProceedings of the IEEE/CVF international conference on computer vision, pages 23164–23173, 2023

2023

[12] [12]

Mdtv2: Masked diffusion transformer is a strong image synthesizer.arXiv preprint arXiv:2303.14389, 2023

Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Mdtv2: Masked diffusion transformer is a strong image synthesizer.arXiv preprint arXiv:2303.14389, 2023

work page arXiv 2023

[13] [13]

arXiv preprint arXiv:2507.22058 (2025)

Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025

work page arXiv 2025

[14] [14]

Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

2023

[15] [15]

Rich feature hierarchies for accurate object detection and semantic segmentation

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014

2014

[16] [16]

Vision as a dialect: Unifying visual understanding and generation via text-aligned representations.Advances in Neural Information Processing Systems, 38:158430–158459, 2026

Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, and Lu Jiang. Vision as a dialect: Unifying visual understanding and generation via text-aligned representations.Advances in Neural Information Processing Systems, 38:158430–158459, 2026

2026

[17] [17]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024. 17

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [18]

The Platonic Representation Hypothesis

Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis.arXiv preprint arXiv:2405.07987, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

Similarity of neural network represen- tations revisited

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network represen- tations revisited. InInternational conference on machine learning, pages 3519–3529. PMlR, 2019

2019

[20] [20]

Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025

2025

[21] [21]

Back to basics: Let denoising generative models denoise

Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 36115–36125, 2026

2026

[22] [22]

Autoregressive image generation without vector quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024

2024

[23] [23]

ifsq: Improving fsq for image generation with 1 line of code.arXiv preprint arXiv:2601.17124, 2026

Bin Lin, Zongjian Li, Yuwei Niu, Kaixiong Gong, Yunyang Ge, Yunlong Lin, Mingzhe Zheng, JianWei Zhang, Miles Yang, Zhao Zhong, et al. ifsq: Improving fsq for image generation with 1 line of code.arXiv preprint arXiv:2601.17124, 2026

work page arXiv 2026

[24] [24]

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, et al. Tuna-2: Pixel embeddings beat vision encoders for multimodal understanding and generation.arXiv preprint arXiv:2604.24763, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[25] [25]

Open-magvit2: An open-source project toward democratizing auto-regressive visual generation.arXiv preprint arXiv:2409.04410, 2024

Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation.arXiv preprint arXiv:2409.04410, 2024

work page arXiv 2024

[26] [26]

Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. InEuropean Conference on Computer Vision, pages 23–40. Springer, 2024

2024

[27] [27]

Notes on continuous stochastic phenomena.Biometrika, 37(1/2):17–23, 1950

Patrick AP Moran. Notes on continuous stochastic phenomena.Biometrika, 37(1/2):17–23, 1950

1950

[28] [28]

V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-jepa 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[29] [29]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[30] [30]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

2022

[31] [31]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023

[32] [32]

Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

2023

[33] [33]

You only look once: Unified, real-time object detection

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016

2016

[34] [34]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[35] [35]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [36]

Scalable image tokenization with index backpropagation quantization

Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, and Limin Wang. Scalable image tokenization with index backpropagation quantization. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16037–16046, 2025. 18

2025

[37] [37]

Representation alignment for just image transformers is not easier than you think.arXiv preprint arXiv:2603.14366, 2026

Jaeyo Shin, Jiwook Kim, and Hyunjung Shim. Representation alignment for just image transformers is not easier than you think.arXiv preprint arXiv:2603.14366, 2026

work page arXiv 2026

[38] [38]

DINOv3

Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794,

Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure?arXiv preprint arXiv:2512.10794, 2025

work page arXiv 2025

[40] [40]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[41] [41]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [42]

Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

2024

[43] [43]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [44]

Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

2017

[45] [45]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [46]

Reconstruction vs

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025

2025

[47] [47]

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[48] [48]

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think.arXiv preprint arXiv:2410.06940, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[49] [49]

Pixeldit: Pixel diffusion transformers for image generation

Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, and Jiebo Luo. Pixeldit: Pixel diffusion transformers for image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14273–14282, 2026

2026

[50] [50]

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

Zhengrong Yue, Taihang Hu, Mengting Chen, Haiyu Zhang, Zihao Pan, Tao Liu, Zikang Wang, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, et al. What matters for diffusion-friendly latent manifold? prior-aligned autoencoders for latent diffusion.arXiv preprint arXiv:2605.07915, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[51] [51]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

2018

[52] [52]

Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders.arXiv preprint arXiv:2510.11690, 2025. 19 A Ablation Training Configurations All ablation studies fine-tune the GEAR-L model end-to-end on top of the warm-up tokenizer (except the initialization study, which also trains the tokenizer from scrat...

work page internal anchor Pith review Pith/arXiv arXiv 2025