pith. sign in

arxiv: 2606.20543 · v1 · pith:VFEJ6GGQnew · submitted 2026-06-18 · 💻 cs.CV

SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation

Pith reviewed 2026-06-26 17:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords autoregressive image generationspeculative decodingspatial correlationsinference accelerationvisual token prediction2D image geometrygeneration speedup
0
0 comments X

The pith

By predicting adjacent horizontal and vertical tokens at once, spatially speculative decoding speeds autoregressive image generation up to 13.3 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive image models flatten pictures into long 1D token sequences, which wastes the natural two-dimensional structure of images and creates slow inference. The paper introduces spatially speculative decoding to predict both the next token in the current row and the token directly below it in one step. This change uses the strong spatial correlations that exist across neighboring pixels in real images. The result is inference that runs up to 13.3 times faster on standard benchmarks while image quality stays the same. The work shows that matching the prediction rule to the actual geometry of pictures removes a major computational limit in visual generation.

Core claim

The paper claims that its Spatially Speculative Decoding framework aligns the prediction objective with image geometry by simultaneously forecasting the adjacent horizontal token and the token directly below the current position. This 2D extension of next-token prediction overcomes the memory wall that arises from 1D flattening, delivering speedups of up to 13.3 times on DPG-Bench and GenEval without loss of fidelity.

What carries the argument

Spatially Speculative Decoding (SSD), which extends single next-token prediction to joint prediction of the rightward and downward neighbor tokens by exploiting intrinsic 2D spatial correlations.

Load-bearing premise

That simultaneous prediction of adjacent horizontal and below tokens can be performed accurately enough to preserve generation quality because of 2D spatial correlations in images.

What would settle it

A measurable drop in scores on DPG-Bench or GenEval when the same base autoregressive model is run with SSD instead of standard next-token decoding.

Figures

Figures reproduced from arXiv: 2606.20543 by Chengzhi Mao, Lijun Yu, Shilong Xiang, Zirui Zhang.

Figure 1
Figure 1. Figure 1: The Two-Dimensional Nature of Predictive Dependency. To demonstrate that spatial correlations are inherently 2D, we corrupt the sequential context during Janus-Pro-7B generation by replacing the second half of each row with random tokens (red outlines). Despite this severe disruption to the 1D sequence, visual coherence is preserved wherever the token directly above was accurately generated (blue outlines)… view at source ↗
Figure 2
Figure 2. Figure 2: Accelerating Autoregressive Vision via 2D Spatial Anticipation. (a) Standard AR flattens the visual world into a 1D sequence, predicting one token at a time (O(n 2 ) steps). (b) Speculative Decoding accelerates generation locally but remains fundamentally constrained by this linear raster-scan geometry (O(n 2 )). (c) Our SSD aligns the predictive objective with the intrinsic geometry of images. By factoriz… view at source ↗
Figure 3
Figure 3. Figure 3: MTP drafting cost of SSD, normalized by one AR step on Lumina￾mGPT-7B (48×48 grid). Even at 240 drafted tokens, overhead stays below 0.1 AR steps. Autoregressive (AR) image generation loads billions of pa￾rameters from memory at every step yet produces only a sin￾gle token per forward pass, creating a memory-bandwidth bottleneck [30] known as the memory wall. Speculative decoding accelerates generation by … view at source ↗
Figure 4
Figure 4. Figure 4: Verification cost on Lumina-mGPT-7B (48×48 grid), normalized by one AR step. (a) La￾tency of verifying K tokens in parallel. As K grows to 240, the cost stays below 1.6× a single AR step, since the parameter-loading cost dominates due to the memory wall. (b) Wall-clock speedup scales near-linearly with K, approaching the ideal K× bound. Draft Prediction in Latent Space Anticipating in discrete token space … view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative results. Side-by-side comparison of AR baseline and SSD across three models. Our method yields up to 13.6× speedup while preserving high-resolution visual fidelity [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparison of Janus-Pro-7B outputs under different vertical verification and [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Samples generated by Lumina-mGPT-7B under joint and staged verification schedules at [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Extended qualitative results. Side-by-side comparisons of AR baseline, SJD, 1D-MTP, and SSD (Ours) across three models, demonstrating that our method achieves significant acceleration while maintaining high visual fidelity across diverse prompts. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
read the original abstract

Autoregressive models excel in visual generation by treating images as 1D sequences of discrete tokens, mirroring language modeling. However, this flattening discards the intrinsic 2D spatial locality of visual signals, creating severe computational bottlenecks during inference. We introduce Spatially Speculative Decoding (SSD), a framework that aligns the predictive objective with the natural geometry of images. Rather than predicting only the immediate next token in a 1D sequence, our model simultaneously predicts the adjacent horizontal token and the token directly below it. By capitalizing on this 2D spatial correlation, spatially speculative decoding overcomes the memory wall in visual inference. Our approach accelerates autoregressive image generation by up to 13.3x while maintaining high fidelity on DPG-Bench and GenEval. Our results suggest that respecting the underlying geometry of vision unlocks massive computational efficiencies, paving the way for real-time, high-resolution autoregressive generative models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Spatially Speculative Decoding (SSD) for autoregressive image generation. Rather than predicting only the next token in a flattened 1D raster-order sequence, the model is trained to jointly predict the next token along with its right-adjacent and below-adjacent neighbors. This modification exploits intrinsic 2D spatial correlations in images to enable speculative parallel predictions that reduce the number of sequential forward passes at inference time, with the central empirical claim being an acceleration of up to 13.3x while preserving generation quality on DPG-Bench and GenEval.

Significance. If the empirical results and quality preservation hold under detailed scrutiny, the work offers a lightweight way to align the training objective with image geometry instead of discarding 2D locality, which could meaningfully improve inference throughput for high-resolution autoregressive visual generators and support real-time applications. The approach is notable for its simplicity in modifying only the predictive targets without introducing new architectural components or parameters.

major comments (1)
  1. The central speedup claim (up to 13.3x) and fidelity maintenance rest on the assumption that simultaneous prediction of horizontal and vertical neighbors can be performed with sufficient accuracy; however, without any reported equations for the modified loss, details on how speculative tokens are accepted or rejected during raster-order generation, or ablation studies isolating the contribution of the 2D objective, the load-bearing mechanism cannot be verified from the provided text.
minor comments (1)
  1. The abstract supplies only high-level claims with no methods, quantitative error analysis, or verification details, which limits immediate assessment of the reported benchmarks and speedup factor.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and for highlighting the need for greater technical detail. We address the major comment below and will revise the manuscript to incorporate the requested clarifications.

read point-by-point responses
  1. Referee: The central speedup claim (up to 13.3x) and fidelity maintenance rest on the assumption that simultaneous prediction of horizontal and vertical neighbors can be performed with sufficient accuracy; however, without any reported equations for the modified loss, details on how speculative tokens are accepted or rejected during raster-order generation, or ablation studies isolating the contribution of the 2D objective, the load-bearing mechanism cannot be verified from the provided text.

    Authors: We agree that the current manuscript lacks sufficient detail on these points. The revised version will add: (1) the explicit equation for the modified training loss that jointly optimizes the next token together with its right-adjacent and below-adjacent neighbors; (2) a precise description (with pseudocode) of the inference procedure, including the criteria and mechanism for accepting or rejecting the spatially speculative tokens while preserving raster-order generation; and (3) ablation experiments that isolate the contribution of the 2D spatial objective to both the observed speedup and the preservation of generation quality on DPG-Bench and GenEval. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and context describe a speculative decoding framework that jointly predicts next, right, and below tokens to exploit 2D image correlations for faster inference. No equations, parameter-fitting steps, derivations, or self-citations appear that reduce any claimed prediction or result to its own inputs by construction. The speedup and quality claims are presented as empirical outcomes on benchmarks, with the 2D correlation assumption stated as an enabling premise rather than a fitted or self-defined quantity. The derivation chain is self-contained against external benchmarks with no load-bearing reductions to prior author work or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only access yields no identifiable free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5691 in / 1034 out tokens · 34701 ms · 2026-06-26T17:34:27.862405+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 15 linked inside Pith

  1. [1]

    Medusa: Simple llm inference acceleration framework with multiple decoding heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. InProceedings of the 41st International Conference on Machine Learning, pages 5209–5235, 2024

  2. [2]

    Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

  3. [3]

    Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

  4. [4]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

  5. [5]

    Break the sequential dependency of llm inference using lookahead decoding

    Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding. InProceedings of the 41st International Conference on Machine Learning, pages 14060–14079, 2024

  6. [6]

    Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

  7. [7]

    Fast r-cnn

    Ross Girshick. Fast r-cnn. InProceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015

  8. [8]

    Better & faster large language models via multi-token prediction

    Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction. InProceedings of the 41st International Conference on Machine Learning, pages 15706–15734, 2024

  9. [9]

    Zipar: Parallel autoregressive image generation through spatial locality

    Yefei He, Feng Chen, Yuanyu He, Shaoxuan He, Hong Zhou, Kaipeng Zhang, and Bohan Zhuang. Zipar: Parallel autoregressive image generation through spatial locality. InInternational Conference on Machine Learning, pages 22368–22378. PMLR, 2025

  10. [10]

    Neighboring autoregressive modeling for efficient visual generation

    Yefei He, Yuanyu He, Shaoxuan He, Feng Chen, Hong Zhou, Kaipeng Zhang, and Bohan Zhuang. Neighboring autoregressive modeling for efficient visual generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19000–19010, 2025

  11. [11]

    Distilling the knowledge in a neural network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  12. [12]

    Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

  13. [13]

    Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024. 10

  14. [14]

    Midjourney prompts dataset

    Huggingface. Midjourney prompts dataset. https://huggingface.co/datasets/vivym/ midjourney-prompts, 2024

  15. [15]

    Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  16. [16]

    Lantern: Accelerating visual autoregressive models with relaxed speculative decoding

    Doohyuk Jang, Sihwan Park, June Yong Yang, Yeonsung Jung, Jihun Yun, Souvik Kundu, Sung- Yub Kim, and Eunho Yang. Lantern: Accelerating visual autoregressive models with relaxed speculative decoding. InThe Thirteenth International Conference on Learning Representations, 2025

  17. [17]

    Cllms: consistency large language models

    Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, and Hao Zhang. Cllms: consistency large language models. InProceedings of the 41st International Conference on Machine Learning, pages 25426–25440, 2024

  18. [18]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InInternational Conference on Machine Learning, pages 19274–19286. PMLR, 2023

  19. [19]

    Autoregressive image generation with randomized parallel decoding

    Haopeng Li, Jinyue Yang, Guoqi Li, and Huan Wang. Autoregressive image generation with randomized parallel decoding. InThe Fourteenth International Conference on Learning Representations, 2026

  20. [20]

    Annealed relaxation of speculative decoding for faster autoregressive image generation.arXiv preprint arXiv:2601.09212, 2026

    Xingyao Li, Fengzhuo Zhang, Cunxiao Du, and Hui Ji. Annealed relaxation of speculative decoding for faster autoregressive image generation.arXiv preprint arXiv:2601.09212, 2026

  21. [21]

    Eagle: speculative sampling requires rethinking feature uncertainty

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: speculative sampling requires rethinking feature uncertainty. InProceedings of the 41st International Conference on Machine Learning, pages 28935–28948, 2024

  22. [22]

    Eagle-2: Faster inference of language models with dynamic draft trees

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 7421–7432, 2024

  23. [23]

    Eagle-3: Scaling up inference acceleration of large language models via training-time test.arXiv preprint arXiv:2503.01840, 2025

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test.arXiv preprint arXiv:2503.01840, 2025

  24. [24]

    Parallel jacobi decoding for fast autoregres- sive image generation

    Boya Liao, Ying Li, Siyong Jian, and Huan Wang. Parallel jacobi decoding for fast autoregres- sive image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9008–9018, 2026

  25. [25]

    Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  26. [26]

    Lumina-mgpt: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining.arXiv preprint arXiv:2408.02657, 2024

    Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yi Xin, Xinyue Li, Qi Qin, Yu Qiao, Hongsheng Li, and Peng Gao. Lumina-mgpt: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining.arXiv preprint arXiv:2408.02657, 2024

  27. [27]

    L-mtp: Leap multi-token prediction beyond adjacent context for large language models.arXiv preprint arXiv:2505.17505, 2025

    Xiaohao Liu, Xiaobo Xia, Weixiang Zhao, Manyi Zhang, Xianzhi Yu, Xiu Su, Shuo Yang, See-Kiong Ng, and Tat-Seng Chua. L-mtp: Leap multi-token prediction beyond adjacent context for large language models.arXiv preprint arXiv:2505.17505, 2025

  28. [28]

    Specinfer: Accelerating large language model serving with tree-based speculative inference and verification

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, et al. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Lang...

  29. [29]

    Multi-scale local speculative decoding for image generation.arXiv preprint arXiv:2601.05149, 2026

    Elia Peruzzo, Guillaume Sautière, and Amirhossein Habibian. Multi-scale local speculative decoding for image generation.arXiv preprint arXiv:2601.05149, 2026. 11

  30. [30]

    Efficiently scaling transformer inference

    Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of machine learning and systems, 5:606–624, 2023

  31. [31]

    Searching for activation functions.arXiv preprint arXiv:1710.05941, 2017

    Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions.arXiv preprint arXiv:1710.05941, 2017

  32. [32]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. InInternational conference on machine learning, pages 8821–8831. Pmlr, 2021

  33. [33]

    Accelerating transformer inference for translation via parallel decoding

    Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, and Emanuele Rodolà. Accelerating transformer inference for translation via parallel decoding. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12336–12355, 2023

  34. [34]

    Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

    Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

  35. [35]

    Grouped speculative decoding for autoregressive image generation

    Junhyuk So, Juncheol Shin, Hyunho Kook, and Eunhyeok Park. Grouped speculative decoding for autoregressive image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15375–15384, 2025

  36. [36]

    Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

    Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

  37. [37]

    Emu: Generative pretraining in multimodality

    Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Emu: Generative pretraining in multimodality. InThe Twelfth International Conference on Learning Representations, 2024

  38. [38]

    Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

  39. [39]

    Accelerating auto-regressive text-to-image generation with training-free speculative jacobi decoding

    Yao Teng, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, and Xihui Liu. Accelerating auto-regressive text-to-image generation with training-free speculative jacobi decoding. InThe Thirteenth International Conference on Learning Representations, 2025

  40. [40]

    Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

  41. [41]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  42. [42]

    Emu3: Next-token prediction is all you need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

  43. [43]

    Parallelized autoregressive visual generation

    Yuqing Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, and Xihui Liu. Parallelized autoregressive visual generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12955–12965, 2025

  44. [44]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025

  45. [45]

    Show-o: One single transformer to unify multimodal understanding and generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. InThe Thirteenth International Conference on Learning Representations, 2025. 12

  46. [46]

    Scaling autoregressive models for content-rich text-to-image generation.arXiv preprint arXiv:2206.10789, 2022

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation.arXiv preprint arXiv:2206.10789, 2022

  47. [47]

    Language model beats diffusion-tokenizer is key to visual generation

    Lijun Yu, Jose Lezama, Nitesh Bharadwaj Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion-tokenizer is key to visual generation. InThe Twelfth International Conference on Learning Representations, 2024

  48. [48]

    Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

    Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

  49. [49]

    Mul- timodal chain-of-thought reasoning in language models.Transactions on Machine Learning Research, 2024, 2024

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Mul- timodal chain-of-thought reasoning in language models.Transactions on Machine Learning Research, 2024, 2024

  50. [50]

    Locality-aware parallel decoding for efficient autoregressive image generation.arXiv preprint arXiv:2507.01957, 2025

    Zhuoyang Zhang, Luke J Huang, Chengyue Wu, Shang Yang, Kelly Peng, Yao Lu, and Song Han. Locality-aware parallel decoding for efficient autoregressive image generation.arXiv preprint arXiv:2507.01957, 2025. A Technical appendices and supplementary material This supplementary material provides additional details that complement the main text. All experi- m...