Recognition: 3 theorem links · Lean Theorem
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Pith reviewed 2026-05-11 22:03 UTC · model grok-4.3
The pith
Vanilla autoregressive models like Llama, built without any visual inductive biases, achieve state-of-the-art image generation once they are scaled properly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that a plain autoregressive transformer following the original Llama next-token prediction recipe, when paired with an image tokenizer that downsamples by 16 and reaches 0.94 rFID with 97 percent codebook usage, produces class-conditional images at 2.18 FID on ImageNet 256x256. This outperforms diffusion baselines such as LDM and DiT across model sizes from 111M to 3.1B parameters. A 775M text-conditional variant trained first on LAION-COCO then on high-aesthetic images matches leading methods in visual quality and text alignment. The same models also deliver 3-4x faster inference when run through existing LLM serving systems.
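Both the generation and tokenizer numbers are stated as Fréchet Inception Distance (FID) values. As background (the standard definition, not restated in the abstract), FID compares Inception-feature statistics of real and generated image sets:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
             + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

where (mu_r, Sigma_r) and (mu_g, Sigma_g) are the mean and covariance of Inception features over real and generated images; lower is better. The tokenizer's rFID applies the same metric to reconstructions versus the original images.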
What carries the argument
LlamaGen, a standard transformer that applies next-token prediction to a sequence of discrete tokens produced by a fixed image tokenizer with 16x spatial downsampling.
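A minimal sketch of that pipeline, using hypothetical names rather than the released LlamaGen code: a frozen tokenizer maps a 256x256 image to a 16x16 grid of discrete codes (16x downsampling), the grid is flattened into a 256-token sequence, and a decoder-only transformer is trained with plain next-token cross-entropy, exactly as in language modeling. Class conditioning is shown here as a prepended token, which is one common scheme and not necessarily the paper's exact one.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for the released components:
# `tokenizer.encode` maps images to discrete codes; `ar_model` is a
# decoder-only (Llama-style) transformer over the code vocabulary.

def train_step(ar_model, tokenizer, images, class_labels, optimizer):
    """One next-token-prediction step on tokenized images (sketch only)."""
    with torch.no_grad():
        # 256x256 image, 16x spatial downsampling -> 16x16 = 256 codes.
        codes = tokenizer.encode(images)                  # (B, 16, 16) int64
    seq = codes.flatten(1)                                # (B, 256), raster order

    # Prepend the class token, then predict every image token from its prefix.
    inp = torch.cat([class_labels[:, None], seq[:, :-1]], dim=1)
    logits = ar_model(inp)                                # (B, 256, vocab_size)

    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), seq.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Nothing in this loop is specific to images; the vision-specific work lives entirely in the tokenizer.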
If this is right
- Class-conditional generation reaches 2.18 FID on ImageNet 256x256 across scales up to 3.1B parameters.
- Text-conditional generation with 775M parameters achieves competitive visual quality and prompt alignment after staged training on large image-text datasets.
- Inference speeds up by 326 to 414 percent (roughly 3-4x) when the models are served through existing LLM serving frameworks; see the decoding sketch after this list.
- Model performance improves consistently with scale from 111M to 3.1B parameters without adding vision-specific components.
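The serving-speedup bullet follows from the fact that autoregressive image decoding is the same token-by-token loop that LLM servers already optimize, so KV caching, batching, and fused kernels transfer directly. A minimal sketch of that loop, with a hypothetical `forward_step` interface rather than any specific serving framework's API:

```python
import torch

@torch.no_grad()
def generate(ar_model, class_label, num_tokens=256, temperature=1.0):
    """Sample one image's worth of codes token by token (sketch only).

    The per-step work is identical to LLM decoding, which is why
    off-the-shelf LLM serving stacks accelerate it without modification.
    """
    tokens = [class_label]      # the conditioning token starts the sequence
    kv_cache = None             # hypothetical cache object owned by the model
    for _ in range(num_tokens):
        logits, kv_cache = ar_model.forward_step(tokens[-1], kv_cache)
        probs = torch.softmax(logits / temperature, dim=-1)
        tokens.append(torch.multinomial(probs, 1).item())
    return tokens[1:]           # 256 codes, detokenized downstream to a 16x16 grid
```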
Where Pith is reading between the lines
- A single transformer backbone could eventually handle both language modeling and image generation by sharing the same next-token objective and weights.
- The result questions whether diffusion's iterative denoising process is required for high-quality synthesis once tokenization and scale are sufficient.
- The same tokenizer and autoregressive setup could be tested directly on video frames or 3D representations to check if the approach generalizes beyond static images.
- Open release of the models lowers the barrier for combining language and vision capabilities in one architecture.
Load-bearing premise
The image tokenizer supplies enough visual information through its reconstruction quality that a transformer without any vision-specific layers or inductive biases can still learn to generate coherent high-fidelity images at scale.
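Concretely, the premise concerns a VQGAN-style tokenizer: a convolutional encoder downsamples the image 16x, each spatial feature is snapped to its nearest codebook entry, and a decoder reconstructs pixels from the chosen entries. A minimal sketch of the quantization step, with assumed tensor shapes (the paper's exact losses and architecture live in its released code):

```python
import torch

def quantize(encoder_features, codebook):
    """Nearest-codebook-entry quantization, VQGAN-style (sketch only).

    encoder_features: (B, H/16, W/16, D) continuous features from an encoder
    with 16x spatial downsampling; codebook: (K, D) learned code vectors.
    Returns the integer codes the AR transformer models as a sequence,
    plus the quantized features the decoder reconstructs from.
    """
    flat = encoder_features.reshape(-1, encoder_features.size(-1))  # (N, D)
    dists = torch.cdist(flat, codebook)                             # (N, K)
    codes = dists.argmin(dim=-1)                                    # (N,)
    quantized = codebook[codes].reshape(encoder_features.shape)
    return codes.reshape(encoder_features.shape[:-1]), quantized
```

Everything the transformer ever sees about an image is the integer grid returned here, which is why the 0.94 rFID reconstruction quality is load-bearing.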
What would settle it
A larger LlamaGen model, trained under the same recipe, that nonetheless produced FID scores worse than current diffusion models on ImageNet, or visibly lower-quality samples, would show that scaling alone is insufficient.
Original abstract
We introduce LlamaGen, a new family of image generation models that apply original ``next-token prediction'' paradigm of large language models to visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly. We reexamine design spaces of image tokenizers, scalability properties of image generation models, and their training data quality. The outcome of this exploration consists of: (1) An image tokenizer with downsample ratio of 16, reconstruction quality of 0.94 rFID and codebook usage of 97% on ImageNet benchmark. (2) A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on ImageNet 256x256 benchmarks, outperforming the popular diffusion models such as LDM, DiT. (3) A text-conditional image generation model with 775M parameters, from two-stage training on LAION-COCO and high aesthetics quality images, demonstrating competitive performance of visual quality and text alignment. (4) We verify the effectiveness of LLM serving frameworks in optimizing the inference speed of image generation models and achieve 326% - 414% speedup. We release all models and codes to facilitate open-source community of visual generation and multimodal foundation models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LlamaGen, a family of autoregressive image generation models that apply the standard next-token prediction paradigm from Llama-style transformers to visual tokens produced by a custom image tokenizer. The central claim is that vanilla autoregressive models without vision-specific inductive biases can achieve state-of-the-art performance on class-conditional ImageNet 256x256 generation (2.18 FID) when scaled to up to 3.1B parameters, outperforming diffusion models such as LDM and DiT; the work also reports an improved tokenizer (downsample ratio 16, 0.94 rFID, 97% codebook usage), a 775M text-conditional model, and inference speedups of 326-414% via LLM serving frameworks.
Significance. If the empirical results hold under rigorous verification, the work would be significant for challenging the prevailing view that diffusion models are required for high-fidelity image synthesis and for supporting the feasibility of unified autoregressive multimodal models. The authors receive credit for releasing all models and code, which directly aids reproducibility, and for supplying concrete benchmark numbers on both tokenizer reconstruction and downstream generation.
major comments (3)
- [Abstract] The claim that a standard Llama-style transformer without visual inductive biases can reach 2.18 FID rests on the assumption that the custom tokenizer (0.94 rFID) supplies nearly complete visual information. The manuscript does not report ablations that replace this tokenizer with a standard VQGAN-style tokenizer of higher rFID while keeping the AR backbone fixed; without such controls it is impossible to isolate whether the reported gains derive from AR scaling or from the tokenizer's design and fidelity.
- [Results section] The abstract states that a series of models from 111M to 3.1B parameters was trained and that scalability properties were reexamined, yet no scaling curves, per-size FID tables, or controlled ablations on training data quality are provided. This absence weakens the ability to verify that performance improves predictably with scale rather than with other unstated factors.
- [Methods] Full training hyperparameters, exact architectural modifications (if any) to the Llama backbone for discrete image tokens, and the precise evaluation protocol used for the 2.18 FID number (including guidance scale and sampling steps) are not detailed. These omissions are load-bearing for confirming that the model is truly vanilla and that comparisons with DiT and LDM are matched on compute and data.
minor comments (2)
- [Abstract] The term 'original next-token prediction paradigm' should be clarified to indicate whether any vision-specific positional encodings or token embeddings were introduced.
- The manuscript should include a precise definition of rFID and codebook usage in the main text or a dedicated section, along with the exact ImageNet split used for tokenizer evaluation; the conventional readings of both metrics are sketched just after this list.
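For reference, the conventional definitions (an assumption about what the manuscript intends, not a quotation from it) are: codebook usage is the fraction of codebook entries selected at least once over the evaluation set, and rFID is ordinary FID computed between validation images and their tokenizer reconstructions.

```python
import torch

def codebook_usage(all_codes, codebook_size):
    """Fraction of codebook entries used at least once over an eval set."""
    used = torch.unique(all_codes.flatten())
    return used.numel() / codebook_size      # e.g. 0.97 corresponds to 97% usage

def reconstruction_fid(originals, reconstructions, fid_metric):
    """rFID: standard FID between images and their tokenizer reconstructions.

    `fid_metric` is any FID implementation with an update/compute interface
    (torchmetrics-style); it is treated as a black box here, not a specific API.
    """
    fid_metric.update(originals, real=True)
    fid_metric.update(reconstructions, real=False)
    return fid_metric.compute()
```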
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. We address each major comment below and will revise the manuscript to improve clarity, completeness, and reproducibility while preserving the core empirical claims.
Point-by-point responses
-
Referee: [Abstract] The claim that a standard Llama-style transformer without visual inductive biases can reach 2.18 FID rests on the assumption that the custom tokenizer (0.94 rFID) supplies nearly complete visual information. The manuscript does not report ablations that replace this tokenizer with a standard VQGAN-style tokenizer of higher rFID while keeping the AR backbone fixed; without such controls it is impossible to isolate whether the reported gains derive from AR scaling or from the tokenizer's design and fidelity.
Authors: We acknowledge that a controlled ablation swapping our tokenizer for a standard VQGAN while holding the AR backbone fixed would provide stronger isolation of contributions. Our work treats the tokenizer as an integral part of reexamining design spaces for image generation, and the reported tokenizer (downsample ratio 16, 0.94 rFID, 97% codebook usage) is a deliberate improvement over prior VQGAN baselines. The central finding remains that a vanilla Llama-style AR model, paired with this tokenizer, scales to 2.18 FID. In revision we will add explicit discussion of this point and note the substantial compute required for additional full-scale ablations. revision: partial
-
Referee: [Results section] The abstract states that a series of models from 111M to 3.1B parameters was trained and that scalability properties were reexamined, yet no scaling curves, per-size FID tables, or controlled ablations on training data quality are provided. This absence weakens the ability to verify that performance improves predictably with scale rather than with other unstated factors.
Authors: The manuscript states that models spanning 111M to 3.1B parameters were trained and that scalability was reexamined. To make the scaling behavior fully verifiable, we will include explicit scaling curves and a per-model-size FID table in the revised results section. We will also expand the description of training data and any data-quality controls that were performed. revision: yes
-
Referee: [Methods] Full training hyperparameters, exact architectural modifications (if any) to the Llama backbone for discrete image tokens, and the precise evaluation protocol used for the 2.18 FID number (including guidance scale and sampling steps) are not detailed. These omissions are load-bearing for confirming that the model is truly vanilla and that comparisons with DiT and LDM are matched on compute and data.
Authors: We agree these details are essential. In the revised manuscript we will add a dedicated appendix containing the complete set of training hyperparameters, confirm that the Llama backbone receives only the minimal adaptation required to handle a discrete image-token vocabulary (no vision-specific inductive biases), and specify the exact classifier-free guidance scale and sampling steps used to obtain the reported 2.18 FID. These additions, together with the already-released code and models, will allow direct verification of the vanilla nature of the architecture and the fairness of the comparisons. revision: yes
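For concreteness, the guidance protocol the referee asks about is classifier-free guidance adapted to autoregressive decoding: at every step the model is evaluated once with the real condition and once with a learned null condition, and the two sets of logits are mixed with a scale s. A minimal sketch follows; the scale actually used for the 2.18 FID result is exactly the number the revision promises to report.

```python
import torch

@torch.no_grad()
def cfg_logits(ar_model, prefix_cond, prefix_uncond, s=1.5):
    """Classifier-free guidance for one AR decoding step (sketch only).

    prefix_cond:   token prefix with the real class/text condition prepended.
    prefix_uncond: same prefix with a learned "null" condition instead.
    s:             guidance scale; s = 1.0 disables guidance.
    """
    logits_cond = ar_model(prefix_cond)[:, -1]       # (B, vocab)
    logits_uncond = ar_model(prefix_uncond)[:, -1]   # (B, vocab)
    # Standard CFG combination, applied to logits rather than diffusion scores.
    return logits_uncond + s * (logits_cond - logits_uncond)
```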
Circularity Check
No circularity: empirical benchmarks on external datasets
Full rationale
The paper reports training Llama-style autoregressive transformers on images tokenized by a custom VQ-style encoder and measures performance via FID on ImageNet 256x256 and LAION-COCO. All central numbers (2.18 FID, 0.94 rFID, parameter counts, speedups) are direct outputs of model training and evaluation against public benchmarks. No equations, predictions, or uniqueness claims reduce by construction to fitted parameters inside the paper; the tokenizer design space is explored empirically rather than assumed. This is a standard scaling experiment whose claims remain falsifiable by independent replication.
Axiom & Free-Parameter Ledger
free parameters (2)
- tokenizer downsample ratio
- model parameter counts
axioms (1)
- domain assumption: a standard transformer decoder can model image token sequences without additional visual inductive biases
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced (tag: echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Linked passage: "vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly"
- IndisputableMonolith.Foundation.DimensionForcing.dimension_forced (tag: unclear)
  UNCLEAR: relation between the paper passage and the cited Recognition theorem.
  Linked passage: "A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on ImageNet 256x256 benchmarks, outperforming the popular diffusion models such as LDM, DiT"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 41 Pith papers
-
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...
-
ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models
ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.
-
Normalizing Trajectory Models
NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.
-
Normalizing Trajectory Models
NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.
-
Autoregressive Visual Generation Needs a Prologue
Prologue introduces dedicated prologue tokens to decouple generation and reconstruction in AR visual models, significantly improving generation FID scores on ImageNet while maintaining reconstruction quality.
-
BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps
BEAT tokenizes symbolic music by uniform beat steps with sparse per-beat pitch encodings, producing higher quality and more coherent music continuation and accompaniment than event-based tokenizations.
-
Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models
Masked Logit Nudging aligns visual autoregressive model logits with source token maps under target prompts inside cross-attention masks, delivering top image editing results on PIE benchmarks and strong reconstruction...
-
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.
-
Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting
Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.
-
HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling
HeatKV ranks attention heads by their focus on prior scales using offline calibration data and applies a static per-head pruning schedule, delivering 2x higher KV-cache compression than prior methods on the Infinity-2...
-
InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.
-
Do multimodal models imagine electric sheep?
Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.
-
FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation
FlashAR achieves up to 22.9x speedup in 512x512 autoregressive image generation by post-training a pre-trained model with a complementary vertical head and dynamic fusion using only 0.05% of original training data.
-
FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation
FlashAR accelerates autoregressive image generation up to 22.9x by post-training a pre-trained raster-scan model with a complementary vertical head and dynamic fusion for two-way next-token prediction.
-
dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models
dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
-
CASCADE: Context-Aware Relaxation for Speculative Image Decoding
CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
-
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
-
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.
-
VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.
-
PILOT: One Physics-Integrated Generation Framework to Unify 2D and 3D Radio Map Construction
PILOT unifies 2D and 3D radio map generation via physics-guided wavefront autoregressive prediction, reporting lowest NMSE on 2D benchmarks and 78% NMSE reduction with 2500x faster inference than diffusion baselines for 3D.
-
Normalizing Flows with Iterative Denoising
iTARFlow augments normalizing flows with diffusion-style iterative denoising during sampling while preserving end-to-end likelihood training, reaching competitive results on ImageNet 64/128/256.
-
Generative Refinement Networks for Visual Synthesis
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
-
Combating Pattern and Content Bias: Adversarial Feature Learning for Generalized AI-Generated Image Detection
MAFL uses adversarial training to suppress pattern and content biases, guiding models to learn shared generative features for better cross-model generalization in detecting AI images.
-
On the Robustness of Watermarking for Autoregressive Image Generation
Watermarking schemes for autoregressive image generation fail against removal and forgery attacks, enabling false detections and undermining synthetic content filtering.
-
Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression
RDVQ enables joint rate-distortion optimization for vector-quantized generative image compression via differentiable codebook distribution relaxation and an autoregressive entropy model.
-
SMART: When is it Actually Worth Expanding a Speculative Tree?
SMART uses marginal benefit-cost analysis to dynamically build efficient speculative trees, achieving 15-20% additional speedup in LLM and MLLM inference.
-
TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders
TC-AE improves reconstruction and generative performance in deep compression by decomposing token-to-latent compression into two stages and using joint self-supervised training.
-
Multimodal Large Language Models for Multi-Subject In-Context Image Generation
MUSIC is the first MLLM for multi-subject in-context image generation that uses an automatic data pipeline, vision chain-of-thought reasoning, and semantics-driven spatial layout planning to outperform prior methods o...
-
MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation
MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.
-
ImgEdit: A Unified Image Editing Dataset and Benchmark
ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
-
MMaDA: Multimodal Large Diffusion Language Models
MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-im...
-
Emu3: Next-Token Prediction is All You Need
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
-
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
-
UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection
UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.
-
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.
-
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.
-
Show-o2: Improved Native Unified Multimodal Models
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
-
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...
-
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.