Recognition: 3 theorem links · Lean Theorem
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Pith reviewed 2026-05-11 22:03 UTC · model grok-4.3
The pith
Vanilla autoregressive models like Llama, built without any visual inductive biases, achieve state-of-the-art image generation once they are scaled properly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that a plain autoregressive transformer following the original Llama next-token prediction recipe, when paired with an image tokenizer that downsamples by 16 and reaches 0.94 rFID with 97 percent codebook usage, produces class-conditional images at 2.18 FID on ImageNet 256x256. This outperforms diffusion baselines such as LDM and DiT across model sizes from 111M to 3.1B parameters. A 775M text-conditional variant trained first on LAION-COCO then on high-aesthetic images matches leading methods in visual quality and text alignment. The same models also deliver 3-4x faster inference when run through existing LLM serving systems.
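Both the generation and tokenizer numbers are stated as Fréchet Inception Distance (FID) values. As background (the standard definition, not restated in the abstract), FID compares Inception-feature statistics of real and generated image sets:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
             + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

where (mu_r, Sigma_r) and (mu_g, Sigma_g) are the mean and covariance of Inception features over real and generated images; lower is better. The tokenizer's rFID applies the same metric to reconstructions versus the original images.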
What carries the argument
LlamaGen, a standard transformer that applies next-token prediction to a sequence of discrete tokens produced by a fixed image tokenizer with 16x spatial downsampling.
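A minimal sketch of that pipeline, using hypothetical names rather than the released LlamaGen code: a frozen tokenizer maps a 256x256 image to a 16x16 grid of discrete codes (16x downsampling), the grid is flattened into a 256-token sequence, and a decoder-only transformer is trained with plain next-token cross-entropy, exactly as in language modeling. Class conditioning is shown here as a prepended token, which is one common scheme and not necessarily the paper's exact one.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for the released components:
# `tokenizer.encode` maps images to discrete codes; `ar_model` is a
# decoder-only (Llama-style) transformer over the code vocabulary.

def train_step(ar_model, tokenizer, images, class_labels, optimizer):
    """One next-token-prediction step on tokenized images (sketch only)."""
    with torch.no_grad():
        # 256x256 image, 16x spatial downsampling -> 16x16 = 256 codes.
        codes = tokenizer.encode(images)                  # (B, 16, 16) int64
    seq = codes.flatten(1)                                # (B, 256), raster order

    # Prepend the class token, then predict every image token from its prefix.
    inp = torch.cat([class_labels[:, None], seq[:, :-1]], dim=1)
    logits = ar_model(inp)                                # (B, 256, vocab_size)

    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), seq.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Nothing in this loop is specific to images; the vision-specific work lives entirely in the tokenizer.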
If this is right
- Class-conditional generation reaches 2.18 FID on ImageNet 256x256 across scales up to 3.1B parameters.
- Text-conditional generation with 775M parameters achieves competitive visual quality and prompt alignment after staged training on large image-text datasets.
- Inference speeds up by 326 to 414 percent (roughly 3-4x) when the models are served through existing LLM serving frameworks; see the decoding sketch after this list.
- Model performance improves consistently with scale from 111M to 3.1B parameters without adding vision-specific components.
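The serving-speedup bullet follows from the fact that autoregressive image decoding is the same token-by-token loop that LLM servers already optimize, so KV caching, batching, and fused kernels transfer directly. A minimal sketch of that loop, with a hypothetical `forward_step` interface rather than any specific serving framework's API:

```python
import torch

@torch.no_grad()
def generate(ar_model, class_label, num_tokens=256, temperature=1.0):
    """Sample one image's worth of codes token by token (sketch only).

    The per-step work is identical to LLM decoding, which is why
    off-the-shelf LLM serving stacks accelerate it without modification.
    """
    tokens = [class_label]      # the conditioning token starts the sequence
    kv_cache = None             # hypothetical cache object owned by the model
    for _ in range(num_tokens):
        logits, kv_cache = ar_model.forward_step(tokens[-1], kv_cache)
        probs = torch.softmax(logits / temperature, dim=-1)
        tokens.append(torch.multinomial(probs, 1).item())
    return tokens[1:]           # 256 codes, detokenized downstream to a 16x16 grid
```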
Where Pith is reading between the lines
- A single transformer backbone could eventually handle both language modeling and image generation by sharing the same next-token objective and weights.
- The result questions whether diffusion's iterative denoising process is required for high-quality synthesis once tokenization and scale are sufficient.
- The same tokenizer and autoregressive setup could be tested directly on video frames or 3D representations to check if the approach generalizes beyond static images.
- Open release of the models lowers the barrier for combining language and vision capabilities in one architecture.
Load-bearing premise
The image tokenizer supplies enough visual information through its reconstruction quality that a transformer without any vision-specific layers or inductive biases can still learn to generate coherent high-fidelity images at scale.
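Concretely, the premise concerns a VQGAN-style tokenizer: a convolutional encoder downsamples the image 16x, each spatial feature is snapped to its nearest codebook entry, and a decoder reconstructs pixels from the chosen entries. A minimal sketch of the quantization step, with assumed tensor shapes (the paper's exact losses and architecture live in its released code):

```python
import torch

def quantize(encoder_features, codebook):
    """Nearest-codebook-entry quantization, VQGAN-style (sketch only).

    encoder_features: (B, H/16, W/16, D) continuous features from an encoder
    with 16x spatial downsampling; codebook: (K, D) learned code vectors.
    Returns the integer codes the AR transformer models as a sequence,
    plus the quantized features the decoder reconstructs from.
    """
    flat = encoder_features.reshape(-1, encoder_features.size(-1))  # (N, D)
    dists = torch.cdist(flat, codebook)                             # (N, K)
    codes = dists.argmin(dim=-1)                                    # (N,)
    quantized = codebook[codes].reshape(encoder_features.shape)
    return codes.reshape(encoder_features.shape[:-1]), quantized
```

Everything the transformer ever sees about an image is the integer grid returned here, which is why the 0.94 rFID reconstruction quality is load-bearing.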
What would settle it
A larger LlamaGen model, trained under the same recipe, that nonetheless produced FID scores worse than current diffusion models on ImageNet, or visibly lower-quality samples, would show that scaling alone is insufficient.
Original abstract
We introduce LlamaGen, a new family of image generation models that apply original ``next-token prediction'' paradigm of large language models to visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly. We reexamine design spaces of image tokenizers, scalability properties of image generation models, and their training data quality. The outcome of this exploration consists of: (1) An image tokenizer with downsample ratio of 16, reconstruction quality of 0.94 rFID and codebook usage of 97% on ImageNet benchmark. (2) A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on ImageNet 256x256 benchmarks, outperforming the popular diffusion models such as LDM, DiT. (3) A text-conditional image generation model with 775M parameters, from two-stage training on LAION-COCO and high aesthetics quality images, demonstrating competitive performance of visual quality and text alignment. (4) We verify the effectiveness of LLM serving frameworks in optimizing the inference speed of image generation models and achieve 326% - 414% speedup. We release all models and codes to facilitate open-source community of visual generation and multimodal foundation models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LlamaGen, a family of autoregressive image generation models that apply the standard next-token prediction paradigm from Llama-style transformers to visual tokens produced by a custom image tokenizer. The central claim is that vanilla autoregressive models without vision-specific inductive biases can achieve state-of-the-art performance on class-conditional ImageNet 256x256 generation (2.18 FID) when scaled to up to 3.1B parameters, outperforming diffusion models such as LDM and DiT; the work also reports an improved tokenizer (downsample ratio 16, 0.94 rFID, 97% codebook usage), a 775M text-conditional model, and inference speedups of 326-414% via LLM serving frameworks.
Significance. If the empirical results hold under rigorous verification, the work would be significant for challenging the prevailing view that diffusion models are required for high-fidelity image synthesis and for supporting the feasibility of unified autoregressive multimodal models. The authors receive credit for releasing all models and code, which directly aids reproducibility, and for supplying concrete benchmark numbers on both tokenizer reconstruction and downstream generation.
major comments (3)
- [Abstract] The claim that a standard Llama-style transformer without visual inductive biases can reach 2.18 FID rests on the assumption that the custom tokenizer (0.94 rFID) supplies nearly complete visual information. The manuscript does not report ablations that replace this tokenizer with a standard VQGAN-style tokenizer of higher rFID while keeping the AR backbone fixed; without such controls it is impossible to isolate whether the reported gains derive from AR scaling or from the tokenizer's design and fidelity.
- [Results section] The abstract states that a series of models from 111M to 3.1B parameters was trained and that scalability properties were reexamined, yet no scaling curves, per-size FID tables, or controlled ablations on training data quality are provided. This absence weakens the ability to verify that performance improves predictably with scale rather than with other unstated factors.
- [Methods] Full training hyperparameters, exact architectural modifications (if any) to the Llama backbone for discrete image tokens, and the precise evaluation protocol used for the 2.18 FID number (including guidance scale and sampling steps) are not detailed. These omissions are load-bearing for confirming that the model is truly vanilla and that comparisons with DiT and LDM are matched on compute and data.
minor comments (2)
- [Abstract] The term 'original next-token prediction paradigm' should be clarified to indicate whether any vision-specific positional encodings or token embeddings were introduced.
- The manuscript should include a precise definition of rFID and codebook usage in the main text or a dedicated section, along with the exact ImageNet split used for tokenizer evaluation; the conventional readings of both metrics are sketched just after this list.
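For reference, the conventional definitions (an assumption about what the manuscript intends, not a quotation from it) are: codebook usage is the fraction of codebook entries selected at least once over the evaluation set, and rFID is ordinary FID computed between validation images and their tokenizer reconstructions.

```python
import torch

def codebook_usage(all_codes, codebook_size):
    """Fraction of codebook entries used at least once over an eval set."""
    used = torch.unique(all_codes.flatten())
    return used.numel() / codebook_size      # e.g. 0.97 corresponds to 97% usage

def reconstruction_fid(originals, reconstructions, fid_metric):
    """rFID: standard FID between images and their tokenizer reconstructions.

    `fid_metric` is any FID implementation with an update/compute interface
    (torchmetrics-style); it is treated as a black box here, not a specific API.
    """
    fid_metric.update(originals, real=True)
    fid_metric.update(reconstructions, real=False)
    return fid_metric.compute()
```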
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. We address each major comment below and will revise the manuscript to improve clarity, completeness, and reproducibility while preserving the core empirical claims.
Point-by-point responses
-
Referee: [Abstract] The claim that a standard Llama-style transformer without visual inductive biases can reach 2.18 FID rests on the assumption that the custom tokenizer (0.94 rFID) supplies nearly complete visual information. The manuscript does not report ablations that replace this tokenizer with a standard VQGAN-style tokenizer of higher rFID while keeping the AR backbone fixed; without such controls it is impossible to isolate whether the reported gains derive from AR scaling or from the tokenizer's design and fidelity.
Authors: We acknowledge that a controlled ablation swapping our tokenizer for a standard VQGAN while holding the AR backbone fixed would provide stronger isolation of contributions. Our work treats the tokenizer as an integral part of reexamining design spaces for image generation, and the reported tokenizer (downsample ratio 16, 0.94 rFID, 97% codebook usage) is a deliberate improvement over prior VQGAN baselines. The central finding remains that a vanilla Llama-style AR model, paired with this tokenizer, scales to 2.18 FID. In revision we will add explicit discussion of this point and note the substantial compute required for additional full-scale ablations. revision: partial
-
Referee: [Results section] The abstract states that a series of models from 111M to 3.1B parameters was trained and that scalability properties were reexamined, yet no scaling curves, per-size FID tables, or controlled ablations on training data quality are provided. This absence weakens the ability to verify that performance improves predictably with scale rather than with other unstated factors.
Authors: The manuscript states that models spanning 111M to 3.1B parameters were trained and that scalability was reexamined. To make the scaling behavior fully verifiable, we will include explicit scaling curves and a per-model-size FID table in the revised results section. We will also expand the description of training data and any data-quality controls that were performed. revision: yes
-
Referee: [Methods] Full training hyperparameters, exact architectural modifications (if any) to the Llama backbone for discrete image tokens, and the precise evaluation protocol used for the 2.18 FID number (including guidance scale and sampling steps) are not detailed. These omissions are load-bearing for confirming that the model is truly vanilla and that comparisons with DiT and LDM are matched on compute and data.
Authors: We agree these details are essential. In the revised manuscript we will add a dedicated appendix containing the complete set of training hyperparameters, confirm that the Llama backbone receives only the minimal adaptation required to handle a discrete image-token vocabulary (no vision-specific inductive biases), and specify the exact classifier-free guidance scale and sampling steps used to obtain the reported 2.18 FID. These additions, together with the already-released code and models, will allow direct verification of the vanilla nature of the architecture and the fairness of the comparisons. revision: yes
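For concreteness, the guidance protocol the referee asks about is classifier-free guidance adapted to autoregressive decoding: at every step the model is evaluated once with the real condition and once with a learned null condition, and the two sets of logits are mixed with a scale s. A minimal sketch follows; the scale actually used for the 2.18 FID result is exactly the number the revision promises to report.

```python
import torch

@torch.no_grad()
def cfg_logits(ar_model, prefix_cond, prefix_uncond, s=1.5):
    """Classifier-free guidance for one AR decoding step (sketch only).

    prefix_cond:   token prefix with the real class/text condition prepended.
    prefix_uncond: same prefix with a learned "null" condition instead.
    s:             guidance scale; s = 1.0 disables guidance.
    """
    logits_cond = ar_model(prefix_cond)[:, -1]       # (B, vocab)
    logits_uncond = ar_model(prefix_uncond)[:, -1]   # (B, vocab)
    # Standard CFG combination, applied to logits rather than diffusion scores.
    return logits_uncond + s * (logits_cond - logits_uncond)
```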
Circularity Check
No circularity: empirical benchmarks on external datasets
Full rationale
The paper reports training Llama-style autoregressive transformers on images tokenized by a custom VQ-style encoder and measures performance via FID on ImageNet 256x256 and LAION-COCO. All central numbers (2.18 FID, 0.94 rFID, parameter counts, speedups) are direct outputs of model training and evaluation against public benchmarks. No equations, predictions, or uniqueness claims reduce by construction to fitted parameters inside the paper; the tokenizer design space is explored empirically rather than assumed. This is a standard scaling experiment whose claims remain falsifiable by independent replication.
Axiom & Free-Parameter Ledger
free parameters (2)
- tokenizer downsample ratio
- model parameter counts
axioms (1)
- domain assumption: a standard transformer decoder can model image token sequences without additional visual inductive biases
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced (tag: echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Linked passage: "vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly"
- IndisputableMonolith.Foundation.DimensionForcing.dimension_forced (tag: unclear)
  UNCLEAR: relation between the paper passage and the cited Recognition theorem.
  Linked passage: "A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on ImageNet 256x256 benchmarks, outperforming the popular diffusion models such as LDM, DiT"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 41 Pith papers
-
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...
-
ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models
ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.
-
Normalizing Trajectory Models
NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.
-
Normalizing Trajectory Models
NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.
-
Autoregressive Visual Generation Needs a Prologue
Prologue introduces dedicated prologue tokens to decouple generation and reconstruction in AR visual models, significantly improving generation FID scores on ImageNet while maintaining reconstruction quality.
-
BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps
BEAT tokenizes symbolic music by uniform beat steps with sparse per-beat pitch encodings, producing higher quality and more coherent music continuation and accompaniment than event-based tokenizations.
-
Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models
Masked Logit Nudging aligns visual autoregressive model logits with source token maps under target prompts inside cross-attention masks, delivering top image editing results on PIE benchmarks and strong reconstruction...
-
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.
-
Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting
Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.
-
HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling
HeatKV ranks attention heads by their focus on prior scales using offline calibration data and applies a static per-head pruning schedule, delivering 2x higher KV-cache compression than prior methods on the Infinity-2...
-
InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.
-
Do multimodal models imagine electric sheep?
Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.
-
FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation
FlashAR achieves up to 22.9x speedup in 512x512 autoregressive image generation by post-training a pre-trained model with a complementary vertical head and dynamic fusion using only 0.05% of original training data.
-
FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation
FlashAR accelerates autoregressive image generation up to 22.9x by post-training a pre-trained raster-scan model with a complementary vertical head and dynamic fusion for two-way next-token prediction.
-
dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models
dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
-
CASCADE: Context-Aware Relaxation for Speculative Image Decoding
CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
-
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
-
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.
-
VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.
-
PILOT: One Physics-Integrated Generation Framework to Unify 2D and 3D Radio Map Construction
PILOT unifies 2D and 3D radio map generation via physics-guided wavefront autoregressive prediction, reporting lowest NMSE on 2D benchmarks and 78% NMSE reduction with 2500x faster inference than diffusion baselines for 3D.
-
Normalizing Flows with Iterative Denoising
iTARFlow augments normalizing flows with diffusion-style iterative denoising during sampling while preserving end-to-end likelihood training, reaching competitive results on ImageNet 64/128/256.
-
Generative Refinement Networks for Visual Synthesis
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
-
Combating Pattern and Content Bias: Adversarial Feature Learning for Generalized AI-Generated Image Detection
MAFL uses adversarial training to suppress pattern and content biases, guiding models to learn shared generative features for better cross-model generalization in detecting AI images.
-
On the Robustness of Watermarking for Autoregressive Image Generation
Watermarking schemes for autoregressive image generation fail against removal and forgery attacks, enabling false detections and undermining synthetic content filtering.
-
Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression
RDVQ enables joint rate-distortion optimization for vector-quantized generative image compression via differentiable codebook distribution relaxation and an autoregressive entropy model.
-
SMART: When is it Actually Worth Expanding a Speculative Tree?
SMART uses marginal benefit-cost analysis to dynamically build efficient speculative trees, achieving 15-20% additional speedup in LLM and MLLM inference.
-
TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders
TC-AE improves reconstruction and generative performance in deep compression by decomposing token-to-latent compression into two stages and using joint self-supervised training.
-
Multimodal Large Language Models for Multi-Subject In-Context Image Generation
MUSIC is the first MLLM for multi-subject in-context image generation that uses an automatic data pipeline, vision chain-of-thought reasoning, and semantics-driven spatial layout planning to outperform prior methods o...
-
MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation
MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.
-
ImgEdit: A Unified Image Editing Dataset and Benchmark
ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
-
MMaDA: Multimodal Large Diffusion Language Models
MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-im...
-
Emu3: Next-Token Prediction is All You Need
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
-
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
-
UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection
UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.
-
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.
-
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.
-
Show-o2: Improved Native Unified Multimodal Models
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
-
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...
-
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.