pith. machine review for the scientific record.

arxiv: 2406.06525 · v1 · submitted 2024-06-10 · 💻 cs.CV

Recognition: 3 theorem links

· Lean Theorem

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation


Pith reviewed 2026-05-11 22:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords autoregressive image generation · next-token prediction · image tokenizer · Llama architecture · diffusion models · ImageNet benchmarks · class-conditional generation · text-conditional generation

The pith

Vanilla autoregressive models like Llama achieve state-of-the-art image generation when scaled without visual biases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether the next-token prediction paradigm from large language models can be applied directly to images and match or exceed specialized diffusion models. The authors create LlamaGen by pairing a high-quality image tokenizer with a standard Llama-style transformer that predicts image tokens sequentially. The class-conditional family spans 111 million to 3.1 billion parameters and reports an FID of 2.18 on ImageNet 256x256, ahead of LDM and DiT, while a 775 million parameter text-conditional version shows competitive quality after two-stage training. If the result holds, image generation would no longer require diffusion-specific designs and could instead reuse the same scalable transformer backbone used for language.

Core claim

The paper shows that a plain autoregressive transformer following the original Llama next-token prediction recipe, when paired with an image tokenizer that downsamples by 16 and reaches 0.94 rFID with 97 percent codebook usage, produces class-conditional images at 2.18 FID on ImageNet 256x256. The model family, which spans 111M to 3.1B parameters, outperforms diffusion baselines such as LDM and DiT on that benchmark. A 775M text-conditional variant trained first on LAION-COCO and then on high-aesthetic images is reported as competitive in visual quality and text alignment. The same models also see a 326 to 414 percent inference speedup when run through existing LLM serving frameworks.
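
For reference, the FID figures quoted here follow the standard Fréchet Inception Distance, which compares Gaussian fits to Inception features of two image sets. The symbols below are notation introduced for this note, not taken from the paper: $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the feature mean and covariance of the reference set and the generated set.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

rFID is commonly the same quantity computed with tokenizer reconstructions in place of generated samples. Lower is better in both cases. The exact evaluation split and sample count behind the 0.94 and 2.18 figures are not stated in this summary.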

What carries the argument

LlamaGen, a standard transformer that applies next-token prediction to a sequence of discrete tokens produced by a fixed image tokenizer with 16x spatial downsampling.
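
The mechanism above can be sketched as a plain sampling loop. This is a minimal illustration, not the authors' implementation: `toy_logits_fn` is a hypothetical stand-in for the transformer forward pass, the 8-entry vocabulary is arbitrary, and raster-order generation is an assumption of this sketch.

```python
import math
import random


def sample_next_token(logits, rng, temperature=1.0):
    """Sample one token id from unnormalized logits via softmax sampling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate_image_tokens(logits_fn, class_token, num_tokens, rng):
    """Autoregressively extend a sequence that starts with a condition token.

    logits_fn(prefix) must return one logit per codebook entry for the next
    position; in a real system this would be the transformer forward pass.
    """
    sequence = [class_token]
    for _ in range(num_tokens):
        sequence.append(sample_next_token(logits_fn(sequence), rng))
    return sequence[1:]  # drop the condition token, keep the image tokens


def toy_logits_fn(prefix, vocab_size=8):
    """Hypothetical stand-in model: favors the token after the previous one."""
    favored = (prefix[-1] + 1) % vocab_size
    return [4.0 if i == favored else 0.0 for i in range(vocab_size)]


# A 256x256 image with 16x downsampling gives a 16x16 grid of tokens.
grid_side = 256 // 16
tokens = generate_image_tokens(toy_logits_fn, class_token=0,
                               num_tokens=grid_side * grid_side,
                               rng=random.Random(0))
# Tokens are assumed to come out in raster order and are reshaped into rows.
token_grid = [tokens[r * grid_side:(r + 1) * grid_side] for r in range(grid_side)]
```

In a full pipeline the token grid would then be passed to the tokenizer's decoder to produce pixels; that step is omitted here.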

If this is right

  • Class-conditional generation reaches 2.18 FID on ImageNet 256x256 with models scaled up to 3.1B parameters.
  • Text-conditional generation with 775M parameters achieves competitive visual quality and prompt alignment after staged training on large image-text datasets.
  • Inference speed increases 326 to 414 percent by reusing existing LLM serving frameworks.
  • Model performance improves consistently with scale from 111M to 3.1B parameters without adding vision-specific components.
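
A percentage speedup such as the 326 to 414 percent figure can be read in two ways, and this summary does not say which convention the paper uses. The short sketch below, with a hypothetical helper name, shows the latency ratio each reading implies.

```python
def speedup_factor(percent, convention):
    """Convert a quoted speedup percentage into a ratio (old time / new time).

    convention="ratio":    326% means the new system is 3.26x as fast.
    convention="increase": 326% means speed rose by 326%, i.e. 4.26x.
    """
    if convention == "ratio":
        return percent / 100.0
    if convention == "increase":
        return 1.0 + percent / 100.0
    raise ValueError(f"unknown convention: {convention!r}")


for percent in (326, 414):
    as_ratio = speedup_factor(percent, "ratio")
    as_increase = speedup_factor(percent, "increase")
    print(f"{percent}%: {as_ratio:.2f}x (ratio) or {as_increase:.2f}x (increase)")
```

Either reading supports the qualitative claim of a several-fold gain; the difference matters only when comparing against other reported numbers.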

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A single transformer backbone could eventually handle both language modeling and image generation by sharing the same next-token objective and weights.
  • The result questions whether diffusion's iterative denoising process is required for high-quality synthesis once tokenization and scale are sufficient.
  • The same tokenizer and autoregressive setup could be tested directly on video frames or 3D representations to check if the approach generalizes beyond static images.
  • Open release of the models lowers the barrier for combining language and vision capabilities in one architecture.

Load-bearing premise

The image tokenizer supplies enough visual information through its reconstruction quality that a transformer without any vision-specific layers or inductive biases can still learn to generate coherent high-fidelity images at scale.

What would settle it

If a larger LlamaGen model trained on the same schedule produced worse FID scores than current diffusion models on ImageNet, or visibly lower-quality samples, that would show that scaling alone is insufficient.

Original abstract

We introduce LlamaGen, a new family of image generation models that apply original "next-token prediction" paradigm of large language models to visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly. We reexamine design spaces of image tokenizers, scalability properties of image generation models, and their training data quality. The outcome of this exploration consists of: (1) An image tokenizer with downsample ratio of 16, reconstruction quality of 0.94 rFID and codebook usage of 97% on ImageNet benchmark. (2) A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on ImageNet 256x256 benchmarks, outperforming the popular diffusion models such as LDM, DiT. (3) A text-conditional image generation model with 775M parameters, from two-stage training on LAION-COCO and high aesthetics quality images, demonstrating competitive performance of visual quality and text alignment. (4) We verify the effectiveness of LLM serving frameworks in optimizing the inference speed of image generation models and achieve 326% - 414% speedup. We release all models and codes to facilitate open-source community of visual generation and multimodal foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces LlamaGen, a family of autoregressive image generation models that apply the standard next-token prediction paradigm from Llama-style transformers to visual tokens produced by a custom image tokenizer. The central claim is that vanilla autoregressive models without vision-specific inductive biases can achieve state-of-the-art performance on class-conditional ImageNet 256x256 generation (2.18 FID) when scaled up to 3.1B parameters, outperforming diffusion models such as LDM and DiT; the work also reports an improved tokenizer (downsample ratio 16, 0.94 rFID, 97% codebook usage), a 775M text-conditional model, and inference speedups of 326-414% via LLM serving frameworks.

Significance. If the empirical results hold under rigorous verification, the work would be significant for challenging the prevailing view that diffusion models are required for high-fidelity image synthesis and for supporting the feasibility of unified autoregressive multimodal models. The authors receive credit for releasing all models and code, which directly aids reproducibility, and for supplying concrete benchmark numbers on both tokenizer reconstruction and downstream generation.

major comments (3)
  1. [Abstract] Abstract: The claim that a standard Llama-style transformer without visual inductive biases can reach 2.18 FID rests on the assumption that the custom tokenizer (0.94 rFID) supplies nearly complete visual information. The manuscript does not report ablations that replace this tokenizer with a standard VQGAN-style tokenizer of higher rFID while keeping the AR backbone fixed; without such controls it is impossible to isolate whether the reported gains derive from AR scaling or from the tokenizer's design and fidelity.
  2. [Results section] Results section: The abstract states that a series of models from 111M to 3.1B parameters was trained and that scalability properties were reexamined, yet no scaling curves, per-size FID tables, or controlled ablations on training data quality are provided. This absence weakens the ability to verify that performance improves predictably with scale rather than with other unstated factors.
  3. [Methods] Methods: Full training hyperparameters, exact architectural modifications (if any) to the Llama backbone for discrete image tokens, and the precise evaluation protocol used for the 2.18 FID number (including guidance scale and sampling steps) are not detailed. These omissions are load-bearing for confirming that the model is truly vanilla and that comparisons with DiT and LDM are matched on compute and data.
minor comments (2)
  1. [Abstract] Abstract: The term 'original next-token prediction paradigm' should be clarified to indicate whether any vision-specific positional encodings or token embeddings were introduced.
  2. The manuscript should include a precise definition of rFID and codebook usage in the main text or a dedicated section, along with the exact ImageNet split used for tokenizer evaluation.
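
As an illustration of the kind of definition the second minor comment asks for, codebook usage is commonly taken to be the fraction of codebook entries selected at least once over an evaluation set. The sketch below assumes that definition; the paper may define it differently, and the token ids are made up for the example.

```python
def codebook_usage(token_ids, codebook_size):
    """Fraction of codebook entries selected at least once in token_ids."""
    if codebook_size <= 0:
        raise ValueError("codebook_size must be positive")
    # Ignore ids outside the codebook so the fraction stays in [0, 1].
    used = {t for t in token_ids if 0 <= t < codebook_size}
    return len(used) / codebook_size


# Hypothetical token ids gathered from tokenizing an evaluation set.
example_tokens = [0, 3, 3, 7, 1, 0, 5, 7]
usage = codebook_usage(example_tokens, codebook_size=8)
print(f"codebook usage: {usage:.1%}")
```

Under this definition, a 97% figure would mean that roughly 3% of the codebook entries were never selected on the evaluation images.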

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment below and will revise the manuscript to improve clarity, completeness, and reproducibility while preserving the core empirical claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that a standard Llama-style transformer without visual inductive biases can reach 2.18 FID rests on the assumption that the custom tokenizer (0.94 rFID) supplies nearly complete visual information. The manuscript does not report ablations that replace this tokenizer with a standard VQGAN-style tokenizer of higher rFID while keeping the AR backbone fixed; without such controls it is impossible to isolate whether the reported gains derive from AR scaling or from the tokenizer's design and fidelity.

    Authors: We acknowledge that a controlled ablation swapping our tokenizer for a standard VQGAN while holding the AR backbone fixed would provide stronger isolation of contributions. Our work treats the tokenizer as an integral part of reexamining design spaces for image generation, and the reported tokenizer (downsample ratio 16, 0.94 rFID, 97% codebook usage) is a deliberate improvement over prior VQGAN baselines. The central finding remains that a vanilla Llama-style AR model, paired with this tokenizer, scales to 2.18 FID. In revision we will add explicit discussion of this point and note the substantial compute required for additional full-scale ablations. revision: partial

  2. Referee: [Results section] Results section: The abstract states that a series of models from 111M to 3.1B parameters was trained and that scalability properties were reexamined, yet no scaling curves, per-size FID tables, or controlled ablations on training data quality are provided. This absence weakens the ability to verify that performance improves predictably with scale rather than with other unstated factors.

    Authors: The manuscript states that models spanning 111M to 3.1B parameters were trained and that scalability was reexamined. To make the scaling behavior fully verifiable, we will include explicit scaling curves and a per-model-size FID table in the revised results section. We will also expand the description of training data and any data-quality controls that were performed. revision: yes

  3. Referee: [Methods] Methods: Full training hyperparameters, exact architectural modifications (if any) to the Llama backbone for discrete image tokens, and the precise evaluation protocol used for the 2.18 FID number (including guidance scale and sampling steps) are not detailed. These omissions are load-bearing for confirming that the model is truly vanilla and that comparisons with DiT and LDM are matched on compute and data.

    Authors: We agree these details are essential. In the revised manuscript we will add a dedicated appendix containing the complete set of training hyperparameters, confirm that the Llama backbone receives only the minimal adaptation required to handle a discrete image-token vocabulary (no vision-specific inductive biases), and specify the exact classifier-free guidance scale and sampling steps used to obtain the reported 2.18 FID. These additions, together with the already-released code and models, will allow direct verification of the vanilla nature of the architecture and the fairness of the comparisons. revision: yes
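
The guidance scale mentioned in the third response refers to classifier-free guidance. A minimal sketch of the common formulation applied to next-token logits follows. It is not taken from the paper, the function name is hypothetical, and the scale actually used for the 2.18 FID result is not stated in this summary.

```python
def apply_cfg(cond_logits, uncond_logits, scale):
    """Blend conditional and unconditional logits with a guidance scale.

    scale=1.0 returns the conditional logits unchanged; larger values push
    the distribution further toward the condition.
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


# Hypothetical logits over a 4-entry codebook for one sampling step.
cond = [2.0, 0.5, -1.0, 0.0]
uncond = [1.0, 0.5, 0.0, 0.0]
guided = apply_cfg(cond, uncond, scale=2.0)
```

Because FID is sensitive to this scale, reporting it alongside the sampling settings is what makes a comparison with DiT and LDM checkable.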

Circularity Check

0 steps flagged

No circularity: empirical benchmarks on external datasets

Full rationale

The paper reports training Llama-style autoregressive transformers on images tokenized by a custom VQ-style encoder and measures class-conditional performance via FID on ImageNet 256x256; LAION-COCO serves as training data for the text-conditional model. All central numbers (2.18 FID, 0.94 rFID, parameter counts, speedups) are direct outputs of model training and evaluation against public benchmarks. No equations, predictions, or uniqueness claims reduce by construction to fitted parameters inside the paper; the tokenizer design space is explored empirically rather than assumed. This is a standard scaling experiment whose claims remain falsifiable by independent replication.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The work rests on standard transformer assumptions and empirical choices for tokenizer design and data curation rather than new theoretical axioms or invented entities.

free parameters (2)
  • tokenizer downsample ratio
    Set to 16 to trade off sequence length against reconstruction quality; directly affects the autoregressive modeling difficulty.
  • model parameter counts
    111M to 3.1B chosen to demonstrate scaling behavior.
axioms (1)
  • domain assumption: A standard transformer decoder can model image token sequences without additional visual inductive biases
    Invoked when claiming that vanilla Llama suffices once the tokenizer is adequate.
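
The trade-off named for the downsample ratio in the ledger above can be made concrete with a little arithmetic. The sketch below assumes square images and one token per downsampled grid cell; the helper name and the alternative ratios of 8 and 32 are illustrative only.

```python
def token_count(image_size, downsample_ratio):
    """Number of discrete tokens for a square image at a given ratio."""
    if image_size % downsample_ratio != 0:
        raise ValueError("image_size must be divisible by downsample_ratio")
    side = image_size // downsample_ratio
    return side * side


for ratio in (8, 16, 32):
    n = token_count(256, ratio)
    # Self-attention compares every token pair, so cost grows with n squared.
    print(f"ratio {ratio:>2}: {n:>4} tokens, {n * n:>7} attention pairs")
```

Halving the ratio quadruples the sequence length, while doubling it leaves each token responsible for four times as many pixels, which is the tension the ratio of 16 is chosen to balance.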

pith-pipeline@v0.9.0 · 5570 in / 1234 out tokens · 59895 ms · 2026-05-11T22:03:28.156773+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly

  • IndisputableMonolith.Foundation.DimensionForcing dimension_forced unclear

    Relation between the paper passage and the cited Recognition theorem.

    A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on ImageNet 256x256 benchmarks, outperforming the popular diffusion models such as LDM, DiT

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 45 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...

  2. ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.

  3. Normalizing Trajectory Models

    cs.CV 2026-05 unverdicted novelty 7.0

    NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.

  4. Normalizing Trajectory Models

    cs.CV 2026-05 unverdicted novelty 7.0

    NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.

  5. Autoregressive Visual Generation Needs a Prologue

    cs.CV 2026-05 unverdicted novelty 7.0

    Prologue introduces dedicated prologue tokens to decouple generation and reconstruction in AR visual models, significantly improving generation FID scores on ImageNet while maintaining reconstruction quality.

  6. BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

    cs.SD 2026-04 unverdicted novelty 7.0

    BEAT tokenizes symbolic music by uniform beat steps with sparse per-beat pitch encodings, producing higher quality and more coherent music continuation and accompaniment than event-based tokenizations.

  7. Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Masked Logit Nudging aligns visual autoregressive model logits with source token maps under target prompts inside cross-attention masks, delivering top image editing results on PIE benchmarks and strong reconstruction...

  8. A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

    cs.CV 2026-04 conditional novelty 7.0

    Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.

  9. Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting

    cs.CV 2026-03 unverdicted novelty 7.0

    Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.

  10. Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

    cs.CV 2026-03 unverdicted novelty 7.0

    SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

  11. WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.

  12. HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

    cs.CV 2026-05 unverdicted novelty 6.0

    HeatKV ranks attention heads by their focus on prior scales using offline calibration data and applies a static per-head pruning schedule, delivering 2x higher KV-cache compression than prior methods on the Infinity-2...

  13. InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

    cs.CV 2026-05 conditional novelty 6.0

    InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.

  14. Do multimodal models imagine electric sheep?

    cs.CV 2026-05 conditional novelty 6.0

    Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.

  15. FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    FlashAR achieves up to 22.9x speedup in 512x512 autoregressive image generation by post-training a pre-trained model with a complementary vertical head and dynamic fusion using only 0.05% of original training data.

  16. FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    FlashAR accelerates autoregressive image generation up to 22.9x by post-training a pre-trained raster-scan model with a complementary vertical head and dynamic fusion for two-way next-token prediction.

  17. dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models

    cs.LG 2026-05 unverdicted novelty 6.0

    dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.

  18. CASCADE: Context-Aware Relaxation for Speculative Image Decoding

    cs.CV 2026-05 unverdicted novelty 6.0

    CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...

  19. MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

    cs.CV 2026-05 unverdicted novelty 6.0

    MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.

  20. End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

    cs.CV 2026-05 unverdicted novelty 6.0

    An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.

  21. VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

    cs.CV 2026-04 unverdicted novelty 6.0

    VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.

  22. PILOT: One Physics-Integrated Generation Framework to Unify 2D and 3D Radio Map Construction

    eess.SP 2026-04 unverdicted novelty 6.0

    PILOT unifies 2D and 3D radio map generation via physics-guided wavefront autoregressive prediction, reporting lowest NMSE on 2D benchmarks and 78% NMSE reduction with 2500x faster inference than diffusion baselines for 3D.

  23. Normalizing Flows with Iterative Denoising

    cs.CV 2026-04 unverdicted novelty 6.0

    iTARFlow augments normalizing flows with diffusion-style iterative denoising during sampling while preserving end-to-end likelihood training, reaching competitive results on ImageNet 64/128/256.

  24. Generative Refinement Networks for Visual Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

  25. Combating Pattern and Content Bias: Adversarial Feature Learning for Generalized AI-Generated Image Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    MAFL uses adversarial training to suppress pattern and content biases, guiding models to learn shared generative features for better cross-model generalization in detecting AI images.

  26. On the Robustness of Watermarking for Autoregressive Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Watermarking schemes for autoregressive image generation fail against removal and forgery attacks, enabling false detections and undermining synthetic content filtering.

  27. Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression

    cs.CV 2026-04 unverdicted novelty 6.0

    RDVQ enables joint rate-distortion optimization for vector-quantized generative image compression via differentiable codebook distribution relaxation and an autoregressive entropy model.

  28. SMART: When is it Actually Worth Expanding a Speculative Tree?

    cs.DC 2026-04 unverdicted novelty 6.0

    SMART uses marginal benefit-cost analysis to dynamically build efficient speculative trees, achieving 15-20% additional speedup in LLM and MLLM inference.

  29. TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

    cs.CV 2026-04 unverdicted novelty 6.0

    TC-AE improves reconstruction and generative performance in deep compression by decomposing token-to-latent compression into two stages and using joint self-supervised training.

  30. Multimodal Large Language Models for Multi-Subject In-Context Image Generation

    cs.LG 2026-04 unverdicted novelty 6.0

    MUSIC is the first MLLM for multi-subject in-context image generation that uses an automatic data pipeline, vision chain-of-thought reasoning, and semantics-driven spatial layout planning to outperform prior methods o...

  31. MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.

  32. ImgEdit: A Unified Image Editing Dataset and Benchmark

    cs.CV 2025-05 conditional novelty 6.0

    ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.

  33. FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

    cs.CV 2025-05 conditional novelty 6.0

    FSDrive uses a generated future scene frame as visual spatio-temporal CoT to improve VLA models for safer autonomous driving trajectory prediction.

  34. MMaDA: Multimodal Large Diffusion Language Models

    cs.CV 2025-05 unverdicted novelty 6.0

    MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-im...

  35. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  36. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  37. Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.

  38. UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.

  39. From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models

    cs.CL 2026-03 unverdicted novelty 5.0

    The paper supplies a unified definition based on data flow and dynamic interaction plus a systematic taxonomy to organize fragmented work on streaming large language models.

  40. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    cs.CV 2024-08 unverdicted novelty 5.0

    Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.

  41. DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 4.0

    DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.

  42. Show-o2: Improved Native Unified Multimodal Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

  43. LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

    cs.CV 2026-04 unverdicted novelty 3.0

    This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...

  44. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    cs.AI 2025-01 conditional novelty 3.0

    Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

  45. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 43 Pith papers · 28 internal anchors

  1. [1]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report.arXiv preprint arXiv:2305.10403,

  2. [2]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023a. Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, and Alexei A Efros. Sequential modeling enables scalable learning for large ...

  3. [3]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954,

  4. [4]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096,

  5. [5]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

    URL https://openai.com/research/ video-generation-models-as-world-simulators . Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901,

  6. [6]

    Muse: Text-to-image generation via masked generative transformers

    Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704,

  7. [7]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023a. Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart: Fast tr...

  8. [8]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. IEEE,

  9. [9]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,

  10. [10]

    Eva-02: A visual representation for neon genesis

    Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis. arXiv preprint arXiv:2303.11331,

  11. [11]

    Making llama see and draw with seed tokenizer

    Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer. arXiv preprint arXiv:2310.01218,

  12. [12]

    Scaling Laws for Autoregressive Generative Modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701,

  13. [13]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,

  14. [14]

    Training Compute-Optimal Large Language Models

    Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 23(1):2249–2281, 2022a. Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fide...

  15. [15]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

  16. [16]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114,

  17. [17]

    Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245,

  18. [18]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer,

  19. [19]

    Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022a. Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified model for ...

  20. [20]

    Groma: Localized visual tokenization for grounding multimodal large language models

    Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. arXiv preprint arXiv:2404.13013,

  21. [21]

    Generating images with sparse representations

    Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. Generating images with sparse representations. arXiv preprint arXiv:2103.03841,

  22. [22]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741,

  23. [23]

    GPT-4 Technical Report

    OpenAI. Consistency decoder. https://github.com/openai/consistencydecoder, 2023a. OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023b. OpenLM-Research. Openllama 3b. https://huggingface.co/openlm-research/open_llama_3b,

  24. [24]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824,

  25. [25]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952,

  26. [26]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3,

  27. [27]

    Stylegan-xl: Scaling stylegan to large diverse datasets

    Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan-xl: Scaling stylegan to large diverse datasets. In ACM SIGGRAPH 2022 conference proceedings, pp. 1–10,

  28. [28]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202,

  29. [29]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053,

  30. [30]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502,

  31. [31]

    Emu: Generative pretraining in multimodality

    Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023a. Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative...

  32. [32]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

  33. [33]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint arXiv:2404.02905,

  34. [34]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay ...

  35. [35]

    Emergent Abilities of Large Language Models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682,

  36. [36]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100,

  37. [37]

    Raphael: Text-to-image generation via large mixture of diffusion paths

    Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, and Ping Luo. Raphael: Text-to-image generation via large mixture of diffusion paths. arXiv preprint arXiv:2305.18295,

  38. [38]

    Baichuan 2: Open large-scale language models

    Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305,

  39. [39]

    Vector-quantized image modeling with improved vqgan

    Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627,

  40. [40]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5,

  41. [41]

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10459–10469, 2023a. Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, K...

  42. [42]

    Gpt4roi: Instruction tuning large language model on region-of-interest

    Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, and Ping Luo. Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv preprint arXiv:2307.03601,

  43. [43]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068,

  44. [44]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277,

  45. [45]

    The dining room is decorated with elegant decor. The new tab in Powerpoint is highlighted. The card is attached to an external PCI

    An abstract painting on a pillow with pink, yellow and red flowers. The dining room is decorated with elegant decor. The new tab in Powerpoint is highlighted. The card is attached to an external PCI. A large building with columns and a clock tower. Two lions are laying under a tree in the wild. An assortment of party decorations with owls and other items. an...

  46. [46]

    Describe this image in as much detail as possible

    Training stage II: 10M internal high aesthetic quality images. Each image is provided a long caption by LLaVA [Liu et al. 2024] using the prompt of “Describe this image in as much detail as possible”. Some examples are shown in Figure