Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-13 20:01 UTC · model grok-4.3
The pith
A new tokenizer lets language models outperform diffusion models on image and video generation benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Equipped with the MAGVIT-v2 tokenizer, large language models outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. The tokenizer generates concise and expressive tokens for both videos and images using a common token vocabulary and also surpasses prior video tokenizers on compression and representation learning for action recognition.
What carries the argument
MAGVIT-v2, a tokenizer that maps pixel inputs to discrete tokens using a shared vocabulary for images and videos.
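To make the tokenizer's role concrete, here is a minimal, hypothetical sketch of the interface such a tokenizer exposes: pixels in, discrete token ids from one shared vocabulary out, for both a single image and a video clip. The names (ToyTokenizer, codebook_size, patch_dim) are illustrative only and do not reflect the MAGVIT-v2 architecture, which uses a learned encoder and lookup-free quantization rather than nearest-neighbor search.

```python
# A minimal sketch, not the authors' implementation: a toy vector-quantizing
# tokenizer whose single codebook (shared vocabulary) serves both image and
# video patches. ToyTokenizer, codebook_size, and patch_dim are hypothetical.
import numpy as np


class ToyTokenizer:
    def __init__(self, codebook_size=256, patch_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        # One codebook for everything: rows are codewords, indices are tokens.
        self.codebook = rng.normal(size=(codebook_size, patch_dim))

    def encode(self, patches):
        # patches: (num_patches, patch_dim) flattened pixel patches.
        # Assign each patch to its nearest codeword -> one discrete token each.
        d2 = ((patches[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return d2.argmin(axis=1)

    def decode(self, token_ids):
        # Map token ids back to approximate patch vectors.
        return self.codebook[token_ids]


tok = ToyTokenizer()
image_patches = np.random.rand(64, 16)        # one image as 64 patches
clip_patches = np.random.rand(8 * 64, 16)     # an 8-frame clip as 512 patches
print(tok.encode(image_patches).shape)        # (64,)  -- same vocabulary
print(tok.encode(clip_patches).shape)         # (512,) -- same vocabulary
```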
If this is right
- Language models become the stronger base architecture for both image and video generation once tokenization is addressed.
- A single tokenizer supports high-quality generation across static images and dynamic video sequences.
- Video compression reaches human-judged parity with next-generation codecs.
- Token sequences from the tokenizer produce effective representations for downstream tasks such as action recognition.
Where Pith is reading between the lines
- Future scaling of language models may widen the advantage over diffusion models without requiring architecture-specific changes.
- The same tokenizer could be tested in hybrid pipelines that combine language-model generation with other visual modules.
- If token quality dominates, similar gains might appear when the tokenizer is applied to language-model variants trained on different objectives.
Load-bearing premise
Performance differences arise mainly from the tokenizer rather than from differences in model scale, training data, optimization, or evaluation protocols between the language-model and diffusion baselines.
What would settle it
A matched experiment in which diffusion models and language models are trained at identical scale, on identical data, and with the new tokenizer would settle it: no quality gap, or a reversal in favor of diffusion models, would falsify the claim.
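As a sketch of what such a settling experiment could look like, the snippet below lays out a matched grid over backbones and tokenizers with scale, data, and optimization held fixed. All names (run_matched_ablation, train_generator, evaluate_fid, the tokenizer and backbone labels) are hypothetical placeholders, not functions or settings from the paper.

```python
# Hypothetical sketch of the matched experiment described above: identical
# backbone scale, data, and optimization budget, with only the tokenizer
# swapped. Every name and setting here is an illustrative placeholder.

TOKENIZERS = ["magvit_v2", "magvit_v1", "vqgan"]
BACKBONES = ["autoregressive_llm", "diffusion"]
FIXED = dict(params="300M", dataset="imagenet_256", steps=500_000, lr=1e-4)


def run_matched_ablation(train_generator, evaluate_fid):
    """Train every (backbone, tokenizer) pair under the same fixed budget.

    If the tokenizer is truly the key factor, the LLM backbone should win
    only when paired with the new tokenizer; no gap, or a reversal in favor
    of diffusion, would falsify the claim.
    """
    results = {}
    for backbone in BACKBONES:
        for tokenizer in TOKENIZERS:
            model = train_generator(backbone=backbone, tokenizer=tokenizer, **FIXED)
            results[(backbone, tokenizer)] = evaluate_fid(model)
    return results
```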
Original abstract
While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VVC) according to human evaluations, and (2) learning effective representations for action recognition tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MAGVIT-v2, a visual tokenizer that maps images and videos to concise discrete tokens from a shared vocabulary. Equipped with this tokenizer, the authors claim that large language models outperform diffusion models on image generation (ImageNet) and video generation (Kinetics) benchmarks. The tokenizer is additionally shown to surpass prior video tokenizers on video compression (human-evaluated as comparable to the VVC codec) and on learning representations for action recognition.
Significance. If the central empirical claim holds after isolating the tokenizer contribution, the result would be significant: it would demonstrate that discrete tokenization improvements can allow autoregressive LLMs to surpass diffusion models on standard visual generation benchmarks, supporting a unified LLM-based approach to multimodal generation. The tokenizer's reported utility for compression and representation learning further adds practical value.
major comments (1)
- [Sections 4 and 5] The benchmark comparisons of the LLM (with MAGVIT-v2) against diffusion baselines (e.g., ADM, DiT variants) lack tokenizer-swap ablations that train the identical LLM backbone with prior tokenizers such as MAGVIT-v1 or VQGAN under matched scale, data, and optimization conditions. Without these controls, performance gains cannot be confidently attributed to the tokenizer rather than to differences in model capacity or training regime, which directly undermines the title claim that the tokenizer is key.
minor comments (2)
- [Abstract] The abstract states the central outperformance result without any quantitative metrics, specific baseline names, model sizes, or training details; these should be added for immediate evaluability.
- [Section on compression evaluation] The human evaluation protocol for video compression (claimed comparable to VVC) requires additional details on rater instructions, number of samples, and statistical significance to support reproducibility; a sketch of one such significance check follows below.
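On the second minor comment, the sketch below shows one way a human-evaluation claim of parity could be accompanied by a significance check: an exact two-sided sign test over paired preferences. The preference counts in the example are invented for illustration; only the arithmetic of the test is standard.

```python
# A minimal sketch of a significance check for paired human preferences:
# an exact two-sided sign test (ties dropped). The counts below are invented
# placeholders; only the test itself is standard.
from math import comb


def two_sided_sign_test(prefer_a, prefer_b):
    """Exact two-sided sign test on paired preference counts."""
    n = prefer_a + prefer_b
    k = min(prefer_a, prefer_b)
    # Under H0 each pair is a fair coin flip: p = 2 * P(X <= k), X ~ Bin(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcome: of 100 rated clip pairs, 58 preferences for the
# tokenizer's reconstructions and 42 for the codec's.
print(f"p = {two_sided_sign_test(58, 42):.3f}")  # ~0.13, i.e., not significant
```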
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to clarify the contributions of our work. We address the major comment point-by-point below.
Point-by-point responses
- Referee: [Sections 4 and 5] The benchmark comparisons of the LLM (with MAGVIT-v2) against diffusion baselines (e.g., ADM, DiT variants) lack tokenizer-swap ablations that train the identical LLM backbone with prior tokenizers such as MAGVIT-v1 or VQGAN under matched scale, data, and optimization conditions. Without these controls, performance gains cannot be confidently attributed to the tokenizer rather than to differences in model capacity or training regime, which directly undermines the title claim that the tokenizer is key.
  Authors: We agree that matched tokenizer-swap ablations on the identical LLM backbone would provide the strongest isolation of the tokenizer's contribution. In the manuscript, our primary comparisons are against published diffusion models (ADM, DiT) that use their own training regimes and often implicit or different tokenization strategies, making exact controls challenging. We do demonstrate MAGVIT-v2's superiority over MAGVIT-v1 and VQGAN on video compression and action recognition using matched backbones and data, which supports the tokenizer as a key enabler. Full-scale LLM retraining with prior tokenizers under identical conditions was not performed due to prohibitive compute costs, but we have added explicit discussion of this limitation and the supporting evidence from auxiliary tasks in the revision. (Revision status: partial.)
Circularity Check
No circularity: empirical benchmark comparison with no self-referential derivation or fitted predictions.
full rationale
The paper introduces MAGVIT-v2 as a new video/image tokenizer and reports empirical results showing LLMs equipped with it outperform published diffusion baselines on ImageNet and Kinetics. No equations, first-principles derivations, or predictions are presented that reduce by construction to the paper's own inputs, fitted parameters, or self-citations. The central claim rests on external benchmark numbers rather than any internal definitional loop, self-citation chain, or renaming of known results. This is a standard empirical architecture paper whose validity hinges on experimental controls, not on tautological reasoning.
Axiom & Free-Parameter Ledger
invented entities (1)
- MAGVIT-v2 tokenizer (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitabilitybilinear_family_forced (tag: unclear)
  The relation between this paper passage and the cited Recognition theorem is unclear.
  Paper passage: "A novel lookup-free quantization approach that enables improving the visual generation quality of language models by learning a large vocabulary."
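For readers unfamiliar with the mechanism quoted in that passage, here is a hedged sketch of the lookup-free quantization idea: each latent dimension is quantized independently to its sign, so the vocabulary is implicitly the set of all sign patterns (size 2^d) and no codebook lookup is required. Dimensions and function names below are illustrative, not the paper's configuration.

```python
# A hedged sketch of lookup-free quantization as described in the passage:
# quantize each latent dimension to its sign, so the implicit vocabulary is
# every sign pattern (2**d codes) and no codebook lookup is needed.
# Shapes and names here are illustrative, not the paper's configuration.
import numpy as np


def lfq_encode(latents):
    # latents: (n, d) real-valued vectors; returns integer token ids in [0, 2**d).
    bits = (latents > 0).astype(np.int64)       # sign of each dimension -> {0, 1}
    powers = 2 ** np.arange(latents.shape[1])   # binary weight of each dimension
    return bits @ powers


def lfq_decode(token_ids, d):
    # Recover the quantized vector of -1/+1 entries from the integer token id.
    bits = (token_ids[:, None] >> np.arange(d)) & 1
    return 2.0 * bits - 1.0


z = np.random.randn(4, 10)                      # 4 latents, 10 dims -> 1024 codes
ids = lfq_encode(z)
print(ids, lfq_decode(ids, d=10).shape)         # token ids and a (4, 10) decode
```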
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 25 Pith papers
- PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
  PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
- Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models
  Masked Logit Nudging aligns visual autoregressive model logits with source token maps under target prompts inside cross-attention masks, delivering top image editing results on PIE benchmarks and strong reconstruction...
- ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation
  ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
  Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
- InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
  InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.
- Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation
  Yeti is a compact tokenizer for protein structures that delivers strong codebook use, token diversity, and reconstruction while enabling from-scratch multimodal generation of plausible sequences and structures with 10...
- CASCADE: Context-Aware Relaxation for Speculative Image Decoding
  CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
- MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
  MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
- Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models
  VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency up to 5.6x in VLMs by selecting keyframes that differ from priors generated by a text-conditi...
- End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
  An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.
- VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
  VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.
- dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
  A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
- Latent-Compressed Variational Autoencoder for Video Diffusion Models
  A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.
- ELT: Elastic Looped Transformers for Visual Generation
  Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
- ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving
  ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.
- MMaDA: Multimodal Large Diffusion Language Models
  MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-im...
- CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
  CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
  CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
- Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
  Seed-TTS models produce speech matching human naturalness and speaker similarity, with added controllability via self-distillation and reinforcement learning.
- Co-Generative De Novo Functional Protein Design
  CodeFP jointly generates protein sequences and structures using functional local structures and auxiliary supervision, yielding 6.1% better functional consistency and 3.2% better foldability than prior baselines.
- Open-Sora: Democratizing Efficient Video Production for All
  Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tas...
- Movie Gen: A Cast of Media Foundation Models
  A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
  Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.
- Seedance 1.0: Exploring the Boundaries of Video Generation Models
  Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and multi-shot consistency compared to prior models.
- Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
  The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.