OmniGen-AR: AutoRegressive Any-to-Image Generation

Junke Wang; Peize Sun; Qiushan Guo; Weilin Huang; Xun Wang; Yu-Gang Jiang; Zuxuan Wu

arxiv: 2606.09156 · v1 · pith:7TDFPHWUnew · submitted 2026-06-08 · 💻 cs.CV

OmniGen-AR: AutoRegressive Any-to-Image Generation

Junke Wang , Xun Wang , Qiushan Guo , Peize Sun , Weilin Huang , Zuxuan Wu , Yu-Gang Jiang This is my paper

Pith reviewed 2026-06-27 17:03 UTC · model grok-4.3

classification 💻 cs.CV

keywords autoregressive image generationany-to-imagevisual tokenizerdisentangled causal attentionconditional generationmultimodal image synthesistext-to-video

0 comments

The pith

A single autoregressive model generates images from text, depth, segmentation, or prior images using one shared visual tokenizer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that diverse visual conditions can be unified in one autoregressive generation system by discretizing them all through a shared visual tokenizer alongside text prompts. It introduces Disentangled Causal Attention to block information leakage from condition tokens to content tokens during training. If the approach holds, real-world image synthesis tasks that now require separate models could be handled by a single architecture without changes at inference time. A sympathetic reader would care because this design targets flexible control over outputs from whatever signals are available, such as spatial maps or video frames.

Core claim

OmniGen-AR is presented as a unified autoregressive framework for Any-to-Image generation. By discretizing various visual conditions through a shared visual tokenizer and text prompts with a text tokenizer, the model supports a broad spectrum of conditional inputs within a single model, including text-to-image generation, segmentation-to-image, depth-to-image, image editing, frame prediction, and text-to-video generation. Disentangled Causal Attention separates the full-sequence causal mask into condition causal attention and content causal attention, serving as a training-time regularizer without affecting standard next-token prediction during inference.

What carries the argument

Disentangled Causal Attention (DCA), which separates the causal mask into condition and content parts to prevent leakage while acting only as a training regularizer.

If this is right

One model architecture performs text-to-image, spatial-signal-to-image, and visual-context tasks without retraining or separate heads.
Disentangled Causal Attention keeps inference unchanged while regularizing training against condition leakage.
The design extends naturally to frame prediction for video and to image editing from visual context.
Competitive or leading results appear across text-to-image and video benchmarks under the unified setup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The uniform tokenization could allow a single generation pass to accept mixed conditions, such as text plus depth, without extra engineering.
If the tokenizer generalizes, the same approach might apply to additional modalities like sketches or 3D projections in future extensions.
Simpler deployment pipelines become possible when one checkpoint replaces several task-specific generators.

Load-bearing premise

A shared visual tokenizer can discretize diverse condition types such as depth maps and segmentation while retaining enough detail for accurate image synthesis.

What would settle it

Generate an image from a provided depth map condition and check whether the output depth structure matches the input map within a small error margin; systematic mismatch would show the tokenizer or attention failed to preserve the signal.

Figures

Figures reproduced from arXiv: 2606.09156 by Junke Wang, Peize Sun, Qiushan Guo, Weilin Huang, Xun Wang, Yu-Gang Jiang, Zuxuan Wu.

**Figure 2.** Figure 2: OmniGen-AR consists of a text tokenizer, a visual tokenizer, an autoregressive transformer. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Comparison between plain causal attention, the proposed disentangled causal attention, and [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Token similarity on MagicBrush [105] [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Effects of model scaling. Synergy between different tasks. We study the synergy between different generation tasks by comparing joint training and separate training, both initialized with the 2nd stage checkpoint. As shown in [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Visualization of text-to-image and text-to-video results generated by OmniGen-AR. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Image editing, depth-to-image, and segmentation-to-image generation results. [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Failure cases generated by OmniGen-AR. potential directions for future work: 1) Scaling up the model and training data to build a stronger base model with improved generalization and instruction-following ability across diverse visual tasks. 2) Leveraging chain-of-thought (CoT) [90] to improve the reasoning ability on complex prompts. 5 Conclusion and Broader Impacts This paper presented OmniGen-AR, a unif… view at source ↗

read the original abstract

Autoregressive (AR) models have demonstrated strong potential in visual generation, offering superior performance with simple architectures and optimization objectives. However, existing methods are typically limited to single-modality conditions, e.g., text, restricting their applicability in real-world scenarios that demand image synthesis from diverse controls. In this work, we present OmniGen-AR, a unified autoregressive framework for Any-to-Image generation. By discretizing various visual conditions through a shared visual tokenizer and text prompts with a text tokenizer, OmniGen-AR supports a broad spectrum of conditional inputs within a single model, including text (text-to-image generation), spatial signals (segmentation-to-image and depth-to-image), and visual context (image editing, frame prediction, and text-to-video generation). To mitigate the risk of information leakage from condition tokens to content tokens, we introduce Disentangled Causal Attention (DCA), which separates the full-sequence causal mask into condition causal attention and content causal attention. It serves as a training-time regularizer without affecting the standard next-token prediction during inference. With this design, OmniGen-AR achieves new state-of-the-art or at least competitive results across a range of benchmark, e.g., 0.63 on GenEval and 80.02 on VBench, demonstrating its effectiveness in flexible and high-fidelity visual generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Shared visual tokenizer plus DCA is the main new piece, but no tokenizer metrics or DCA ablations means the cross-modality claims rest on unverified assumptions.

read the letter

OmniGen-AR puts forward one autoregressive model that handles text-to-image, depth-to-image, segmentation-to-image, image editing, frame prediction, and text-to-video by feeding all the visual conditions through a single shared visual tokenizer and using text tokens for prompts. The new technical move is Disentangled Causal Attention, which splits the causal mask so condition tokens attend only among themselves and content tokens attend to both, acting as a training regularizer that disappears at inference.

The unification goal is clear and the architecture stays simple, which is a plus if it scales. The reported numbers (0.63 GenEval, 80.02 VBench) would be competitive if the experiments are clean.

The soft spot is exactly where the stress-test note points: the abstract supplies no per-modality reconstruction metrics for the shared tokenizer, no codebook-size study, and no ablation of DCA versus ordinary causal attention. Those are the parts that have to carry the argument, so the performance claims cannot be evaluated yet.

The paper is for people already working on autoregressive visual generation who want fewer separate models. A reader who follows the recent AR image papers will see the intended contribution immediately.

I would bring the full version to a reading group once the tokenizer and DCA checks are added. It is worth sending to peer review because the idea is testable and the unification target matters, even though the current evidence is thin on the central mechanisms.

Referee Report

2 major / 1 minor

Summary. The paper presents OmniGen-AR, a unified autoregressive framework for any-to-image generation. It discretizes diverse visual conditions (depth, segmentation, video frames) via a shared visual tokenizer and text via a text tokenizer, enabling a single model to handle text-to-image, segmentation-to-image, depth-to-image, image editing, frame prediction, and text-to-video. Disentangled Causal Attention (DCA) is introduced to separate condition and content causal masks during training as a regularizer that leaves standard inference unchanged. The work claims new state-of-the-art or competitive results on benchmarks including 0.63 on GenEval and 80.02 on VBench.

Significance. If the experimental claims hold after proper validation, the approach would offer a meaningful unification of conditional visual generation tasks under a single autoregressive architecture, reducing the need for modality-specific models while maintaining high fidelity.

major comments (2)

Abstract: the benchmark scores (0.63 on GenEval, 80.02 on VBench) are asserted without any description of experimental protocol, baselines, training details, or implementation of the shared visual tokenizer and DCA, rendering the performance claims unverifiable from the supplied text.
Method section (DCA and shared tokenizer): the central claims rest on the assumptions that a single visual tokenizer losslessly discretizes depth maps, segmentation masks, and video frames into a shared codebook and that DCA prevents leakage without side-effects, yet no per-modality reconstruction metrics, codebook analysis, or ablation comparing DCA to standard causal attention are reported.

minor comments (1)

Notation for the condition causal attention mask versus content causal attention mask should be defined explicitly with an equation or diagram to avoid ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below and outline concrete revisions to improve clarity and completeness of the experimental claims.

read point-by-point responses

Referee: Abstract: the benchmark scores (0.63 on GenEval, 80.02 on VBench) are asserted without any description of experimental protocol, baselines, training details, or implementation of the shared visual tokenizer and DCA, rendering the performance claims unverifiable from the supplied text.

Authors: We agree that the abstract, as currently written, is too terse to allow verification of the reported numbers. The full manuscript contains the experimental protocol, baselines, and implementation details in Sections 4 and 5, but these are not referenced or summarized in the abstract. In the revised version we will add one concise sentence to the abstract that states the training data scale, the primary baselines, and that all results use the shared tokenizer and DCA described in Section 3. revision: yes
Referee: Method section (DCA and shared tokenizer): the central claims rest on the assumptions that a single visual tokenizer losslessly discretizes depth maps, segmentation masks, and video frames into a shared codebook and that DCA prevents leakage without side-effects, yet no per-modality reconstruction metrics, codebook analysis, or ablation comparing DCA to standard causal attention are reported.

Authors: The manuscript describes the shared visual tokenizer (Section 3.2) and DCA (Section 3.3) but does not include the requested quantitative validation. We will add a new subsection (or appendix) containing: (i) per-modality reconstruction metrics (PSNR/SSIM/FID) for depth, segmentation, and video frames, (ii) codebook utilization statistics, and (iii) an ablation table comparing DCA against standard causal attention on GenEval and a subset of VBench tasks. These additions will directly test the lossless-discretization and no-leakage assumptions. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on external benchmark scores

full rationale

The paper presents architectural choices (shared visual tokenizer, DCA) whose value is asserted via empirical results on independent benchmarks (GenEval 0.63, VBench 80.02). No equations, self-definitions, or fitted parameters are shown reducing to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems. The derivation chain is therefore self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Based solely on the abstract, the paper relies on standard autoregressive modeling assumptions and introduces DCA and shared tokenization as key new elements; no explicit free parameters are named.

axioms (2)

domain assumption Autoregressive next-token prediction remains effective when inputs from multiple modalities are discretized into a shared token space.
The framework extends prior AR visual generation to any-to-image by assuming tokenization unifies conditions without major information loss.
ad hoc to paper Separating condition causal attention from content causal attention during training prevents leakage while leaving standard inference unchanged.
DCA is presented as a training-time regularizer whose correctness is asserted without further justification in the abstract.

invented entities (2)

Disentangled Causal Attention (DCA) no independent evidence
purpose: Separate full-sequence causal mask into condition and content attention to mitigate information leakage.
New mechanism introduced in the paper.
Shared visual tokenizer no independent evidence
purpose: Discretize diverse visual conditions (depth, segmentation, frames) into tokens compatible with text tokenizer.
Core component enabling the unified any-to-image capability.

pith-pipeline@v0.9.1-grok · 5783 in / 1649 out tokens · 31612 ms · 2026-06-27T17:03:34.239161+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

111 extracted references · 29 linked inside Pith

[1]

Abdin, J

M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

Pith/arXiv arXiv 2024
[2]

Achiam, S

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023
[3]

J. Bai, S. Bai, Y . Chu, Z. Cui, K. Dang, X. Deng, Y . Fan, W. Ge, Y . Han, F. Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Pith/arXiv arXiv 2023
[4]

Bolya, P.-Y

D. Bolya, P.-Y . Huang, P. Sun, J. H. Cho, A. Madotto, C. Wei, T. Ma, J. Zhi, J. Rajasegaran, H. Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network.arXiv preprint arXiv:2504.13181, 2025

Pith/arXiv arXiv 2025
[5]

Brooks, A

T. Brooks, A. Holynski, and A. A. Efros. Instructpix2pix: Learning to follow image editing instructions. InCVPR, 2023

2023
[6]

Carreira, E

J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman. A short note about kinetics-600.arXiv preprint arXiv:1808.01340, 2018

Pith/arXiv arXiv 2018
[7]

Changpinyo, P

S. Changpinyo, P. Sharma, N. Ding, and R. Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. InCVPR, 2021

2021
[8]

H. Chen, Y . Zhang, X. Cun, M. Xia, X. Wang, C. Weng, and Y . Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. InCVPR, 2024

2024
[9]

J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y . Wu, Z. Wang, J. Kwok, P. Luo, H. Lu, et al. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis.arXiv preprint arXiv:2310.00426, 2023

Pith/arXiv arXiv 2023
[10]

M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever. Generative pretraining from pixels. InICML, 2020

2020
[11]

T.-S. Chen, A. Siarohin, W. Menapace, E. Deyneka, H.-w. Chao, B. E. Jeon, Y . Fang, H.-Y . Lee, J. Ren, M.-H. Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. InCVPR, 2024

2024
[12]

Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InCVPR, 2024

2024
[13]

Dhariwal and A

P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. InNeurIPS, 2021

2021
[14]

Esser, S

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICLR, 2024

2024
[15]

Esser, R

P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. InCVPR, 2021

2021
[16]

N. et. al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Pith/arXiv arXiv 2025
[17]

Z. Fu, W. Lam, A. M.-C. So, and B. Shi. A theoretical analysis of the repetition problem in text generation. InAAAI, 2021

2021
[18]

S. Ge, T. Hayes, H. Yang, X. Yin, G. Pang, D. Jacobs, J.-B. Huang, and D. Parikh. Long video generation with time-agnostic vqgan and time-sensitive transformer. InECCV, 2022

2022
[19]

Y . Ge, S. Zhao, C. Li, Y . Ge, and Y . Shan. Seed-data-edit technical report: A hybrid dataset for instructional image editing.arXiv preprint arXiv:2405.04007, 2024. 10

arXiv 2024
[20]

Ghosh, H

D. Ghosh, H. Hajishirzi, and L. Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. InNeurIPS, 2024

2024
[21]

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio. Generative adversarial nets. InNeurIPS, 2014

2014
[22]

Y . Guo, C. Yang, A. Rao, Z. Liang, Y . Wang, Y . Qiao, M. Agrawala, D. Lin, and B. Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

Pith/arXiv arXiv 2023
[23]

J. Han, J. Liu, Y . Jiang, B. Yan, Y . Zhang, Z. Yuan, B. Peng, and X. Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. InCVPR, 2025

2025
[24]

Henighan, J

T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse, J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray, et al. Scaling laws for autoregressive generative modeling.arXiv preprint arXiv:2010.14701, 2020

Pith/arXiv arXiv 2010
[25]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. InNeurIPS, 2020

2020
[26]

Ho and T

J. Ho and T. Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

Pith/arXiv arXiv 2022
[27]

W. Hong, M. Ding, W. Zheng, X. Liu, and J. Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

Pith/arXiv arXiv 2022
[28]

open-sora-pexels-45k

hpcai tech. open-sora-pexels-45k. https://huggingface.co/datasets/hpcai-tech/ open-sora-pexels-45k, 2025

2025
[29]

Huang, Y

Z. Huang, Y . He, J. Yu, F. Zhang, C. Si, Y . Jiang, Y . Zhang, T. Wu, Q. Jin, N. Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InCVPR, 2024

2024
[30]

X. Ju, Y . Gao, Z. Zhang, Z. Yuan, X. Wang, A. Zeng, Y . Xiong, Q. Xu, and Y . Shan. Miradata: A large-scale video dataset with long durations and structured captions. InNeurIPS, 2024

2024
[31]

Kaplan, S

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Rad- ford, J. Wu, and D. Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

Pith/arXiv arXiv 2001
[32]

D. P. Kingma, M. Welling, et al. Auto-encoding variational bayes, 2013

2013
[33]

Kirillov, E

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, et al. Segment anything. InICCV, 2023

2023
[34]

Kondratyuk, L

D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang, G. Schindler, R. Hornung, V . Birodkar, J. Yan, M.-C. Chiu, et al. Videopoet: A large language model for zero-shot video generation. InICML, 2024

2024
[35]

W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Pith/arXiv arXiv 2024
[36]

Kuznetsova, H

A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale.IJCV, 2020

2020
[37]

H. Li, T. Lan, Z. Fu, D. Cai, L. Liu, N. Collier, T. Watanabe, and Y . Su. Repetition in repetition out: Towards understanding neural text degeneration from the data perspective. InNeurIPS, 2023

2023
[38]

J. Li, D. Li, C. Xiong, and S. Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InICML, 2022

2022
[39]

T. Li, Y . Tian, H. Li, M. Deng, and K. He. Autoregressive image generation without vector quantization. InNeurIPS, 2024. 11

2024
[40]

Y . Li, H. Liu, Q. Wu, F. Mu, J. Yang, J. Gao, C. Li, and Y . J. Lee. Gligen: Open-set grounded text-to-image generation. InCVPR, 2023

2023
[41]

Z. Li, T. Cheng, S. Chen, P. Sun, H. Shen, L. Ran, X. Chen, W. Liu, and X. Wang. Controlar: Controllable image generation with autoregressive models. InICLR, 2025

2025
[42]

B. Lin, Y . Ge, X. Cheng, Z. Li, B. Zhu, S. Wang, X. He, Y . Ye, S. Yuan, L. Chen, et al. Open-sora plan: Open-source large video generation model.arXiv preprint arXiv:2412.00131, 2024

Pith/arXiv arXiv 2024
[43]

H. Liu, C. Li, Y . Li, and Y . J. Lee. Improved baselines with visual instruction tuning. InCVPR, 2024

2024
[44]

H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning. InNeuIPS, 2023

2023
[45]

Loshchilov and F

I. Loshchilov and F. Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

Pith/arXiv arXiv 2017
[46]

Megalith-huggingface

madebyollin. Megalith-huggingface. https://huggingface.co/datasets/ madebyollin/megalith-10m, 2024

2024
[47]

Mokady, A

R. Mokady, A. Hertz, K. Aberman, Y . Pritch, and D. Cohen-Or. Null-text inversion for editing real images using guided diffusion models. InCVPR, 2023

2023
[48]

C. Mou, X. Wang, L. Xie, Y . Wu, J. Zhang, Z. Qi, and Y . Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. InAAAI, 2024

2024
[49]

J. Mu, N. Vasconcelos, and X. Wang. Editar: Unified conditional generation with autoregres- sive models. InCVPR, 2025

2025
[50]

K. Nan, R. Xie, P. Zhou, T. Fan, Z. Yang, Z. Chen, X. Li, J. Yang, and Y . Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation.arXiv preprint arXiv:2407.02371, 2024

Pith/arXiv arXiv 2024
[51]

Peebles and S

W. Peebles and S. Xie. Scalable diffusion models with transformers. InCVPR, 2023

2023
[52]

Podell, Z

D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rom- bach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

Pith/arXiv arXiv 2023
[53]

synthetic-dataset-1m-dalle3-high-quality- captions

ProGamerGov. synthetic-dataset-1m-dalle3-high-quality- captions. https://huggingface.co/datasets/ProGamerGov/ synthetic-dataset-1m-dalle3-high-quality-captions, 2022

2022
[54]

C. Qin, S. Zhang, N. Yu, Y . Feng, X. Yang, Y . Zhou, H. Wang, J. C. Niebles, C. Xiong, S. Savarese, et al. Unicontrol: A unified diffusion model for controllable visual generation in the wild.arXiv preprint arXiv:2305.11147, 2023

arXiv 2023
[55]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervi- sion. InICML, 2021

2021
[56]

Radford, K

A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al. Improving language understanding by generative pre-training.OpenAI Blog, 2018

2018
[57]

Radford, J

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners.OpenAI Blog, 2019

2019
[58]

Rakhimov, D

R. Rakhimov, D. V olkhonskiy, A. Artemov, D. Zorin, and E. Burnaev. Latent video transformer. arXiv preprint arXiv:2006.10704, 2020

arXiv 2006
[59]

Ramesh, P

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 2022. 12

Pith/arXiv arXiv 2022
[60]

Ramesh, M

A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. V oss, A. Radford, M. Chen, and I. Sutskever. Zero-shot text-to-image generation. InICML, 2021

2021
[61]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022

2022
[62]

Sharma, N

P. Sharma, N. Ding, S. Goodman, and R. Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. InACL, 2018

2018
[63]

Sheynin, A

S. Sheynin, A. Polyak, U. Singer, Y . Kirstain, A. Zohar, O. Ashual, D. Parikh, and Y . Taigman. Emu edit: Precise image editing via recognition and generation tasks. InCVPR, 2024

2024
[64]

J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. InICLR, 2021

2021
[65]

K. Sun, J. Pan, Y . Ge, H. Li, H. Duan, X. Wu, R. Zhang, A. Zhou, Z. Qin, Y . Wang, et al. Journeydb: A benchmark for generative image understanding. InNeurIPS, 2023

2023
[66]

P. Sun, Y . Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

Pith/arXiv arXiv 2024
[67]

C. Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

Pith/arXiv arXiv 2024
[68]

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

Pith/arXiv arXiv 2023
[69]

K. Tian, Y . Jiang, Z. Yuan, B. Peng, and L. Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.arXiv preprint arXiv:2404.02905, 2024

arXiv 2024
[70]

R. Tian, Q. Dai, J. Bao, K. Qiu, Y . Yang, C. Luo, Z. Wu, and Y .-G. Jiang. Reducio! generating 1k video within 16 seconds using extremely compressed motion latents. InICCV, 2025

2025
[71]

R. Tian, M. Gao, M. Xu, J. Hu, J. Lu, Z. Wu, Y . Yang, and A. Dehghan. Unigen: Enhanced training & test-time strategies for unified multimodal understanding and generation. In NeurIPS, 2025

2025
[72]

Touvron, T

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023
[73]

Tumanyan, M

N. Tumanyan, M. Geyer, S. Bagon, and T. Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. InCVPR, 2023

2023
[74]

van den Oord and N

A. van den Oord and N. Kalchbrenner. Pixel rnn. InICML, 2016

2016
[75]

Van Den Oord, O

A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning. InNeurIPS, 2017

2017
[76]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. InNeurIPS, 2017

2017
[77]

Walker, A

J. Walker, A. Razavi, and A. v. d. Oord. Predicting video with vqvae.arXiv preprint arXiv:2103.01950, 2021

arXiv 2021
[78]

A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025
[79]

H. Wang, S. Suri, Y . Ren, H. Chen, and A. Shrivastava. Larp: Tokenizing videos with a learned autoregressive generative prior. InICLR, 2025

2025
[80]

J. Wang, D. Chen, Z. Wu, C. Luo, L. Zhou, Y . Zhao, Y . Xie, C. Liu, Y .-G. Jiang, and L. Yuan. Omnivl: One foundation model for image-language and video-language tasks. InNeurIPS, 2022. 13

2022

Showing first 80 references.

[1] [1]

Abdin, J

M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

Pith/arXiv arXiv 2024

[2] [2]

Achiam, S

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023

[3] [3]

J. Bai, S. Bai, Y . Chu, Z. Cui, K. Dang, X. Deng, Y . Fan, W. Ge, Y . Han, F. Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Pith/arXiv arXiv 2023

[4] [4]

Bolya, P.-Y

D. Bolya, P.-Y . Huang, P. Sun, J. H. Cho, A. Madotto, C. Wei, T. Ma, J. Zhi, J. Rajasegaran, H. Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network.arXiv preprint arXiv:2504.13181, 2025

Pith/arXiv arXiv 2025

[5] [5]

Brooks, A

T. Brooks, A. Holynski, and A. A. Efros. Instructpix2pix: Learning to follow image editing instructions. InCVPR, 2023

2023

[6] [6]

Carreira, E

J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman. A short note about kinetics-600.arXiv preprint arXiv:1808.01340, 2018

Pith/arXiv arXiv 2018

[7] [7]

Changpinyo, P

S. Changpinyo, P. Sharma, N. Ding, and R. Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. InCVPR, 2021

2021

[8] [8]

H. Chen, Y . Zhang, X. Cun, M. Xia, X. Wang, C. Weng, and Y . Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. InCVPR, 2024

2024

[9] [9]

J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y . Wu, Z. Wang, J. Kwok, P. Luo, H. Lu, et al. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis.arXiv preprint arXiv:2310.00426, 2023

Pith/arXiv arXiv 2023

[10] [10]

M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever. Generative pretraining from pixels. InICML, 2020

2020

[11] [11]

T.-S. Chen, A. Siarohin, W. Menapace, E. Deyneka, H.-w. Chao, B. E. Jeon, Y . Fang, H.-Y . Lee, J. Ren, M.-H. Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. InCVPR, 2024

2024

[12] [12]

Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InCVPR, 2024

2024

[13] [13]

Dhariwal and A

P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. InNeurIPS, 2021

2021

[14] [14]

Esser, S

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICLR, 2024

2024

[15] [15]

Esser, R

P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. InCVPR, 2021

2021

[16] [16]

N. et. al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Pith/arXiv arXiv 2025

[17] [17]

Z. Fu, W. Lam, A. M.-C. So, and B. Shi. A theoretical analysis of the repetition problem in text generation. InAAAI, 2021

2021

[18] [18]

S. Ge, T. Hayes, H. Yang, X. Yin, G. Pang, D. Jacobs, J.-B. Huang, and D. Parikh. Long video generation with time-agnostic vqgan and time-sensitive transformer. InECCV, 2022

2022

[19] [19]

Y . Ge, S. Zhao, C. Li, Y . Ge, and Y . Shan. Seed-data-edit technical report: A hybrid dataset for instructional image editing.arXiv preprint arXiv:2405.04007, 2024. 10

arXiv 2024

[20] [20]

Ghosh, H

D. Ghosh, H. Hajishirzi, and L. Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. InNeurIPS, 2024

2024

[21] [21]

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio. Generative adversarial nets. InNeurIPS, 2014

2014

[22] [22]

Y . Guo, C. Yang, A. Rao, Z. Liang, Y . Wang, Y . Qiao, M. Agrawala, D. Lin, and B. Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

Pith/arXiv arXiv 2023

[23] [23]

J. Han, J. Liu, Y . Jiang, B. Yan, Y . Zhang, Z. Yuan, B. Peng, and X. Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. InCVPR, 2025

2025

[24] [24]

Henighan, J

T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse, J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray, et al. Scaling laws for autoregressive generative modeling.arXiv preprint arXiv:2010.14701, 2020

Pith/arXiv arXiv 2010

[25] [25]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. InNeurIPS, 2020

2020

[26] [26]

Ho and T

J. Ho and T. Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

Pith/arXiv arXiv 2022

[27] [27]

W. Hong, M. Ding, W. Zheng, X. Liu, and J. Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

Pith/arXiv arXiv 2022

[28] [28]

open-sora-pexels-45k

hpcai tech. open-sora-pexels-45k. https://huggingface.co/datasets/hpcai-tech/ open-sora-pexels-45k, 2025

2025

[29] [29]

Huang, Y

Z. Huang, Y . He, J. Yu, F. Zhang, C. Si, Y . Jiang, Y . Zhang, T. Wu, Q. Jin, N. Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InCVPR, 2024

2024

[30] [30]

X. Ju, Y . Gao, Z. Zhang, Z. Yuan, X. Wang, A. Zeng, Y . Xiong, Q. Xu, and Y . Shan. Miradata: A large-scale video dataset with long durations and structured captions. InNeurIPS, 2024

2024

[31] [31]

Kaplan, S

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Rad- ford, J. Wu, and D. Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

Pith/arXiv arXiv 2001

[32] [32]

D. P. Kingma, M. Welling, et al. Auto-encoding variational bayes, 2013

2013

[33] [33]

Kirillov, E

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, et al. Segment anything. InICCV, 2023

2023

[34] [34]

Kondratyuk, L

D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang, G. Schindler, R. Hornung, V . Birodkar, J. Yan, M.-C. Chiu, et al. Videopoet: A large language model for zero-shot video generation. InICML, 2024

2024

[35] [35]

W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Pith/arXiv arXiv 2024

[36] [36]

Kuznetsova, H

A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale.IJCV, 2020

2020

[37] [37]

H. Li, T. Lan, Z. Fu, D. Cai, L. Liu, N. Collier, T. Watanabe, and Y . Su. Repetition in repetition out: Towards understanding neural text degeneration from the data perspective. InNeurIPS, 2023

2023

[38] [38]

J. Li, D. Li, C. Xiong, and S. Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InICML, 2022

2022

[39] [39]

T. Li, Y . Tian, H. Li, M. Deng, and K. He. Autoregressive image generation without vector quantization. InNeurIPS, 2024. 11

2024

[40] [40]

Y . Li, H. Liu, Q. Wu, F. Mu, J. Yang, J. Gao, C. Li, and Y . J. Lee. Gligen: Open-set grounded text-to-image generation. InCVPR, 2023

2023

[41] [41]

Z. Li, T. Cheng, S. Chen, P. Sun, H. Shen, L. Ran, X. Chen, W. Liu, and X. Wang. Controlar: Controllable image generation with autoregressive models. InICLR, 2025

2025

[42] [42]

B. Lin, Y . Ge, X. Cheng, Z. Li, B. Zhu, S. Wang, X. He, Y . Ye, S. Yuan, L. Chen, et al. Open-sora plan: Open-source large video generation model.arXiv preprint arXiv:2412.00131, 2024

Pith/arXiv arXiv 2024

[43] [43]

H. Liu, C. Li, Y . Li, and Y . J. Lee. Improved baselines with visual instruction tuning. InCVPR, 2024

2024

[44] [44]

H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning. InNeuIPS, 2023

2023

[45] [45]

Loshchilov and F

I. Loshchilov and F. Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

Pith/arXiv arXiv 2017

[46] [46]

Megalith-huggingface

madebyollin. Megalith-huggingface. https://huggingface.co/datasets/ madebyollin/megalith-10m, 2024

2024

[47] [47]

Mokady, A

R. Mokady, A. Hertz, K. Aberman, Y . Pritch, and D. Cohen-Or. Null-text inversion for editing real images using guided diffusion models. InCVPR, 2023

2023

[48] [48]

C. Mou, X. Wang, L. Xie, Y . Wu, J. Zhang, Z. Qi, and Y . Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. InAAAI, 2024

2024

[49] [49]

J. Mu, N. Vasconcelos, and X. Wang. Editar: Unified conditional generation with autoregres- sive models. InCVPR, 2025

2025

[50] [50]

K. Nan, R. Xie, P. Zhou, T. Fan, Z. Yang, Z. Chen, X. Li, J. Yang, and Y . Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation.arXiv preprint arXiv:2407.02371, 2024

Pith/arXiv arXiv 2024

[51] [51]

Peebles and S

W. Peebles and S. Xie. Scalable diffusion models with transformers. InCVPR, 2023

2023

[52] [52]

Podell, Z

D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rom- bach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

Pith/arXiv arXiv 2023

[53] [53]

synthetic-dataset-1m-dalle3-high-quality- captions

ProGamerGov. synthetic-dataset-1m-dalle3-high-quality- captions. https://huggingface.co/datasets/ProGamerGov/ synthetic-dataset-1m-dalle3-high-quality-captions, 2022

2022

[54] [54]

C. Qin, S. Zhang, N. Yu, Y . Feng, X. Yang, Y . Zhou, H. Wang, J. C. Niebles, C. Xiong, S. Savarese, et al. Unicontrol: A unified diffusion model for controllable visual generation in the wild.arXiv preprint arXiv:2305.11147, 2023

arXiv 2023

[55] [55]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervi- sion. InICML, 2021

2021

[56] [56]

Radford, K

A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al. Improving language understanding by generative pre-training.OpenAI Blog, 2018

2018

[57] [57]

Radford, J

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners.OpenAI Blog, 2019

2019

[58] [58]

Rakhimov, D

R. Rakhimov, D. V olkhonskiy, A. Artemov, D. Zorin, and E. Burnaev. Latent video transformer. arXiv preprint arXiv:2006.10704, 2020

arXiv 2006

[59] [59]

Ramesh, P

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 2022. 12

Pith/arXiv arXiv 2022

[60] [60]

Ramesh, M

A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. V oss, A. Radford, M. Chen, and I. Sutskever. Zero-shot text-to-image generation. InICML, 2021

2021

[61] [61]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022

2022

[62] [62]

Sharma, N

P. Sharma, N. Ding, S. Goodman, and R. Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. InACL, 2018

2018

[63] [63]

Sheynin, A

S. Sheynin, A. Polyak, U. Singer, Y . Kirstain, A. Zohar, O. Ashual, D. Parikh, and Y . Taigman. Emu edit: Precise image editing via recognition and generation tasks. InCVPR, 2024

2024

[64] [64]

J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. InICLR, 2021

2021

[65] [65]

K. Sun, J. Pan, Y . Ge, H. Li, H. Duan, X. Wu, R. Zhang, A. Zhou, Z. Qin, Y . Wang, et al. Journeydb: A benchmark for generative image understanding. InNeurIPS, 2023

2023

[66] [66]

P. Sun, Y . Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

Pith/arXiv arXiv 2024

[67] [67]

C. Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

Pith/arXiv arXiv 2024

[68] [68]

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

Pith/arXiv arXiv 2023

[69] [69]

K. Tian, Y . Jiang, Z. Yuan, B. Peng, and L. Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.arXiv preprint arXiv:2404.02905, 2024

arXiv 2024

[70] [70]

R. Tian, Q. Dai, J. Bao, K. Qiu, Y . Yang, C. Luo, Z. Wu, and Y .-G. Jiang. Reducio! generating 1k video within 16 seconds using extremely compressed motion latents. InICCV, 2025

2025

[71] [71]

R. Tian, M. Gao, M. Xu, J. Hu, J. Lu, Z. Wu, Y . Yang, and A. Dehghan. Unigen: Enhanced training & test-time strategies for unified multimodal understanding and generation. In NeurIPS, 2025

2025

[72] [72]

Touvron, T

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023

[73] [73]

Tumanyan, M

N. Tumanyan, M. Geyer, S. Bagon, and T. Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. InCVPR, 2023

2023

[74] [74]

van den Oord and N

A. van den Oord and N. Kalchbrenner. Pixel rnn. InICML, 2016

2016

[75] [75]

Van Den Oord, O

A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning. InNeurIPS, 2017

2017

[76] [76]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. InNeurIPS, 2017

2017

[77] [77]

Walker, A

J. Walker, A. Razavi, and A. v. d. Oord. Predicting video with vqvae.arXiv preprint arXiv:2103.01950, 2021

arXiv 2021

[78] [78]

A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025

[79] [79]

H. Wang, S. Suri, Y . Ren, H. Chen, and A. Shrivastava. Larp: Tokenizing videos with a learned autoregressive generative prior. InICLR, 2025

2025

[80] [80]

J. Wang, D. Chen, Z. Wu, C. Luo, L. Zhou, Y . Zhao, Y . Xie, C. Liu, Y .-G. Jiang, and L. Yuan. Omnivl: One foundation model for image-language and video-language tasks. InNeurIPS, 2022. 13

2022