{"work":{"id":"720d288e-fac0-464c-9929-19efd9a52afc","openalex_id":null,"doi":null,"arxiv_id":"2409.18869","raw_key":null,"title":"Emu3: Next-Token Prediction is All You Need","authors":null,"authors_text":"Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang","year":2024,"venue":"cs.CV","abstract":"While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.","external_url":"https://arxiv.org/abs/2409.18869","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T08:15:33.781802+00:00","pith_arxiv_id":"2409.18869","created_at":"2026-05-09T22:44:15.259675+00:00","updated_at":"2026-05-25T08:15:33.781802+00:00","title_quality_ok":true,"display_title":"Emu3: Next-Token Prediction is All You Need","render_title":"Emu3: Next-Token Prediction is All You Need"},"hub":{"state":{"work_id":"720d288e-fac0-464c-9929-19efd9a52afc","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":103,"external_cited_by_count":null,"distinct_field_count":9,"first_pith_cited_at":"2024-08-22T16:32:32+00:00","last_pith_cited_at":"2026-05-21T11:30:29+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-29T22:50:29.179866+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"baseline","n":19},{"context_role":"background","n":18},{"context_role":"other","n":2},{"context_role":"dataset","n":1},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"baseline","n":19},{"context_polarity":"background","n":18},{"context_polarity":"unclear","n":2},{"context_polarity":"use_dataset","n":1},{"context_polarity":"use_method","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Emu3: Next-Token Prediction is All You Need","claims":[{"claim_text":"While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-spe","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Nucleus-ImageTECHNICALREPORT Figure 11: GenEval overall scores for top-performing models. Nucleus-Image matches Qwen-Image at 0.87 and leads all models on spatial position understanding. Table 9: GenEval results. Model Single Two Count Colors Position Attr. Overall Show-o [44] 0.95 0.52 0.49 0.82 0.11 0.28 0.53 PixArt-α[45] 0.98 0.50 0.44 0.80 0.08 0.07 0.48 Emu3-Gen [46] 0.98 0.71 0.34 0.81 0.17 0.21 0.54 SD3 Medium [27] 0.98 0.74 0.63 0.67 0.34 0.36 0.62 FLUX.1 Dev [40] 0.98 0.81 0.74 0.79 0.2","claim_type":"baseline","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"comprehensive multilingual benchmark derived from real-world user instructions. BaselinesWe compare against both generation-only and unified models. For image generation, generation-only baselines include SDXL [38], Stable Diffusion 3.5 Medium/Large [11], FLUX.1- dev [3], Infinity [17], OmniGen2 [57] and Wan2.2-t2i-plus, unified models include Janus [56], Janus- Pro [6], Emu3 [54], OneCAT [27], Janus-Flow [34], BLIP3-o [5], UniWorld-V1 [31], Mogao [30] and BAGEL [9]. For editing, generation-only","claim_type":"baseline","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Diffusion model alignment using direct preference optimization, 2023. [35] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, et al. Large language models are not fair evaluators. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9440-9450, 2024. [36] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li,","claim_type":"other","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"consistent hand movement of the person holding a cup in the last case. Overall, these results demonstrate 17 (a) VBench Metrics Part I Models Params. Quality Score Semantic Score Subj. Consist. Bkg. Consist. Temp. Flicker. Motion Smooth. Dynamic Degree Aesthetic Quality Imaging Quality Object Class Generation-only Models ModelScope [112] 1.7B 78.05 66.54 89.87 95.29 98.28 95.79 66.39 52.06 58.57 82.25 LaVie [116] 3B 78.78 70.31 91.41 97.47 98.30 96.38 49.72 54.94 61.90 91.82 Show-1 [143] 6B 80.4","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"0LLaV A[1] AR Vicuna-7B - - - - -/809 37.0 38.7 - -LLaV A-1.5[2] AR Vicuna-7B - - - - -/1510 66.1 64.3 - -Qwen2-VL[5] AR Qwen2-7B 54.1 43.5 - 60.7 - - - - 58.2DeepSeek-VL[79] AR DeepSeek-7B36.6 - - - - 70.4 73.2 - -DeepSeek-VL2[80] AR - 51.1 - - 61.3 - - - - 62.8Janus[81] AR DeepSeek-1.3B 30.5 - - - -/1338 63.7 69.4 - -Janus-Pro[82] AR DeepSeek-7B 41.0 - - - -/1567 72.1 79.2 - -Emu3[83] AR - 31.6 - - - - 68.2 58.5 - -MAmmoTH[55] AR Qwen2.5-7B50.8 - 25.3 63.0 - 76.0 - 34.2 67.6LLaV A-OV[3] AR Qwe","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"hang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, et al. The all-seeing project: Towards panoptic visual recognition and understanding of the open world. In ICLR, 2024. 1, 3 [103] Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, et al. Needle in a multimodal haystack. arXiv preprint arXiv:2406.07230, 2024. 3 [104] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang,","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Emu3: Next-Token Prediction is All You Need because it crossed a citation-hub threshold. Current citing contexts most often use it as baseline evidence (19 contexts).","role_counts":[{"n":19,"context_role":"baseline"},{"n":18,"context_role":"background"},{"n":2,"context_role":"other"},{"n":1,"context_role":"dataset"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-23T08:04:21.927168+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"d0d468d1-1f5a-458c-b7fe-f129c9678b91","orcid":null,"display_name":"Xinlong Wang"},{"id":"5c640aa1-1c84-4ac1-bf1e-04b328ce8f24","orcid":null,"display_name":"Xiaosong Zhang"},{"id":"8f31ae33-361f-4e27-a880-0a2869b5096d","orcid":null,"display_name":"Zhengxiong Luo"},{"id":"ad9117a5-4ba7-43cb-b864-e877731def01","orcid":null,"display_name":"Quan Sun"},{"id":"e086e60c-db40-49db-92dd-2a66b963588f","orcid":null,"display_name":"Yufeng Cui"},{"id":"b74076d9-3582-479d-b7c0-55e370de8b96","orcid":null,"display_name":"Jinsheng Wang"}]},"error":null,"updated_at":"2026-05-23T08:04:22.824499+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T13:21:17.915492+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling","work_id":"67d9e391-26d1-459e-ab56-07e60511c886","shared_citers":31},{"title":"Emerging Properties in Unified Multimodal Pretraining","work_id":"e0cfd82c-f5d4-44fd-b531-ec73ab0a805b","shared_citers":29},{"title":"Chameleon: Mixed-Modal Early-Fusion Foundation Models","work_id":"2661b9a6-25cc-41a1-8100-612d2b801289","shared_citers":25},{"title":"BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset","work_id":"86d896d2-592f-4d9b-938e-dfeb11f9388f","shared_citers":21},{"title":"Show-o: One Single Transformer to Unify Multimodal Understanding and Generation","work_id":"1393dc24-a6b2-44e1-b5d7-7009d1fa4811","shared_citers":20},{"title":"OmniGen2: Towards Instruction-Aligned Multimodal Generation","work_id":"d3153e5f-b6e2-4ab3-9f41-e24e24d64496","shared_citers":18},{"title":"Show-o2: Improved Native Unified Multimodal Models","work_id":"77f00563-1ce6-4fba-9d4e-c8ce83f716ac","shared_citers":18},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":17},{"title":"SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis","work_id":"8034c587-fba6-4941-87ba-c98f2ac962cb","shared_citers":16},{"title":"Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model","work_id":"c2bb4d2d-29de-4bf2-9150-3d6373ff358f","shared_citers":15},{"title":"UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation","work_id":"488a273e-95d8-46f1-87c7-2244068d00d0","shared_citers":15},{"title":"ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment","work_id":"94248955-4bc5-4517-98a0-66224a36d865","shared_citers":12},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":12},{"title":"Qwen-Image Technical Report","work_id":"d06d7ecc-7579-4f89-a60b-4278a0f3c562","shared_citers":12},{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":12},{"title":"Seed-x: Multimodal models with unified multi-granularity comprehension and generation","work_id":"15953092-dd9e-49ae-9f72-e28fc93a6068","shared_citers":11},{"title":"Transfer between modalities with metaqueries","work_id":"a89f31c1-6a3f-451e-9971-692c11219ea3","shared_citers":11},{"title":"Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation","work_id":"41efe203-9377-4c63-b1d6-e499cd6e46f6","shared_citers":10},{"title":"Metamorph: Multimodal understanding and generation via instruction tuning","work_id":"cc1e25c3-90c2-4e87-9268-d28d48c546a1","shared_citers":10},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":10},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":10},{"title":"Vila-u: a unified foundation model integrating visual understanding and generation","work_id":"7039c3ef-6ce4-4c98-96ed-9eef3d669045","shared_citers":10},{"title":"FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space","work_id":"5dfe19d5-3541-4803-8fe9-3c8b9e29b281","shared_citers":9},{"title":"Mogao: An omni foundation model for interleaved multi-modal generation","work_id":"f2badd0e-c06a-45f9-9d9e-7ceda62176b8","shared_citers":9}],"time_series":[{"n":1,"year":2024},{"n":11,"year":2025},{"n":36,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T13:31:06.093164+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T13:21:14.445506+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Emu3: Next-Token Prediction is All You Need","claims":[{"claim_text":"While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-spe","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Nucleus-ImageTECHNICALREPORT Figure 11: GenEval overall scores for top-performing models. Nucleus-Image matches Qwen-Image at 0.87 and leads all models on spatial position understanding. Table 9: GenEval results. Model Single Two Count Colors Position Attr. Overall Show-o [44] 0.95 0.52 0.49 0.82 0.11 0.28 0.53 PixArt-α[45] 0.98 0.50 0.44 0.80 0.08 0.07 0.48 Emu3-Gen [46] 0.98 0.71 0.34 0.81 0.17 0.21 0.54 SD3 Medium [27] 0.98 0.74 0.63 0.67 0.34 0.36 0.62 FLUX.1 Dev [40] 0.98 0.81 0.74 0.79 0.2","claim_type":"baseline","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"comprehensive multilingual benchmark derived from real-world user instructions. BaselinesWe compare against both generation-only and unified models. For image generation, generation-only baselines include SDXL [38], Stable Diffusion 3.5 Medium/Large [11], FLUX.1- dev [3], Infinity [17], OmniGen2 [57] and Wan2.2-t2i-plus, unified models include Janus [56], Janus- Pro [6], Emu3 [54], OneCAT [27], Janus-Flow [34], BLIP3-o [5], UniWorld-V1 [31], Mogao [30] and BAGEL [9]. For editing, generation-only","claim_type":"baseline","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Diffusion model alignment using direct preference optimization, 2023. [35] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, et al. Large language models are not fair evaluators. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9440-9450, 2024. [36] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li,","claim_type":"other","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"consistent hand movement of the person holding a cup in the last case. Overall, these results demonstrate 17 (a) VBench Metrics Part I Models Params. Quality Score Semantic Score Subj. Consist. Bkg. Consist. Temp. Flicker. Motion Smooth. Dynamic Degree Aesthetic Quality Imaging Quality Object Class Generation-only Models ModelScope [112] 1.7B 78.05 66.54 89.87 95.29 98.28 95.79 66.39 52.06 58.57 82.25 LaVie [116] 3B 78.78 70.31 91.41 97.47 98.30 96.38 49.72 54.94 61.90 91.82 Show-1 [143] 6B 80.4","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"0LLaV A[1] AR Vicuna-7B - - - - -/809 37.0 38.7 - -LLaV A-1.5[2] AR Vicuna-7B - - - - -/1510 66.1 64.3 - -Qwen2-VL[5] AR Qwen2-7B 54.1 43.5 - 60.7 - - - - 58.2DeepSeek-VL[79] AR DeepSeek-7B36.6 - - - - 70.4 73.2 - -DeepSeek-VL2[80] AR - 51.1 - - 61.3 - - - - 62.8Janus[81] AR DeepSeek-1.3B 30.5 - - - -/1338 63.7 69.4 - -Janus-Pro[82] AR DeepSeek-7B 41.0 - - - -/1567 72.1 79.2 - -Emu3[83] AR - 31.6 - - - - 68.2 58.5 - -MAmmoTH[55] AR Qwen2.5-7B50.8 - 25.3 63.0 - 76.0 - 34.2 67.6LLaV A-OV[3] AR Qwe","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"hang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, et al. The all-seeing project: Towards panoptic visual recognition and understanding of the open world. In ICLR, 2024. 1, 3 [103] Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, et al. Needle in a multimodal haystack. arXiv preprint arXiv:2406.07230, 2024. 3 [104] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang,","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Emu3: Next-Token Prediction is All You Need because it crossed a citation-hub threshold. Current citing contexts most often use it as baseline evidence (19 contexts).","role_counts":[{"n":19,"context_role":"baseline"},{"n":18,"context_role":"background"},{"n":2,"context_role":"other"},{"n":1,"context_role":"dataset"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-23T08:04:22.828107+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Emu3: Next-Token Prediction is All You Need","claims":[{"claim_text":"While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-spe","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Emu3: Next-Token Prediction is All You Need because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T13:31:08.425262+00:00"}},"summary":{"title":"Emu3: Next-Token Prediction is All You Need","claims":[{"claim_text":"While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-spe","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Emu3: Next-Token Prediction is All You Need because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling","work_id":"67d9e391-26d1-459e-ab56-07e60511c886","shared_citers":31},{"title":"Emerging Properties in Unified Multimodal Pretraining","work_id":"e0cfd82c-f5d4-44fd-b531-ec73ab0a805b","shared_citers":29},{"title":"Chameleon: Mixed-Modal Early-Fusion Foundation Models","work_id":"2661b9a6-25cc-41a1-8100-612d2b801289","shared_citers":25},{"title":"BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset","work_id":"86d896d2-592f-4d9b-938e-dfeb11f9388f","shared_citers":21},{"title":"Show-o: One Single Transformer to Unify Multimodal Understanding and Generation","work_id":"1393dc24-a6b2-44e1-b5d7-7009d1fa4811","shared_citers":20},{"title":"OmniGen2: Towards Instruction-Aligned Multimodal Generation","work_id":"d3153e5f-b6e2-4ab3-9f41-e24e24d64496","shared_citers":18},{"title":"Show-o2: Improved Native Unified Multimodal Models","work_id":"77f00563-1ce6-4fba-9d4e-c8ce83f716ac","shared_citers":18},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":17},{"title":"SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis","work_id":"8034c587-fba6-4941-87ba-c98f2ac962cb","shared_citers":16},{"title":"Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model","work_id":"c2bb4d2d-29de-4bf2-9150-3d6373ff358f","shared_citers":15},{"title":"UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation","work_id":"488a273e-95d8-46f1-87c7-2244068d00d0","shared_citers":15},{"title":"ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment","work_id":"94248955-4bc5-4517-98a0-66224a36d865","shared_citers":12},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":12},{"title":"Qwen-Image Technical Report","work_id":"d06d7ecc-7579-4f89-a60b-4278a0f3c562","shared_citers":12},{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":12},{"title":"Seed-x: Multimodal models with unified multi-granularity comprehension and generation","work_id":"15953092-dd9e-49ae-9f72-e28fc93a6068","shared_citers":11},{"title":"Transfer between modalities with metaqueries","work_id":"a89f31c1-6a3f-451e-9971-692c11219ea3","shared_citers":11},{"title":"Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation","work_id":"41efe203-9377-4c63-b1d6-e499cd6e46f6","shared_citers":10},{"title":"Metamorph: Multimodal understanding and generation via instruction tuning","work_id":"cc1e25c3-90c2-4e87-9268-d28d48c546a1","shared_citers":10},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":10},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":10},{"title":"Vila-u: a unified foundation model integrating visual understanding and generation","work_id":"7039c3ef-6ce4-4c98-96ed-9eef3d669045","shared_citers":10},{"title":"FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space","work_id":"5dfe19d5-3541-4803-8fe9-3c8b9e29b281","shared_citers":9},{"title":"Mogao: An omni foundation model for interleaved multi-modal generation","work_id":"f2badd0e-c06a-45f9-9d9e-7ceda62176b8","shared_citers":9}],"time_series":[{"n":1,"year":2024},{"n":11,"year":2025},{"n":36,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"b74076d9-3582-479d-b7c0-55e370de8b96","orcid":null,"display_name":"Jinsheng Wang","source":"manual","import_confidence":0.72},{"id":"ad9117a5-4ba7-43cb-b864-e877731def01","orcid":null,"display_name":"Quan Sun","source":"manual","import_confidence":0.72},{"id":"5c640aa1-1c84-4ac1-bf1e-04b328ce8f24","orcid":null,"display_name":"Xiaosong Zhang","source":"manual","import_confidence":0.72},{"id":"d0d468d1-1f5a-458c-b7fe-f129c9678b91","orcid":null,"display_name":"Xinlong Wang","source":"manual","import_confidence":0.72},{"id":"e086e60c-db40-49db-92dd-2a66b963588f","orcid":null,"display_name":"Yufeng Cui","source":"manual","import_confidence":0.72},{"id":"8f31ae33-361f-4e27-a880-0a2869b5096d","orcid":null,"display_name":"Zhengxiong Luo","source":"manual","import_confidence":0.72}]}}