Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
Pith reviewed 2026-05-18 02:44 UTC · model grok-4.3
The pith
Mixture-of-Transformers matches dense multi-modal performance at roughly half the compute by using modality-specific parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By making non-embedding parameters modality-specific while keeping self-attention global over the full input sequence, the architecture matches a dense baseline transformer on multi-modal tasks with substantially fewer pretraining FLOPs (55.8% of the dense baseline's FLOPs in the Chameleon 7B text-and-image setting).
What carries the argument
Mixture-of-Transformers, which decouples feed-forward networks, attention matrices, and layer normalizations into per-modality copies while applying shared self-attention over the entire mixed input sequence.
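The routing described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the single attention head, the ReLU feed-forward block, the absence of a causal mask and positional encoding, the modality-specific output projection `Wo`, and every name in the block (`mot_layer`, `init_params`, the parameter keys) are assumptions made for illustration. What it shows is only the pattern: each token picks its parameters by modality, while attention weights are computed over all tokens together.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [dot(row, x) for row in W]

def layer_norm(x, gain, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [g * (xi - mean) / math.sqrt(var + eps) for g, xi in zip(gain, x)]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def init_params(d, d_ff, rng):
    """Hypothetical parameter set for ONE modality (names are illustrative)."""
    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "ln1": [1.0] * d, "ln2": [1.0] * d,
        "Wq": mat(d, d), "Wk": mat(d, d), "Wv": mat(d, d), "Wo": mat(d, d),
        "W1": mat(d_ff, d), "W2": mat(d, d_ff),
    }

def mot_layer(tokens, modalities, params):
    """One simplified layer: per-modality parameters, global self-attention."""
    d = len(tokens[0])
    # Step 1: each token uses the norm and Q/K/V projections of its own modality.
    q, k, v = [], [], []
    for x, m in zip(tokens, modalities):
        p = params[m]
        h = layer_norm(x, p["ln1"])
        q.append(matvec(p["Wq"], h))
        k.append(matvec(p["Wk"], h))
        v.append(matvec(p["Wv"], h))
    # Step 2: attention is global, so every query scores every key,
    # regardless of which modality produced it.
    attended = []
    for qi in q:
        weights = softmax([dot(qi, kj) / math.sqrt(d) for kj in k])
        attended.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(d)])
    # Step 3: output projection, second norm, and FFN are again per modality.
    out = []
    for x, a, m in zip(tokens, attended, modalities):
        p = params[m]
        h = [xi + oi for xi, oi in zip(x, matvec(p["Wo"], a))]
        z = layer_norm(h, p["ln2"])
        hidden = [max(0.0, u) for u in matvec(p["W1"], z)]
        out.append([hi + fi for hi, fi in zip(h, matvec(p["W2"], hidden))])
    return out

rng = random.Random(0)
params = {m: init_params(4, 8, rng) for m in ("text", "image")}
modalities = ["text", "text", "image", "image", "text"]
tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in modalities]
output = mot_layer(tokens, modalities, params)
```

In this sketch the only place where text and image tokens exchange information is Step 2, which is why the premise about global self-attention carries so much weight.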
Load-bearing premise
Global self-attention alone can preserve all necessary cross-modal interactions even after most parameters are made modality-specific.
What would settle it
A head-to-head evaluation on tasks that require precise alignment across modalities, such as generating images from spoken instructions, to check whether quality falls below the dense baseline at matched compute.
Original abstract
The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality (including feed-forward networks, attention matrices, and layer normalization), enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).
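The percentages in the abstract are fractions of the dense baseline's cost, not reductions. The short sketch below converts each quoted fraction into the corresponding relative reduction and equivalent speedup factor. The fractions are taken from the abstract; the function name `savings` and the dictionary labels are illustrative only.

```python
# Cost fractions relative to the dense baseline, as quoted in the abstract.
cost_fraction = {
    "FLOPs, Chameleon 7B text+image": 0.558,
    "FLOPs, Chameleon with speech (speech modality)": 0.372,
    "FLOPs, Transfusion 7B (image modality)": 1.0 / 3.0,
    "Wall-clock, image quality": 0.472,
    "Wall-clock, text quality": 0.756,
}

def savings(fraction):
    """Return (relative reduction, equivalent speedup) for a cost fraction."""
    return 1.0 - fraction, 1.0 / fraction

for setting, fraction in cost_fraction.items():
    reduction, speedup = savings(fraction)
    print(f"{setting}: {fraction:.1%} of dense cost, "
          f"{reduction:.1%} reduction, {speedup:.2f}x speedup")
```

Reading the numbers this way makes clear that, for example, reaching text quality in 75.6% of the wall-clock time is a saving of about a quarter, not three quarters.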
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Mixture-of-Transformers (MoT), a sparse multi-modal architecture that decouples non-embedding parameters (FFNs, attention matrices, and layer norms) by modality while retaining global self-attention over the full sequence. It reports empirical results on Chameleon (text-image and text-image-speech autoregressive generation) and Transfusion (text-image with differing objectives), claiming that MoT matches or exceeds dense baseline performance at substantially lower compute: 55.8% of the FLOPs for 7B Chameleon text-image, 37.2% of the FLOPs for speech when speech is added, one third of the FLOPs for 7B Transfusion image metrics, and a 760M MoT outperforming a 1.4B dense baseline on image metrics. It also reports reaching dense-baseline quality in 47.2% of the wall-clock time for images and 75.6% for text.
Significance. If the performance parity holds under controlled conditions, the work offers a practical route to scaling multi-modal foundation models with lower pretraining costs. The concrete FLOPs and wall-clock measurements across two distinct regimes (Chameleon and Transfusion) and multiple scales constitute a clear strength, as do the direct comparisons to dense baselines at matched or smaller parameter counts.
major comments (2)
- [§3] §3 (Architecture): The central design assumes that modality-specific Q/K/V projections and FFNs plus global self-attention preserve all necessary cross-modal interactions present in the shared-parameter dense baseline. Because attention scores between tokens of different modalities are now computed between vectors from independent linear maps, the subspaces may be misaligned; no analysis of attention patterns across modalities or controlled probing tasks is reported to test whether interaction quality is maintained. This assumption is load-bearing for attributing the reported compute savings (55.8%, 37.2%, and one third of the dense baseline's FLOPs) to the architecture rather than other factors.
- [§4.1] §4.1 (Chameleon experiments): The claim that a 7B MoT matches the dense baseline at 55.8% FLOPs requires confirmation that both models used identical data, tokenization, optimizer schedules, and total training steps. Any deviation in effective training compute or regularization would undermine the direct attribution of parity to the sparse design.
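One way the attention-pattern analysis requested in the first major comment could be set up is sketched below. It is a hypothetical metric, not something reported in the paper: given a row-normalized attention matrix and a modality label per token, it averages how much attention mass queries of one modality place on keys of each modality. The function name and the toy matrix are made up for illustration.

```python
def cross_modal_attention_mass(attn, modalities):
    """Average attention mass from each query modality to each key modality.

    attn[i][j] is the weight query i assigns to key j (each row sums to 1).
    Returns {(query_modality, key_modality): mean mass per query}.
    """
    totals = {}
    query_counts = {}
    for i, row in enumerate(attn):
        q_mod = modalities[i]
        query_counts[q_mod] = query_counts.get(q_mod, 0) + 1
        for j, weight in enumerate(row):
            pair = (q_mod, modalities[j])
            totals[pair] = totals.get(pair, 0.0) + weight
    return {pair: total / query_counts[pair[0]] for pair, total in totals.items()}

# Toy 3-token example (values are invented, not measurements).
toy_modalities = ["text", "text", "image"]
toy_attn = [
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
]
mass = cross_modal_attention_mass(toy_attn, toy_modalities)
```

Comparing such a table between a MoT model and its dense baseline, layer by layer, would give a first indication of whether cross-modal attention is being used to a similar degree; it would not by itself show that the attended information is equally useful.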
minor comments (2)
- [Table 1] Table 1 and associated text: the exact breakdown of FLOPs into embedding vs. non-embedding components should be stated explicitly so readers can verify the 55.8% and 37.2% figures independently.
- [Figure 3] Figure 3 (wall-clock profiling): adding standard deviation across multiple runs or at least two independent seeds would strengthen the reported wall-clock results (dense-baseline quality reached in 47.2% of the time for images and 75.6% for text).
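The variance reporting asked for in the last minor comment amounts to a mean and sample standard deviation over per-seed measurements. A minimal sketch follows; the three ratios are hypothetical placeholders, not values from the paper, and `summarize_runs` is an illustrative name.

```python
import statistics

def summarize_runs(ratios):
    """Mean and sample standard deviation of per-seed wall-clock ratios.

    Each ratio is (MoT time to reach target quality) / (dense time).
    At least two runs are required for a sample standard deviation.
    """
    if len(ratios) < 2:
        raise ValueError("need at least two independent runs")
    return statistics.mean(ratios), statistics.stdev(ratios)

# Hypothetical per-seed ratios, for illustration only.
hypothetical_ratios = [0.46, 0.48, 0.47]
mean_ratio, std_ratio = summarize_runs(hypothetical_ratios)
print(f"wall-clock ratio: {mean_ratio:.3f} +/- {std_ratio:.3f}")
```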
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below with clarifications based on the manuscript and indicate revisions where appropriate to strengthen the presentation of our results.
- Referee: [§3] §3 (Architecture): The central design assumes that modality-specific Q/K/V projections and FFNs plus global self-attention preserve all necessary cross-modal interactions present in the shared-parameter dense baseline. Because attention scores between tokens of different modalities are now computed between vectors from independent linear maps, the subspaces may be misaligned; no analysis of attention patterns across modalities or controlled probing tasks is reported to test whether interaction quality is maintained. This assumption is load-bearing for attributing the reported compute savings (55.8%, 37.2%, and one third of the dense baseline's FLOPs) to the architecture rather than other factors.
  Authors: We agree that verifying the quality of cross-modal interactions is important for attributing performance to the architecture. In MoT, modality-specific projections are used for Q/K/V and FFNs, but self-attention remains fully global over the concatenated sequence, allowing tokens from different modalities to directly attend to one another. This preserves the mechanism for learning cross-modal alignments, unlike fully decoupled models. Our results on Chameleon and Transfusion show that MoT matches or exceeds dense baselines on multi-modal tasks, which would be unlikely if interactions were substantially degraded. That said, we acknowledge the value of direct evidence and will add attention pattern analysis (e.g., average cross-modal attention scores and visualizations) plus a small probing experiment in the revised manuscript to explicitly compare interaction quality with the dense baseline.
  Revision: yes
- Referee: [§4.1] §4.1 (Chameleon experiments): The claim that a 7B MoT matches the dense baseline at 55.8% FLOPs requires confirmation that both models used identical data, tokenization, optimizer schedules, and total training steps. Any deviation in effective training compute or regularization would undermine the direct attribution of parity to the sparse design.
  Authors: Both the 7B dense baseline and 7B MoT were trained with exactly the same data mixture, tokenization, optimizer (AdamW), learning-rate schedule, batch size, and total number of training steps. This controlled setup is described in Section 4.1 and the experimental details appendix; the only difference is the architecture itself. FLOPs are measured via standard forward-pass accounting on the same hardware, and wall-clock times are reported from identical training runs. We will add a short explicit statement in Section 4.1 reiterating the matched training protocol to make this clearer for readers.
  Revision: partial
Circularity Check
No significant circularity; results are direct empirical comparisons
The paper proposes the MoT architecture and reports measured performance and FLOPs reductions against dense baselines in Chameleon and Transfusion settings. These outcomes are obtained through training and evaluation rather than any derivation that reduces a claimed prediction to a fitted parameter or self-citation by construction. No equations define target metrics as functions of inputs chosen to match them, and the central claims rest on experimental data rather than self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard transformer self-attention and feed-forward blocks can be trained end-to-end with modality-specific parameter copies without loss of cross-modal capability.
Forward citations
Cited by 18 Pith papers
- MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
  MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
- Action Emergence from Streaming Intent
  A new VLA model called SI uses a four-step chain-of-thought to derive driving intent and applies it via classifier-free guidance to a flow-matching trajectory generator, showing competitive Waymo scores and intent-con...
- NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
  NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
- Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models
  MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.
- Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models
  Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.
- ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
  ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on ...
- MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
  MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
- Action Emergence from Streaming Intent
  Streaming Intent lets a VLA model derive driving intent via streamed chain-of-thought reasoning and use it to steer a flow-matching action head, yielding competitive Waymo scores plus intent-based trajectory control w...
- SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness
  SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
- Meta-CoT: Enhancing Granularity and Generalization in Image Editing
  Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
- DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
  DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.
- F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
  F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.
- Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
  Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...
- SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
  SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
- Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
  Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.
- GR-3 Technical Report
  GR-3 is a VLA model that generalizes to novel objects, environments, and abstract instructions, outperforms the π0 baseline, and integrates with the new ByteMini bi-manual mobile robot.
- OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL
  OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.
- Wan-Image: Pushing the Boundaries of Generative Visual Intelligence
  Wan-Image is a unified multi-modal system that integrates LLMs and diffusion transformers to deliver professional-grade image generation features including complex typography, multi-subject consistency, and precise ed...