pith. machine review for the scientific record.

arxiv: 2411.04996 · v2 · pith:V26QLI5Dnew · submitted 2024-11-07 · 💻 cs.CL

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Pith reviewed 2026-05-18 02:44 UTC · model grok-4.3

classification 💻 cs.CL
keywords mixture-of-transformers · multi-modal models · sparse architecture · compute efficiency · modality-specific parameters · transformer pretraining · foundation models

The pith

Mixture-of-Transformers matches dense multi-modal performance at roughly half the compute by using modality-specific parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Mixture-of-Transformers as a sparse architecture for models that jointly handle text, images, and speech. It assigns separate parameters for feed-forward networks, attention matrices, and layer normalizations to each modality while keeping self-attention global across the full sequence. This separation targets the high cost of pretraining large multi-modal systems. Experiments show that the approach reaches quality comparable to or better than standard dense models with substantially fewer floating-point operations and less wall-clock time. The result points to a way to scale these models without proportional growth in resources.

Core claim

By making non-embedding parameters modality-specific and keeping global self-attention as the only operation that mixes tokens across modalities, the architecture matches a dense baseline transformer on multi-modal tasks while using substantially fewer pretraining FLOPs (55.8% of the dense baseline's FLOPs in the Chameleon 7B text-and-image setting).

What carries the argument

Mixture-of-Transformers keeps a separate copy of the feed-forward networks, attention projection matrices, and layer normalizations for each modality, while computing self-attention globally over the entire mixed input sequence.
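The mechanism described above can be sketched in a few dozen lines. What follows is a minimal illustration, not the authors' implementation: it assumes a single attention head, a pre-norm residual layout, a ReLU feed-forward network, causal masking, and no positional encoding. The names `mot_block` and `init_params` and the modality tags are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def layer_norm(x, gain, eps=1e-5):
    """Normalize x to zero mean and unit variance, then scale by gain."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) for g, v in zip(gain, x)]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rng, rows, cols):
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def init_params(rng, modalities, d_model, d_ff):
    """Create one independent set of non-embedding weights per modality."""
    params = {}
    for m in modalities:
        params[m] = {
            "ln1": [1.0] * d_model,
            "ln2": [1.0] * d_model,
            "wq": rand_matrix(rng, d_model, d_model),
            "wk": rand_matrix(rng, d_model, d_model),
            "wv": rand_matrix(rng, d_model, d_model),
            "wo": rand_matrix(rng, d_model, d_model),
            "w1": rand_matrix(rng, d_ff, d_model),
            "w2": rand_matrix(rng, d_model, d_ff),
        }
    return params

def mot_block(xs, mods, params):
    """Apply one block. xs: token vectors; mods: modality tag of each token."""
    d_model = len(xs[0])
    # Step 1: each token uses its own modality's layer norm and Q/K/V weights.
    q, k, v = [], [], []
    for x, m in zip(xs, mods):
        p = params[m]
        h = layer_norm(x, p["ln1"])
        q.append(matvec(p["wq"], h))
        k.append(matvec(p["wk"], h))
        v.append(matvec(p["wv"], h))
    out = []
    for i, (x, m) in enumerate(zip(xs, mods)):
        p = params[m]
        # Step 2: global causal attention; token i sees itself and all
        # earlier tokens, regardless of their modality.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_model)
                  for j in range(i + 1)]
        weights = softmax(scores)
        ctx = [sum(weights[j] * v[j][c] for j in range(i + 1))
               for c in range(d_model)]
        # Step 3: modality-specific output projection, norm, and FFN.
        a = [xi + oi for xi, oi in zip(x, matvec(p["wo"], ctx))]
        h = layer_norm(a, p["ln2"])
        hidden = [max(0.0, u) for u in matvec(p["w1"], h)]
        out.append([ai + fi for ai, fi in zip(a, matvec(p["w2"], hidden))])
    return out
```

In this sketch the only place where a text token and an image token meet is the attention step. Every weight matrix is looked up by the token's own modality tag, which is the property the load-bearing premise below depends on.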

Load-bearing premise

Global self-attention alone can preserve all necessary cross-modal interactions even after most parameters are made modality-specific.

What would settle it

A head-to-head evaluation on tasks that require precise alignment across modalities, such as generating images from spoken instructions, to check whether quality falls below the dense baseline at matched compute.

Original abstract

The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality (including feed-forward networks, attention matrices, and layer normalization), enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).
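The percentages in the abstract are fractions of the dense baseline's cost, not savings. The short sketch below converts each quoted fraction into the corresponding saving and reduction factor. The dictionary labels and the helper name `summarize` are illustrative choices; the numbers are the ones quoted in the abstract.

```python
# Fractions of dense-baseline cost quoted in the abstract.
reported_fraction = {
    "Chameleon 7B text+image, FLOPs": 0.558,
    "Chameleon with speech, speech quality, FLOPs": 0.372,
    "Transfusion 7B, image quality, FLOPs": 1.0 / 3.0,
    "Image quality, wall-clock time": 0.472,
    "Text quality, wall-clock time": 0.756,
}

def summarize(fraction):
    """Return (saving, reduction factor) for a cost fraction of the baseline."""
    return 1.0 - fraction, 1.0 / fraction

for name, fraction in reported_fraction.items():
    saving, factor = summarize(fraction)
    print(f"{name}: {saving:.1%} saved, {factor:.2f}x less than dense")
```

For example, 55.8% of the FLOPs is a 44.2% saving, or a reduction factor of about 1.79, and 47.2% of the wall-clock time is a 52.8% saving.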

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Mixture-of-Transformers (MoT), a sparse multi-modal architecture that decouples non-embedding parameters (FFNs, attention matrices, and layer norms) by modality while retaining global self-attention over the full sequence. It reports empirical results on Chameleon (text-image and text-image-speech autoregressive generation) and Transfusion (text-image with differing objectives), claiming that MoT matches or exceeds dense baseline performance at substantially lower compute: 55.8% of the FLOPs for 7B Chameleon text-image, 37.2% of the FLOPs for speech when speech is added, one-third of the FLOPs for 7B Transfusion image metrics, and a 760M MoT outperforming a 1.4B dense baseline on image metrics. In wall-clock terms, MoT reaches dense-baseline image quality in 47.2% of the time and text quality in 75.6% of the time.

Significance. If the performance parity holds under controlled conditions, the work offers a practical route to scaling multi-modal foundation models with lower pretraining costs. The concrete FLOPs and wall-clock measurements across two distinct regimes (Chameleon and Transfusion) and multiple scales constitute a clear strength, as do the direct comparisons to dense baselines at matched or smaller parameter counts.

major comments (2)
  1. [§3] §3 (Architecture): The central design assumes that modality-specific Q/K/V projections and FFNs plus global self-attention preserve all necessary cross-modal interactions present in the shared-parameter dense baseline. Because attention scores between tokens of different modalities are now computed from queries and keys produced by independent linear maps, the subspaces may be misaligned; no analysis of attention patterns across modalities or controlled probing tasks is reported to test whether interaction quality is maintained. This assumption is load-bearing for attributing the reported FLOPs figures (55.8%, 37.2%, and one-third of the dense baseline) to the architecture rather than other factors.
  2. [§4.1] §4.1 (Chameleon experiments): The claim that a 7B MoT matches the dense baseline at 55.8% FLOPs requires confirmation that both models used identical data, tokenization, optimizer schedules, and total training steps. Any deviation in effective training compute or regularization would undermine the direct attribution of parity to the sparse design.
minor comments (2)
  1. [Table 1] Table 1 and associated text: the exact breakdown of FLOPs into embedding vs. non-embedding components should be stated explicitly so readers can verify the 55.8% and 37.2% figures independently.
  2. [Figure 3] Figure 3 (wall-clock profiling): adding standard deviation across multiple runs or at least two independent seeds would strengthen the reported wall-clock figures (47.2% and 75.6% of the dense baseline's time).
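Minor comment 1 asks for an explicit compute breakdown. The sketch below shows the kind of accounting a reader might attempt. It is not the paper's Table 1. It assumes a two-matrix FFN, ignores biases, embeddings, and the sequence-length-dependent cost of attention scores, and uses the common approximation of 2 forward-pass FLOPs per matrix weight per token. The function names and all sizes are hypothetical.

```python
def block_weights(d_model, d_ff):
    """Return (matmul weights, norm weights) for one block, biases ignored."""
    attention = 4 * d_model * d_model  # Wq, Wk, Wv, Wo
    ffn = 2 * d_model * d_ff           # up-projection and down-projection
    norms = 2 * d_model                # two layer-norm gain vectors
    return attention + ffn, norms

def accounting(n_layers, d_model, d_ff, n_modalities):
    """Compare stored and per-token active weights for dense vs. MoT."""
    matmul, norms = block_weights(d_model, d_ff)
    per_copy = n_layers * (matmul + norms)
    return {
        "dense_params": per_copy,
        # MoT stores one full copy of the non-embedding weights per modality.
        "mot_params": n_modalities * per_copy,
        # Each token is routed to exactly one modality's copy.
        "active_params_per_token": per_copy,
        # Assumed approximation: 2 FLOPs per matmul weight per token.
        "approx_matmul_flops_per_token": 2 * n_layers * matmul,
    }
```

Under these assumptions, making the parameters modality-specific multiplies the stored non-embedding weights by the number of modalities but leaves the weights touched per token unchanged. The sketch does not reproduce the 55.8% or 37.2% figures, which are training-compute comparisons reported by the paper.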

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below with clarifications based on the manuscript and indicate revisions where appropriate to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [§3] §3 (Architecture): The central design assumes that modality-specific Q/K/V projections and FFNs plus global self-attention preserve all necessary cross-modal interactions present in the shared-parameter dense baseline. Because attention scores between tokens of different modalities are now computed from queries and keys produced by independent linear maps, the subspaces may be misaligned; no analysis of attention patterns across modalities or controlled probing tasks is reported to test whether interaction quality is maintained. This assumption is load-bearing for attributing the reported FLOPs figures (55.8%, 37.2%, and one-third of the dense baseline) to the architecture rather than other factors.

    Authors: We agree that verifying the quality of cross-modal interactions is important for attributing performance to the architecture. In MoT, modality-specific projections are used for Q/K/V and FFNs, but self-attention remains fully global over the concatenated sequence, allowing tokens from different modalities to directly attend to one another. This preserves the mechanism for learning cross-modal alignments, unlike fully decoupled models. Our results on Chameleon and Transfusion show that MoT matches or exceeds dense baselines on multi-modal tasks, which would be unlikely if interactions were substantially degraded. That said, we acknowledge the value of direct evidence and will add attention pattern analysis (e.g., average cross-modal attention scores and visualizations) plus a small probing experiment in the revised manuscript to explicitly compare interaction quality with the dense baseline. revision: yes

  2. Referee: [§4.1] §4.1 (Chameleon experiments): The claim that a 7B MoT matches the dense baseline at 55.8% FLOPs requires confirmation that both models used identical data, tokenization, optimizer schedules, and total training steps. Any deviation in effective training compute or regularization would undermine the direct attribution of parity to the sparse design.

    Authors: Both the 7B dense baseline and 7B MoT were trained with exactly the same data mixture, tokenization, optimizer (AdamW), learning-rate schedule, batch size, and total number of training steps. This controlled setup is described in Section 4.1 and the experimental details appendix; the only difference is the architecture itself. FLOPs are measured via standard forward-pass accounting on the same hardware, and wall-clock times are reported from identical training runs. We will add a short explicit statement in Section 4.1 reiterating the matched training protocol to make this clearer for readers. revision: partial
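The "average cross-modal attention scores" proposed in response 1 could be computed along the following lines. This is a hypothetical metric sketch, not something taken from the manuscript. It assumes an attention matrix whose rows sum to 1 and a modality tag per token; the function name is illustrative.

```python
def cross_modal_attention_mass(attn, mods):
    """Average share of attention each query modality places on other modalities.

    attn[i][j] is the weight query token i assigns to key token j (rows sum
    to 1; under causal masking, entries with j > i are 0). mods[i] is the
    modality tag of token i.
    """
    totals = {}
    counts = {}
    for i, row in enumerate(attn):
        # Attention mass that query i spends on keys of a different modality.
        cross = sum(w for j, w in enumerate(row) if mods[j] != mods[i])
        totals[mods[i]] = totals.get(mods[i], 0.0) + cross
        counts[mods[i]] = counts.get(mods[i], 0) + 1
    return {m: totals[m] / counts[m] for m in totals}
```

Computing this per layer and per head for both the MoT and dense models, on the same inputs, would give one direct comparison of how much each model attends across modalities. It would not by itself show whether that attention is well aligned, which is why the referee also asks for probing tasks.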

Circularity Check

0 steps flagged

No significant circularity; results are direct empirical comparisons

Full rationale

The paper proposes the MoT architecture and reports measured performance and FLOPs reductions against dense baselines in Chameleon and Transfusion settings. These outcomes are obtained through training and evaluation rather than any derivation that reduces a claimed prediction to a fitted parameter or self-citation by construction. No equations define target metrics as functions of inputs chosen to match them, and the central claims rest on experimental data rather than self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces an architectural design choice rather than new physical entities or fitted constants; the main unstated premises are standard transformer training assumptions and the sufficiency of global attention for cross-modal fusion.

axioms (1)
  • domain assumption: Standard transformer self-attention and feed-forward blocks can be trained end-to-end with modality-specific parameter copies without loss of cross-modal capability.
    Invoked when claiming that decoupling non-embedding parameters preserves performance.

pith-pipeline@v0.9.0 · 5856 in / 1403 out tokens · 26665 ms · 2026-05-18T02:44:13.719788+00:00 · methodology

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  2. Action Emergence from Streaming Intent

    cs.RO 2026-05 unverdicted novelty 7.0

    A new VLA model called SI uses a four-step chain-of-thought to derive driving intent and applies it via classifier-free guidance to a flow-matching trajectory generator, showing competitive Waymo scores and intent-con...

  3. NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

  4. Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.

  5. Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.

  6. ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

    cs.CV 2025-12 unverdicted novelty 7.0

    ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on ...

  7. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

  8. Action Emergence from Streaming Intent

    cs.RO 2026-05 unverdicted novelty 6.0

    Streaming Intent lets a VLA model derive driving intent via streamed chain-of-thought reasoning and use it to steer a flow-matching action head, yielding competitive Waymo scores plus intent-based trajectory control w...

  9. SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.

  10. Meta-CoT: Enhancing Granularity and Generalization in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

  11. DriveLaW: Unifying Planning and Video Generation in a Latent Driving World

    cs.CV 2025-12 unverdicted novelty 6.0

    DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.

  12. F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

    cs.RO 2025-09 unverdicted novelty 6.0

    F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.

  13. Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

    cs.CV 2025-05 unverdicted novelty 6.0

    Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...

  14. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  15. Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

    cs.CV 2026-05 unverdicted novelty 5.0

    Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.

  16. GR-3 Technical Report

    cs.RO 2025-07 unverdicted novelty 5.0

    GR-3 is a VLA model that generalizes to novel objects, environments, and abstract instructions, outperforms the π0 baseline, and integrates with the new ByteMini bi-manual mobile robot.

  17. OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL

    cs.RO 2026-04 unverdicted novelty 4.0

    OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.

  18. Wan-Image: Pushing the Boundaries of Generative Visual Intelligence

    cs.CV 2026-04 unverdicted novelty 3.0

    Wan-Image is a unified multi-modal system that integrates LLMs and diffusion transformers to deliver professional-grade image generation features including complex typography, multi-subject consistency, and precise ed...
