arxiv: 2411.04996 · v2 · pith:V26QLI5Dnew · submitted 2024-11-07 · 💻 cs.CL

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin


Pith reviewed 2026-05-18 02:44 UTC · model grok-4.3

classification 💻 cs.CL

keywords mixture-of-transformers · multi-modal models · sparse architecture · compute efficiency · modality-specific parameters · transformer pretraining · foundation models


The pith

Mixture-of-Transformers matches dense multi-modal performance at roughly one-third to just over half of the pretraining FLOPs by using modality-specific parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Mixture-of-Transformers as a sparse architecture for models that jointly handle text, images, and speech. It assigns separate parameters for feed-forward networks, attention matrices, and layer normalizations to each modality while keeping self-attention global across the full sequence. This separation targets the high cost of pretraining large multi-modal systems. Experiments demonstrate that the approach reaches comparable or better quality than standard dense models at substantially lower floating-point operations and wall-clock time. The result points to a direct way to scale these models without proportional growth in resources.

Core claim

By making non-embedding parameters modality-specific while keeping self-attention global over the full sequence, the architecture matches a dense baseline transformer on multi-modal tasks while using roughly one-third to 56% of the baseline's pretraining FLOPs in the reported settings.
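To make the sparsity concrete, the sketch below counts non-embedding parameters for a dense block and for a stack with per-modality copies. The dimensions, the 4x feed-forward width, and the function names are illustrative assumptions, not the paper's configuration. It assumes each token is routed only to its own modality's copy, so total parameters grow with the number of modalities while the parameters touched by any one token stay at the dense count.

```python
def block_params(d_model: int, ffn_mult: int = 4) -> int:
    """Non-embedding parameters of one transformer block (biases ignored)."""
    attn = 4 * d_model * d_model              # W_q, W_k, W_v, W_o
    ffn = 2 * d_model * (ffn_mult * d_model)  # up- and down-projection
    norms = 2 * d_model                       # two layer-norm gain vectors
    return attn + ffn + norms


def mot_params(d_model: int, n_layers: int, n_modalities: int) -> dict:
    """Total vs. per-token active parameters with one copy per modality."""
    per_block = block_params(d_model)
    return {
        "total": n_layers * n_modalities * per_block,
        "active_per_token": n_layers * per_block,
    }


# Hypothetical toy configuration (not taken from the paper).
dense_total = 4 * block_params(512)
mot = mot_params(d_model=512, n_layers=4, n_modalities=3)
print(dense_total, mot)
```

Under this accounting a training step costs about the same per token as in the dense model, so a FLOPs saving would have to come from reaching the target quality in fewer steps. That is a reading of the sketch, not a statement made on this page.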

What carries the argument

Mixture-of-Transformers, which decouples feed-forward networks, attention matrices, and layer normalizations into per-modality copies while applying shared self-attention over the entire mixed input sequence.
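The mechanism can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the single attention head, pre-norm layout, causal mask, ReLU feed-forward network, initialization scale, and all names (`route`, `mot_layer`, `init_params`) are assumptions made for the example. Each token picks its layer-norm gains, Q/K/V/output projections, and feed-forward weights by modality id, while attention scores are computed over the whole mixed sequence.

```python
import numpy as np

def layer_norm(x, gain, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mu) / np.sqrt(var + eps)

def route(x, modality, params, fn):
    """Apply fn with modality m's parameters to the rows of x tagged m."""
    out = np.zeros_like(x)
    for m, p in enumerate(params):
        rows = modality == m
        out[rows] = fn(x[rows], p)
    return out

def mot_layer(x, modality, params):
    """Single-head, pre-norm, causal layer with per-modality weights."""
    T, d = x.shape
    h = route(x, modality, params, lambda z, p: layer_norm(z, p["g1"]))
    q = route(h, modality, params, lambda z, p: z @ p["Wq"])
    k = route(h, modality, params, lambda z, p: z @ p["Wk"])
    v = route(h, modality, params, lambda z, p: z @ p["Wv"])
    # Global self-attention: every token scores against every earlier token,
    # regardless of which modality produced its query, key, or value.
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    x = x + route(attn @ v, modality, params, lambda z, p: z @ p["Wo"])
    h = route(x, modality, params, lambda z, p: layer_norm(z, p["g2"]))
    ffn = lambda z, p: np.maximum(z @ p["W1"], 0.0) @ p["W2"]
    return x + route(h, modality, params, ffn)

def init_params(rng, d, d_ff, n_modalities):
    def one():
        w = lambda *shape: rng.normal(0.0, 0.02, shape)
        return {"g1": np.ones(d), "g2": np.ones(d),
                "Wq": w(d, d), "Wk": w(d, d), "Wv": w(d, d), "Wo": w(d, d),
                "W1": w(d, d_ff), "W2": w(d_ff, d)}
    return [one() for _ in range(n_modalities)]

rng = np.random.default_rng(0)
d = 16
params = init_params(rng, d, d_ff=64, n_modalities=3)
x = rng.normal(size=(6, d))
modality = np.array([0, 0, 1, 1, 2, 0])  # 0 = text, 1 = image, 2 = speech
y = mot_layer(x, modality, params)
print(y.shape)
```

The per-modality loop in `route` keeps the example short. A practical implementation would more likely group tokens by modality and run one batched matrix multiply per group.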

Load-bearing premise

Global self-attention alone can preserve all necessary cross-modal interactions even after most parameters are made modality-specific.

What would settle it

A head-to-head evaluation on tasks that require precise alignment across modalities, such as generating images from spoken instructions, to check whether quality falls below the dense baseline at matched compute.

Original abstract

The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).
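The percentages in the abstract are fractions of the dense baseline's cost, not savings. As a small worked example, the snippet below converts each quoted fraction into the equivalent reduction factor (1 / fraction). The dictionary labels and the helper name `reduction_factor` are our own shorthand, not terms from the paper.

```python
# Fractions of dense-baseline cost quoted in the abstract.
reported = {
    "Chameleon 7B, text+image (FLOPs)": 0.558,
    "Chameleon with speech, speech quality (FLOPs)": 0.372,
    "System profile, image quality (wall-clock)": 0.472,
    "System profile, text quality (wall-clock)": 0.756,
}


def reduction_factor(fraction: float) -> float:
    """How many times cheaper a run is when it costs `fraction` of the baseline."""
    return 1.0 / fraction


for name, frac in reported.items():
    saved = 1.0 - frac
    print(f"{name}: {frac:.1%} of dense cost, {saved:.1%} saved, "
          f"{reduction_factor(frac):.2f}x reduction")
```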

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MoT delivers measurable FLOPs cuts in multi-modal training by making non-embedding parameters modality-specific while retaining global attention, with solid empirical numbers but an open question on cross-modal interaction strength.


The core takeaway is that this Mixture-of-Transformers design splits feed-forward networks, attention matrices, and layer norms by modality while running one shared attention over the full sequence. In the reported Chameleon 7B runs it matches dense performance at 55.8% of the FLOPs; with speech added, it reaches comparable speech performance at 37.2%. The Transfusion results are sharper still: a 760M MoT beats a 1.4B dense baseline on image metrics, and a 7B MoT matches the dense image numbers at roughly one-third of the compute. Wall-clock numbers on A100s are also given, which is useful for practitioners.

Referee Report

2 major / 2 minor

Summary. The paper introduces Mixture-of-Transformers (MoT), a sparse multi-modal architecture that decouples non-embedding parameters (FFNs, attention matrices, and layer norms) by modality while retaining global self-attention over the full sequence. It reports empirical results on Chameleon (text-image and text-image-speech autoregressive generation) and Transfusion (text-image with differing objectives), claiming that MoT matches or exceeds dense baseline performance at substantially lower compute: 55.8% of the FLOPs for 7B Chameleon text-image, 37.2% of the FLOPs for speech performance when speech is added, one-third of the FLOPs for 7B Transfusion image metrics, and a 760M MoT outperforming a 1.4B dense baseline on image metrics. System profiling reports dense-baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time.

Significance. If the performance parity holds under controlled conditions, the work offers a practical route to scaling multi-modal foundation models with lower pretraining costs. The concrete FLOPs and wall-clock measurements across two distinct regimes (Chameleon and Transfusion) and multiple scales constitute a clear strength, as do the direct comparisons to dense baselines at matched or smaller parameter counts.

major comments (2)

[§3] §3 (Architecture): The central design assumes that modality-specific Q/K/V projections and FFNs plus global self-attention preserve all necessary cross-modal interactions present in the shared-parameter dense baseline. Because cross-modal attention scores are now computed between vectors from independent linear maps, the subspaces may be misaligned; no analysis of attention patterns across modalities or controlled probing tasks is reported to test whether interaction quality is maintained. This assumption is load-bearing for attributing the reported FLOPs reductions (55.8%, 37.2%, one-third) to the architecture rather than other factors.
[§4.1] §4.1 (Chameleon experiments): The claim that a 7B MoT matches the dense baseline at 55.8% FLOPs requires confirmation that both models used identical data, tokenization, optimizer schedules, and total training steps. Any deviation in effective training compute or regularization would undermine the direct attribution of parity to the sparse design.
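One simple form of the attention-pattern analysis requested in the first major comment is the share of attention mass that crosses modality boundaries. The sketch below is an assumed diagnostic, not a method from the paper: `cross_modal_mass` and the toy attention matrix are hypothetical, and a real analysis would average over heads, layers, and many sequences for both the MoT and dense models.

```python
def cross_modal_mass(attn, modality):
    """Mean fraction of each query's attention that lands on another modality.

    attn: row-stochastic attention matrix as a list of rows (queries x keys).
    modality: modality label for each position in the sequence.
    """
    fractions = []
    for q, row in enumerate(attn):
        cross = sum(w for k, w in enumerate(row) if modality[k] != modality[q])
        fractions.append(cross)
    return sum(fractions) / len(fractions)


# Hypothetical 3-token causal example: two text tokens, then one image token.
attn = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
]
modality = ["text", "text", "image"]
print(cross_modal_mass(attn, modality))
```

Only the image token can attend across modalities in this toy sequence, so the average is one third of its cross-modal share. Comparing this statistic between matched MoT and dense checkpoints would be one way to probe the misalignment concern.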

minor comments (2)

[Table 1] Table 1 and associated text: the exact breakdown of FLOPs into embedding vs. non-embedding components should be stated explicitly so readers can verify the 55.8% and 37.2% figures independently.
[Figure 3] Figure 3 (wall-clock profiling): adding standard deviation across multiple runs, or at least two independent seeds, would strengthen the reported wall-clock figures (47.2% of dense time for image quality and 75.6% for text quality).
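The second minor comment amounts to reporting a spread alongside each wall-clock ratio. A minimal sketch using Python's standard `statistics` module follows; the function name and the timing values are placeholders chosen for easy arithmetic, not measurements from the paper.

```python
from statistics import mean, stdev


def summarize_ratio(mot_hours, dense_hours):
    """Mean and sample standard deviation of per-seed MoT/dense time ratios."""
    ratios = [m / d for m, d in zip(mot_hours, dense_hours)]
    return mean(ratios), stdev(ratios)  # stdev needs at least two seeds


# Placeholder timings for three seeds (NOT from the paper).
mot_hours = [5.0, 5.5, 4.5]
dense_hours = [10.0, 10.0, 10.0]
avg, sd = summarize_ratio(mot_hours, dense_hours)
print(f"wall-clock ratio: {avg:.3f} +/- {sd:.3f}")
```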

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below with clarifications based on the manuscript and indicate revisions where appropriate to strengthen the presentation of our results.


Referee: [§3] §3 (Architecture): The central design assumes that modality-specific Q/K/V projections and FFNs plus global self-attention preserve all necessary cross-modal interactions present in the shared-parameter dense baseline. Because cross-modal attention scores are now computed between vectors from independent linear maps, the subspaces may be misaligned; no analysis of attention patterns across modalities or controlled probing tasks is reported to test whether interaction quality is maintained. This assumption is load-bearing for attributing the reported FLOPs reductions (55.8%, 37.2%, one-third) to the architecture rather than other factors.

Authors: We agree that verifying the quality of cross-modal interactions is important for attributing performance to the architecture. In MoT, modality-specific projections are used for Q/K/V and FFNs, but self-attention remains fully global over the concatenated sequence, allowing tokens from different modalities to directly attend to one another. This preserves the mechanism for learning cross-modal alignments, unlike fully decoupled models. Our results on Chameleon and Transfusion show that MoT matches or exceeds dense baselines on multi-modal tasks, which would be unlikely if interactions were substantially degraded. That said, we acknowledge the value of direct evidence and will add attention pattern analysis (e.g., average cross-modal attention scores and visualizations) plus a small probing experiment in the revised manuscript to explicitly compare interaction quality with the dense baseline. revision: yes
Referee: [§4.1] §4.1 (Chameleon experiments): The claim that a 7B MoT matches the dense baseline at 55.8% FLOPs requires confirmation that both models used identical data, tokenization, optimizer schedules, and total training steps. Any deviation in effective training compute or regularization would undermine the direct attribution of parity to the sparse design.

Authors: Both the 7B dense baseline and 7B MoT were trained with exactly the same data mixture, tokenization, optimizer (AdamW), learning-rate schedule, batch size, and total number of training steps. This controlled setup is described in Section 4.1 and the experimental details appendix; the only difference is the architecture itself. FLOPs are measured via standard forward-pass accounting on the same hardware, and wall-clock times are reported from identical training runs. We will add a short explicit statement in Section 4.1 reiterating the matched training protocol to make this clearer for readers. revision: partial
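The matched-protocol claim in the second response can be checked mechanically by diffing the two training configurations. The sketch below assumes configs are available as flat dictionaries. Every key and value is hypothetical except the AdamW optimizer named in the response, and `config_diff` is our own helper.

```python
def config_diff(a: dict, b: dict) -> dict:
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Hypothetical configs; only "AdamW" comes from the rebuttal text.
dense = {
    "arch": "dense",
    "optimizer": "AdamW",
    "data_mix": "mix-v1",
    "tokenizer": "tok-v1",
    "lr_schedule": "cosine",
    "batch_size": 1024,
    "steps": 100_000,
}
mot = dict(dense, arch="mot")

# A matched protocol should differ in the architecture field only.
print(config_diff(dense, mot))
```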

Circularity Check

0 steps flagged

No significant circularity; results are direct empirical comparisons


The paper proposes the MoT architecture and reports measured performance and FLOPs reductions against dense baselines in Chameleon and Transfusion settings. These outcomes are obtained through training and evaluation rather than any derivation that reduces a claimed prediction to a fitted parameter or self-citation by construction. No equations define target metrics as functions of inputs chosen to match them, and the central claims rest on experimental data rather than self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces an architectural design choice rather than new physical entities or fitted constants; the main unstated premises are standard transformer training assumptions and the sufficiency of global attention for cross-modal fusion.

axioms (1)

domain assumption: Standard transformer self-attention and feed-forward blocks can be trained end-to-end with modality-specific parameter copies without loss of cross-modal capability.
Invoked when claiming that decoupling non-embedding parameters preserves performance.

pith-pipeline@v0.9.0 · 5856 in / 1403 out tokens · 26665 ms · 2026-05-18T02:44:13.719788+00:00 · methodology


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
cs.RO 2026-05 unverdicted novelty 7.0

MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

Action Emergence from Streaming Intent
cs.RO 2026-05 unverdicted novelty 7.0

A new VLA model called SI uses a four-step chain-of-thought to derive driving intent and applies it via classifier-free guidance to a flow-matching trajectory generator, showing competitive Waymo scores and intent-con...

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
cs.RO 2026-05 unverdicted novelty 7.0

NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 7.0

MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.

Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models
cs.CV 2026-04 unverdicted novelty 7.0

Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
cs.CV 2025-12 unverdicted novelty 7.0

ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on ...

MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
cs.RO 2026-05 unverdicted novelty 6.0

MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

Action Emergence from Streaming Intent
cs.RO 2026-05 unverdicted novelty 6.0

Streaming Intent lets a VLA model derive driving intent via streamed chain-of-thought reasoning and use it to steer a flow-matching action head, yielding competitive Waymo scores plus intent-based trajectory control w...

SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness
cs.CV 2026-04 unverdicted novelty 6.0

SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.

Meta-CoT: Enhancing Granularity and Generalization in Image Editing
cs.CV 2026-04 unverdicted novelty 6.0

Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
cs.CV 2025-12 unverdicted novelty 6.0

DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.

F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
cs.RO 2025-09 unverdicted novelty 6.0

F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.

Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
cs.CV 2025-05 unverdicted novelty 6.0

Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
cs.CV 2026-05 unverdicted novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
cs.CV 2026-05 unverdicted novelty 5.0

Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.

GR-3 Technical Report
cs.RO 2025-07 unverdicted novelty 5.0

GR-3 is a VLA model that generalizes to novel objects, environments, and abstract instructions, outperforms the π0 baseline, and integrates with the new ByteMini bi-manual mobile robot.

OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL
cs.RO 2026-04 unverdicted novelty 4.0

OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.

Wan-Image: Pushing the Boundaries of Generative Visual Intelligence
cs.CV 2026-04 unverdicted novelty 3.0

Wan-Image is a unified multi-modal system that integrates LLMs and diffusion transformers to deliver professional-grade image generation features including complex typography, multi-subject consistency, and precise ed...
