MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
method 1
polarities
background 1
representative citing papers
Presents open-source 7B models for million-token video and language understanding via Blockwise RingAttention, setting new benchmarks in retrieval and long video tasks.
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
citing papers explorer
- Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
  MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.
- World Model on Million-Length Video And Language With Blockwise RingAttention
  Presents open-source 7B models for million-token video and language understanding via Blockwise RingAttention, setting new benchmarks in retrieval and long video tasks.
- A Survey on Multimodal Large Language Models
  This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.