AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

Andrea Madotto; Anuj Kumar; Babak Damavandi; Chun-Fu Yeh; Kavya Srinet; Matt Smith; Peyman Heidari; Prakash Murugesan; Seungwhan Moon; Shashank Jain

arxiv: 2309.16058 · v1 · pith:APF7BJU7new · submitted 2023-09-27 · 💻 cs.LG · cs.CL · cs.CV

Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Tushar Nagarajan, Matt Smith, Shashank Jain, Chun-Fu Yeh, Prakash Murugesan, Peyman Heidari, Yue Liu, Kavya Srinet, Babak Damavandi, Anuj Kumar

classification 💻 cs.LG cs.CL cs.CV

keywords model anymal multimodal any-modality augmented diverse language signals

Abstract

We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e. text, image, video, audio, IMU motion sensor), and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of the state-of-the-art LLMs including LLaMA-2 (70B), and converts modality-specific signals to the joint textual space through a pre-trained aligner module. To further strengthen the multimodal LLM's capabilities, we fine-tune the model with a multimodal instruction set manually collected to cover diverse topics and tasks beyond simple QAs. We conduct comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks.
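
The abstract does not say how the aligner module is built. As a rough illustration only, the sketch below assumes the aligner is a single learned linear projection that maps encoder features to the LLM's embedding width, and that the projected vectors are prepended to the text token embeddings. The function names, dimensions, and random values are all hypothetical and are not taken from the paper.

```python
import random


def project_to_token_space(features, weight, bias):
    """Map each modality feature vector to the LLM embedding width.

    features: list of vectors, each of length d_enc
    weight:   list of d_llm rows, each of length d_enc (hypothetical aligner weights)
    bias:     list of length d_llm
    """
    projected = []
    for vec in features:
        row = [
            sum(x * w for x, w in zip(vec, w_row)) + b
            for w_row, b in zip(weight, bias)
        ]
        projected.append(row)
    return projected


def build_llm_input(modality_tokens, text_embeddings):
    """Prepend projected modality tokens to the text token embeddings."""
    return modality_tokens + text_embeddings


# Toy dimensions, chosen only for illustration.
D_ENC, D_LLM = 4, 6
rng = random.Random(0)

# Stand-ins for a frozen encoder's output (3 feature vectors) and 5 text tokens.
image_features = [[rng.gauss(0, 1) for _ in range(D_ENC)] for _ in range(3)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(D_LLM)] for _ in range(5)]

# Hypothetical aligner parameters; in practice these would be learned.
weight = [[rng.gauss(0, 0.1) for _ in range(D_ENC)] for _ in range(D_LLM)]
bias = [0.0] * D_LLM

modality_tokens = project_to_token_space(image_features, weight, bias)
llm_input = build_llm_input(modality_tokens, text_embeddings)
print(len(llm_input), len(llm_input[0]))
```

In this sketch the language model would see one sequence of 3 projected modality vectors followed by 5 text embeddings, all of the same width.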

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Survey on Multimodal Large Language Models
cs.CV 2023-06 accept novelty 3.0

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages
cs.CL 2026-05 unverdicted novelty 2.0

A tutorial synthesizing foundations, recent models such as PALO and Maya, and low-cost methods for tri-modal multilingual AI in resource-constrained settings.