X-LLM: Bootstrapping advanced large language models by treating multi-modalities as foreign languages.arXiv preprint arXiv:2305.04160

Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, Bo Xu · 2023 · arXiv 2305.04160

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

read on arXiv browse 9 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

eess.AS · 2025-07-31 · unverdicted · novelty 7.0

MECAT is a multi-expert benchmark for audio AI offering fine-grained captions and QA pairs generated via expert models and LLM reasoning, paired with the DATE metric that combines semantic similarity and cross-sample discriminability to favor detailed outputs.

RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

cs.CV · 2026-04-16 · unverdicted · novelty 6.0

RaTA-Tool retrieves suitable external tools for multimodal queries by matching generated task descriptions against tool metadata, supported by a new Hugging Face-derived dataset and DPO optimization.

Mind the Gap No More: Achieving Zero-Gap Multimodal Integration via One Tokenizer

q-bio.GN · 2026-01-21 · unverdicted · novelty 6.0

One Tokenizer achieves zero-gap multimodal integration by mapping all inputs to a unified token vocabulary, allowing native LLMs to perform deep cross-modal reasoning without modular encoders or fusion layers, and outperforming encoder-based baselines on DNA-text tasks.

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

cs.CV · 2023-11-16 · unverdicted · novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

SALMONN: Towards Generic Hearing Abilities for Large Language Models

cs.SD · 2023-10-20 · unverdicted · novelty 6.0

SALMONN integrates speech and audio encoders with a text-based LLM to process general audio inputs, achieve competitive results on trained tasks, and exhibit emergent cross-modal abilities.

DFLOP: A Data-driven Framework for Multimodal LLM Training Pipeline Optimization

cs.DC · 2026-03-26 · unverdicted · novelty 4.0

DFLOP is a data-driven framework that profiles data-induced computation variance and uses predictive scheduling to balance workloads in multimodal LLM training pipelines, claiming up to 3.6x faster training than existing frameworks.

The Rise and Potential of Large Language Model Based Agents: A Survey

cs.AI · 2023-09-14 · accept · novelty 4.0

The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

On The Landscape of Spoken Language Models: A Comprehensive Survey

cs.CL · 2025-04-11 · unverdicted · novelty 3.0

A literature survey that organizes spoken language models by architecture, training, and evaluation choices and identifies key challenges and future directions.

A Survey on Multimodal Large Language Models

cs.CV · 2023-06-23 · accept · novelty 3.0

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

citing papers explorer

Showing 9 of 9 citing papers.

MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks eess.AS · 2025-07-31 · unverdicted · none · ref 22
MECAT is a multi-expert benchmark for audio AI offering fine-grained captions and QA pairs generated via expert models and LLM reasoning, paired with the DATE metric that combines semantic similarity and cross-sample discriminability to favor detailed outputs.
RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models cs.CV · 2026-04-16 · unverdicted · none · ref 5
RaTA-Tool retrieves suitable external tools for multimodal queries by matching generated task descriptions against tool metadata, supported by a new Hugging Face-derived dataset and DPO optimization.
Mind the Gap No More: Achieving Zero-Gap Multimodal Integration via One Tokenizer q-bio.GN · 2026-01-21 · unverdicted · none · ref 2
One Tokenizer achieves zero-gap multimodal integration by mapping all inputs to a unified token vocabulary, allowing native LLMs to perform deep cross-modal reasoning without modular encoders or fusion layers, and outperforming encoder-based baselines on DNA-text tasks.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection cs.CV · 2023-11-16 · unverdicted · none · ref 113
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
SALMONN: Towards Generic Hearing Abilities for Large Language Models cs.SD · 2023-10-20 · unverdicted · none · ref 7
SALMONN integrates speech and audio encoders with a text-based LLM to process general audio inputs, achieve competitive results on trained tasks, and exhibit emergent cross-modal abilities.
DFLOP: A Data-driven Framework for Multimodal LLM Training Pipeline Optimization cs.DC · 2026-03-26 · unverdicted · none · ref 6
DFLOP is a data-driven framework that profiles data-induced computation variance and uses predictive scheduling to balance workloads in multimodal LLM training pipelines, claiming up to 3.6x faster training than existing frameworks.
The Rise and Potential of Large Language Model Based Agents: A Survey cs.AI · 2023-09-14 · accept · none · ref 297
The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
On The Landscape of Spoken Language Models: A Comprehensive Survey cs.CL · 2025-04-11 · unverdicted · none · ref 5
A literature survey that organizes spoken language models by architecture, training, and evaluation choices and identifies key challenges and future directions.
A Survey on Multimodal Large Language Models cs.CV · 2023-06-23 · accept · none · ref 72
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

X-LLM: Bootstrapping advanced large language models by treating multi-modalities as foreign languages.arXiv preprint arXiv:2305.04160

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer