Fine-grained audio-visual joint representations for multimodal large language models

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models · 2023 · arXiv 2310.05863

2 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

baseline 1

citation-polarity summary

baseline 1

representative citing papers

SALMONN: Towards Generic Hearing Abilities for Large Language Models

cs.SD · 2023-10-20 · unverdicted · novelty 6.0

SALMONN integrates speech and audio encoders with a text-based LLM to process general audio inputs, achieve competitive results on trained tasks, and exhibit emergent cross-modal abilities.

Empowering Video Translation using Multimodal Large Language Models

cs.CV · 2026-04-13 · unverdicted · novelty 4.0

The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

citing papers explorer

Showing 2 of 2 citing papers.

SALMONN: Towards Generic Hearing Abilities for Large Language Models
cs.SD · 2023-10-20 · unverdicted · none · ref 68

Empowering Video Translation using Multimodal Large Language Models
cs.CV · 2026-04-13 · unverdicted · none · ref 173