VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.
E- chat: Emotion-sensitive spoken dialogue system with large language models.arXiv preprint arXiv:2401.00475, 2023
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CV 1years
2025 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.