Talkplay: Multimodal music recommendation with large language models

Seungheon Doh, Keunwoo Choi, Juhan Nam · 2025 · cs.IR · arXiv 2502.13713

4 Pith papers cite this work. Polarity classification is still indexing.


abstract

We present TALKPLAY, a novel multimodal music recommendation system that reformulates recommendation as a token generation problem using large language models (LLMs). By leveraging the instruction-following and natural language generation capabilities of LLMs, our system effectively recommends music from diverse user queries while generating contextually relevant responses. While pretrained LLMs are primarily designed for text modality, TALKPLAY extends their scope through two key innovations: a multimodal music tokenizer that encodes audio features, lyrics, metadata, semantic tags, and playlist co-occurrence signals; and a vocabulary expansion mechanism that enables unified processing and generation of both linguistic and music-relevant tokens. By integrating the recommendation system directly into the LLM architecture, TALKPLAY transforms conventional systems by: (1) unifying previous two-stage conversational recommendation systems (recommendation engines and dialogue managers) into a cohesive end-to-end system, (2) effectively utilizing long conversational context for recommendation while maintaining strong performance in extended multi-turn interactions, and (3) generating natural language responses for seamless user interaction. Our qualitative and quantitative evaluation demonstrates that TALKPLAY significantly outperforms unimodal approaches based solely on text or listening history in both recommendation performance and conversational naturalness.
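
The vocabulary expansion idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the token format (`<audio_2>`), the modality names, the codebook size, and the helper functions `expand_vocabulary` and `track_to_tokens` are all assumptions made for illustration. It only shows how music-relevant tokens could share one ID space with text tokens, so that a track becomes a short token sequence a language model could in principle generate.

```python
from typing import Dict, List

# Hypothetical modality names, loosely following the signals listed in the abstract.
MODALITIES = ["audio", "lyrics", "metadata", "tags", "playlist"]


def expand_vocabulary(text_vocab: Dict[str, int], codebook_size: int) -> Dict[str, int]:
    """Return a copy of text_vocab with one new token per (modality, code) pair.

    New token IDs are appended after the largest existing text token ID.
    """
    vocab = dict(text_vocab)
    next_id = max(vocab.values()) + 1 if vocab else 0
    for modality in MODALITIES:
        for code in range(codebook_size):
            vocab[f"<{modality}_{code}>"] = next_id
            next_id += 1
    return vocab


def track_to_tokens(codes: Dict[str, int]) -> List[str]:
    """Render one track as a fixed-order sequence of music tokens (assumed format)."""
    return [f"<{m}_{codes[m]}>" for m in MODALITIES]


# Toy text vocabulary; a real LLM tokenizer would have tens of thousands of entries.
text_vocab = {"<pad>": 0, "play": 1, "something": 2, "calm": 3}
vocab = expand_vocabulary(text_vocab, codebook_size=4)

# A made-up track whose per-modality codes are assumed to come from some quantizer.
track = {"audio": 2, "lyrics": 0, "metadata": 3, "tags": 1, "playlist": 2}
tokens = track_to_tokens(track)
token_ids = [vocab[t] for t in tokens]
```

In this toy setup, recommending a track amounts to emitting its five music tokens in order, and the surrounding natural language response uses the ordinary text tokens from the same vocabulary.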

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

Reddit2Deezer: A Scalable Dataset for Real-World Grounded Conversational Music Recommendation

cs.IR · 2026-05-09 · unverdicted · novelty 7.0

Reddit2Deezer supplies 190k authentic Reddit dialogues grounded in Deezer music entities for scalable conversational music recommendation research.

Expressiveness Limits of Autoregressive Semantic ID Generation in Generative Recommendation

cs.IR · 2026-05-07 · unverdicted · novelty 7.0

Autoregressive semantic ID generation creates tree-induced probability correlations that prevent generative recommenders from capturing simple patterns; Latte adds latent tokens to relax these correlations.

Multimodal Music Recommendation System using LLMs

cs.IR · 2026-05-28 · unverdicted · novelty 5.0

Extending E4SRec with multimodal content features on LastFM-1K yields up to 95% Recall and 79% NDCG gains over ID-only baselines, though naive fusion does not always improve results.

citing papers explorer


DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues cs.AI · 2026-04-09 · unverdicted · none · ref 7 · internal anchor
Reddit2Deezer: A Scalable Dataset for Real-World Grounded Conversational Music Recommendation cs.IR · 2026-05-09 · unverdicted · none · ref 13 · internal anchor
Expressiveness Limits of Autoregressive Semantic ID Generation in Generative Recommendation cs.IR · 2026-05-07 · unverdicted · none · ref 8 · internal anchor
Multimodal Music Recommendation System using LLMs cs.IR · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
