TalkPlay: Multimodal music recommendation with large language models

4 Pith papers cite this work. Polarity classification is still indexing.

abstract

We present TALKPLAY, a novel multimodal music recommendation system that reformulates recommendation as a token generation problem using large language models (LLMs). By leveraging the instruction-following and natural language generation capabilities of LLMs, our system effectively recommends music from diverse user queries while generating contextually relevant responses. While pretrained LLMs are primarily designed for text modality, TALKPLAY extends their scope through two key innovations: a multimodal music tokenizer that encodes audio features, lyrics, metadata, semantic tags, and playlist co-occurrence signals; and a vocabulary expansion mechanism that enables unified processing and generation of both linguistic and music-relevant tokens. By integrating the recommendation system directly into the LLM architecture, TALKPLAY transforms conventional systems by: (1) unifying previous two-stage conversational recommendation systems (recommendation engines and dialogue managers) into a cohesive end-to-end system, (2) effectively utilizing long conversational context for recommendation while maintaining strong performance in extended multi-turn interactions, and (3) generating natural language responses for seamless user interaction. Our qualitative and quantitative evaluation demonstrates that TALKPLAY significantly outperforms unimodal approaches based solely on text or listening history in both recommendation performance and conversational naturalness.
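The vocabulary expansion idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy word-level vocabulary and one hypothetical `<track:ID>` token per track, whereas the abstract describes a multimodal tokenizer covering audio, lyrics, metadata, tags, and playlist co-occurrence. The function names and token format are illustrative assumptions.

```python
from typing import Dict, List


def expand_vocabulary(text_vocab: Dict[str, int], track_ids: List[str]) -> Dict[str, int]:
    """Return a copy of the vocabulary with one new token per track appended.

    The "<track:ID>" format is a hypothetical convention for this sketch.
    """
    vocab = dict(text_vocab)
    next_id = max(vocab.values(), default=-1) + 1
    for track_id in track_ids:
        token = f"<track:{track_id}>"
        if token not in vocab:
            vocab[token] = next_id
            next_id += 1
    return vocab


def recommended_tracks(generated_ids: List[int], vocab: Dict[str, int]) -> List[str]:
    """Extract track IDs from a generated token-ID sequence, in generation order."""
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    prefix = "<track:"
    tracks = []
    for idx in generated_ids:
        token = id_to_token.get(idx, "")
        if token.startswith(prefix) and token.endswith(">"):
            tracks.append(token[len(prefix):-1])
    return tracks


# Toy text vocabulary plus two hypothetical track IDs.
text_vocab = {"play": 0, "some": 1, "jazz": 2}
vocab = expand_vocabulary(text_vocab, ["t001", "t002"])

# A generated sequence that mixes text tokens and track tokens.
print(recommended_tracks([0, 4, 2, 3], vocab))
```

Because text and track tokens share one ID space, a single generated sequence can carry both the natural language reply and the recommended items, which is the sense in which recommendation becomes token generation.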

citation-role summary

background: 1 · method: 1

fields

cs.IR: 3 · cs.AI: 1

years

2026: 4

verdicts

UNVERDICTED: 4

representative citing papers

Multimodal Music Recommendation System using LLMs

cs.IR · 2026-05-28 · unverdicted · novelty 5.0

Extending E4SRec with multimodal content features on LastFM-1K yields gains of up to 95% in Recall and 79% in NDCG over ID-only baselines, though naive fusion does not always improve results.
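For readers unfamiliar with the two metrics named in this summary, the following sketch shows one common way to compute them with binary relevance. The summary does not state the cutoff k or the exact definitions the citing paper uses, so the function names, the cutoff, and the toy data here are assumptions for illustration only.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 has discount log2(2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: one relevant item ranked second out of three.
ranked = ["a", "b", "c"]
relevant = {"b"}
print(recall_at_k(ranked, relevant, 3), ndcg_at_k(ranked, relevant, 3))
```

Recall only checks whether relevant items reach the top-k, while NDCG also rewards placing them higher, so the two metrics can improve by different amounts for the same model change.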
