Evaluating Recommender System Algorithms for Generating Local Music Playlists

Daniel Akimchuk; Douglas Turnbull; Timothy Clerico

arXiv: 1907.08687 · v1 · pith:I7LBAMRXnew · submitted 2019-07-17 · 💻 cs.IR · cs.LG · stat.ML

Pith reviewed 2026-05-24 19:57 UTC · model grok-4.3

classification 💻 cs.IR · cs.LG · stat.ML

keywords local music recommendation · cold-start problem · item-item neighborhood · matrix factorization · long-tail artists · collaborative filtering · million playlist dataset · geographic recommendation

The pith

Neighborhood-based recommendation outperforms matrix factorization for local music playlists from long-tail artists.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares three standard recommender algorithms on the task of generating playlists consisting only of tracks by local artists in eight cities. These local artists are mostly obscure long-tail acts with little or no user preference data, creating a cold-start problem for collaborative filtering. The authors modify the standard evaluation on the Million Playlist Dataset so that each algorithm must rank only the relevant local tracks for each city. Under this setup the item-item neighborhood method performs best, even though alternating least squares and Bayesian personalized ranking usually win on large-scale tasks. A sympathetic reader would care because local live-music scenes depend on surfacing these geographically tied but data-poor artists.

Core claim

Despite the fact that techniques based on matrix factorization (ALS, BPR) typically perform best on large recommendation tasks, the neighborhood-based approach (IIN) performs best for long-tail local music recommendation when the evaluation is restricted to ranking only tracks by local artists for each of the eight different cities.

What carries the argument

The modified evaluation procedure that restricts each algorithm to ranking only tracks by local artists for each city, enabling direct measurement of cold-start performance on geographic long-tail items.
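The restriction described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names `item_item_scores` and `rank_local_tracks`, the use of cosine similarity, the sum-over-seed-tracks scoring rule, and the toy matrix are all assumptions made for the example. The point it shows is that any model's catalog-wide scores are filtered down to one city's local tracks before ranking.

```python
import numpy as np

def item_item_scores(X, seed_track_ids):
    """Score every track by summed cosine similarity to the seed tracks.

    X is a binary playlist-by-track matrix. Cosine similarity and the
    sum-over-seeds rule are assumptions for this sketch.
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0          # avoid division by zero for unseen tracks
    Xn = X / norms
    sim = Xn.T @ Xn                  # track-by-track cosine similarity
    np.fill_diagonal(sim, 0.0)       # a track should not recommend itself
    return sim[:, list(seed_track_ids)].sum(axis=1)

def rank_local_tracks(scores, local_track_ids, seed_track_ids, k=10):
    """Rank only the local tracks of one city, excluding the seed tracks."""
    seeds = {int(t) for t in seed_track_ids}
    candidates = np.array(
        [int(t) for t in local_track_ids if int(t) not in seeds], dtype=int
    )
    if candidates.size == 0:
        return candidates
    # Stable sort on negated scores so ties keep the candidate order.
    order = np.argsort(-scores[candidates], kind="stable")
    return candidates[order][:k]

# Toy data (hypothetical): 4 playlists, 5 tracks; tracks 2, 3, 4 are "local".
X = [[1, 1, 0, 1, 0],
     [1, 0, 0, 1, 0],
     [0, 1, 1, 0, 1],
     [0, 1, 1, 0, 0]]
scores = item_item_scores(X, seed_track_ids=[0])
print(rank_local_tracks(scores, [2, 3, 4], [0], k=3).tolist())  # [3, 2, 4]
```

The same `rank_local_tracks` step would apply unchanged to scores produced by ALS or BPR, which is what makes the comparison across algorithms like-for-like under this protocol.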

If this is right

Item-item neighborhood methods should be considered first for any recommendation setting dominated by long-tail items with sparse user data.
Matrix factorization approaches may require additional side information or hybrid designs when the target items are both local and obscure.
Standard large-scale benchmarks can mask performance differences that appear once evaluation is restricted to a narrow geographic or thematic subset.
Playlist generation for live-event discovery benefits from neighborhood similarity rather than latent-factor modeling when artist popularity is low.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same evaluation restriction could be applied to other location-aware domains such as local food or event recommendation to test whether neighborhood methods retain an edge.
If user listening data were augmented with explicit location tags, the performance gap between IIN and matrix factorization might shrink or reverse.
The result suggests that similarity-based methods may scale better than factorization when the item catalog is partitioned by many small, disjoint user communities.

Load-bearing premise

The modified evaluation procedure that restricts each algorithm to ranking only tracks by local artists for each of the eight cities accurately captures real-world performance on the cold-start problem for local music recommendation.

What would settle it

A live A/B test in one of the eight cities in which users are shown playlists generated by IIN versus ALS or BPR and the local-track listen rate or completion rate is measured; if IIN does not produce higher engagement on local tracks the claim is falsified.

Original abstract

We explore the task of local music recommendation: provide listeners with personalized playlists of relevant tracks by artists who play most of their live events within a small geographic area. Most local artists tend to be obscure, long-tail artists and generally have little or no available user preference data associated with them. This creates a cold-start problem for collaborative filtering-based recommendation algorithms that depend on large amounts of such information to make accurate recommendations. In this paper, we compare the performance of three standard recommender system algorithms (Item-Item Neighborhood (IIN), Alternating Least Squares for Implicit Feedback (ALS), and Bayesian Personalized Ranking (BPR)) on the task of local music recommendation using the Million Playlist Dataset. To do this, we modify the standard evaluation procedure such that the algorithms only rank tracks by local artists for each of the eight different cities. Despite the fact that techniques based on matrix factorization (ALS, BPR) typically perform best on large recommendation tasks, we find that the neighborhood-based approach (IIN) performs best for long-tail local music recommendation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper reports IIN beating ALS and BPR on a city-restricted local-artist ranking task from the Million Playlist Dataset, but the evaluation protocol does not confirm the items are true cold-start cases.

The main thing to know is that item-item neighborhood outperforms the two matrix factorization methods when the ranking pool is limited to local artists in each of eight cities. The authors modify the standard playlist completion task on the Million Playlist Dataset so that each algorithm only ranks tracks by artists who mostly play in that city, and they report the reversal of the usual ordering. This is the concrete empirical result they add to the literature on long-tail music recommendation.

The setup is straightforward and uses an existing public dataset, which keeps the comparison grounded. The task definition itself matches a real use case for local music playlists where most artists have little data.

The soft spots sit in the evaluation details. The abstract does not describe how local artists are identified or show that those artists have near-zero interactions in the training split. If the local artists retain even modest co-occurrences, the observed advantage for IIN may simply reflect its known behavior on sparse matrices rather than any special handling of geographic cold-start. No information appears on hyperparameter tuning or statistical significance of the differences.

This paper is for researchers working on music or niche-domain recommenders who want to see how standard algorithms behave under a geographically constrained ranking rule. It does not introduce new methods or reorganize broader theory. The experiment is worth checking with the full protocol, so I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates three recommender algorithms (Item-Item Neighborhood (IIN), ALS, and BPR) on the Million Playlist Dataset for the task of generating playlists consisting only of local artists (defined per city). The standard ranking evaluation is modified to restrict each algorithm to ranking tracks by local artists in eight cities; the central empirical finding is that IIN outperforms the matrix-factorization methods despite the latter typically excelling on large-scale tasks.

Significance. If the result holds under a properly validated cold-start protocol, the work supplies a concrete data point that neighborhood methods can be preferable to MF for geographic long-tail recommendation, with direct implications for music platforms serving local artists. The use of a public dataset and an explicitly described task modification are positive features.

major comments (2)

[Evaluation section] The modified ranking protocol (restricting test items to local-artist tracks) is presented as measuring performance on the long-tail local cold-start task, yet the manuscript provides no verification that local artists retain near-zero interactions in the training split; without this check the observed IIN advantage cannot be attributed specifically to cold-start handling.
[Results section] The headline reversal (IIN > ALS/BPR) depends on the claim that the eight-city restriction isolates the desired task; no analysis is supplied showing that the removed (non-local) items are not precisely those on which MF methods excel, leaving open the possibility that the ordering simply reflects IIN's known behavior on sparse co-occurrence matrices rather than any special suitability for local recommendation.

minor comments (2)

[Experimental setup] Hyperparameter selection and tuning procedure for ALS and BPR are not described; this information is needed to interpret the comparison.
[Results section] No statistical significance tests or confidence intervals are reported for the performance differences across the eight cities.
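One way the missing uncertainty estimates could be produced is a paired percentile bootstrap over evaluation playlists. The sketch below is an assumption of this review, not a procedure from the manuscript: the function name `paired_bootstrap_ci`, the choice of the percentile bootstrap, and the per-playlist metric arrays are all hypothetical. It resamples playlists with replacement and reports an interval for the mean difference between two algorithms scored on the same playlists.

```python
import numpy as np

def paired_bootstrap_ci(metric_a, metric_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the mean of (metric_a - metric_b).

    metric_a and metric_b hold one value per evaluation playlist, in the
    same playlist order (for example a ranking metric for IIN and for ALS).
    Returns (mean difference, lower bound, upper bound).
    """
    a = np.asarray(metric_a, dtype=float)
    b = np.asarray(metric_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("paired metrics must have the same shape")
    diffs = a - b
    rng = np.random.default_rng(seed)
    # Each row is one resample of playlist indices, drawn with replacement.
    idx = rng.integers(0, diffs.size, size=(n_resamples, diffs.size))
    means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(lo), float(hi)
```

Run once per city, an interval that excludes zero would indicate that the reported ordering is unlikely to be a resampling artifact for that city; it would not by itself address the cold-start concern raised in the major comments.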

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation protocol and results interpretation. We address each major comment below and indicate planned revisions where appropriate.

Referee: [Evaluation section] The modified ranking protocol (restricting test items to local-artist tracks) is presented as measuring performance on the long-tail local cold-start task, yet the manuscript provides no verification that local artists retain near-zero interactions in the training split; without this check the observed IIN advantage cannot be attributed specifically to cold-start handling.

Authors: We agree that an explicit check on training-set interaction counts for the local artists would strengthen the cold-start interpretation. In the revised manuscript we will add a supplementary table reporting the mean and median number of playlist occurrences for local versus non-local artists in the training split for each of the eight cities. This will confirm that the local artists are indeed long-tail with near-zero interactions relative to the overall item distribution. revision: yes

Referee: [Results section] The headline reversal (IIN > ALS/BPR) depends on the claim that the eight-city restriction isolates the desired task; no analysis is supplied showing that the removed (non-local) items are not precisely those on which MF methods excel, leaving open the possibility that the ordering simply reflects IIN's known behavior on sparse co-occurrence matrices rather than any special suitability for local recommendation.

Authors: The eight-city restriction is a deliberate design choice that matches the target task of generating playlists consisting solely of local artists; performance on non-local tracks is outside the scope of the problem we study. While we recognize that matrix-factorization methods often benefit from dense popular-item data, the observed ordering is consistent with neighborhood methods' documented advantage on sparse co-occurrence data, which is the regime occupied by local artists. We will add a short paragraph in the results section discussing this alignment with prior literature on neighborhood versus factorization behavior under sparsity, but we do not believe a full re-analysis of the removed items is required to support the task-specific claim. revision: partial
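The interaction-count check promised in the first response could look roughly like the following. This is a minimal sketch under stated assumptions, not the authors' analysis: the function name `occurrence_summary` is hypothetical, and counts are taken per track rather than per artist because no track-to-artist mapping is described on this page.

```python
import numpy as np

def occurrence_summary(X_train, local_track_ids):
    """Mean and median playlist occurrences for local vs non-local tracks.

    X_train is a playlist-by-track matrix for the training split; a
    positive entry means the track appears in that playlist. Both groups
    are assumed to be non-empty.
    """
    X = np.asarray(X_train)
    counts = (X > 0).sum(axis=0)              # playlists containing each track
    is_local = np.zeros(X.shape[1], dtype=bool)
    is_local[list(local_track_ids)] = True

    def stats(c):
        return {
            "mean": float(c.mean()),
            "median": float(np.median(c)),
            "n_tracks": int(c.size),
        }

    return {"local": stats(counts[is_local]), "non_local": stats(counts[~is_local])}
```

A table of these summaries for each city and fold would show directly whether the local tracks sit near zero occurrences or retain enough co-occurrence signal for a neighborhood method to exploit.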

Circularity Check

0 steps flagged

No circularity: empirical algorithm comparison on held-out data with no derivations or self-referential quantities

The paper performs a direct empirical comparison of IIN, ALS, and BPR on the Million Playlist Dataset under a modified ranking protocol that restricts candidates to local-artist tracks per city. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes appear in the abstract or described method. The central claim (IIN outperforms MF methods on this task) is a measured outcome on held-out playlists, not a quantity that reduces to its own inputs by construction. This matches the default expectation of a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces no new mathematical entities or fitted constants; it relies on the standard assumption that collaborative filtering performance on a modified ranking task reflects cold-start behavior for geographically constrained artists.

axioms (1)

domain assumption: Restricting the candidate set to local artists in the evaluation procedure isolates the cold-start problem for long-tail local music.
This premise is invoked when the authors modify the standard evaluation so algorithms only rank local tracks.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 1 internal anchor

[1]

Evaluating Recommender System Algorithms for Generating Local Music Playlists

INTRODUCTION If you were to move to a new city and wanted to check out the local music scene, how would you get started? You might ask an expert, such as an employee at a local music store or a barista at a local coffee shop, but they are likely to give you incomplete or biased recommendations based on their own personal experiences and interests. You m...

work page internal anchor Pith review Pith/arXiv arXiv 2018
[2]

Our main data structure is a Playlist-Track matrix which is akin to a User-Item matrix in standard CF research

RECOMMENDER SYSTEM ALGORITHMS In this section we describe three common recommendation algorithms: Item-Item Neighborhood (IIN) Recommendation, Alternating Least Squares (ALS) for Implicit Feedback, and Bayesian Personalized Ranking (BPR). Our main data structure is a Playlist-Track matrix which is akin to a User-Item matrix in standard CF research. Ea...

work page 2009
[3]

For the paper, we consider a local artist to be an artist that performs the large majority of their live events close to or within a single city

LOCAL MUSIC DATA Our first task is to identify a set of local artists for a given city. For the paper, we consider a local artist to be an artist that performs the large majority of their live events close to or within a single city. We collected artist event information from both Ticketfly and Facebook. Ticketfly provides information about large a...

work page 2019
[4]

That is, we use each group as the evaluation set once and the other four as part of the training set each time

EXPERIMENTS For each of these cities, we use the following evaluation procedure:

Algorithm 1 Evaluation Procedure
1: for each city do
2:   for each fold do
3:     construct X_train and X_eval
4:     for each algorithm do
5:       train model with X_train
6:       for each playlist x^(p) ∈ X_eval do
7:         split x^(p) into x_non-local and x_local
8:         use x_non-local with model to predict x̂_local
9:...

work page 2023
[5]

The notable exception to this is Chicago, in which the popularity baseline outperformed all other models in all three metrics

RESULTS As shown in Table 2, the Item-Item Neighborhood model outperforms both baselines (Random, Popularity) and both matrix factorization models (ALS, BPR) in nearly every scenario. The notable exception to this is Chicago, in which the popularity baseline outperformed all other models in all three metrics. This can be explained, however, due to the e...

work page 2017
[6]

CONCLUSIONS We have presented a novel approach for evaluating local (long-tail) music recommendation. That is, by partitioning a large playlist-track matrix into non-local and local (mostly long-tail) tracks, and considering playlists with one or more of these local tracks, we can evaluate how different recommender systems perform on this task. Surprisin...

work page
[7]

The long tail: Why the future of business is selling less of more

Chris Anderson. The long tail: Why the future of business is selling less of more. Hachette Books, 2006

work page 2006
[8]

Statistical biases in information retrieval metrics for recommender systems

Alejandro Bellogín, Pablo Castells, and Iván Cantador. Statistical biases in information retrieval metrics for recommender systems. Information Retrieval Journal, 20(6):606–634, 2017

work page 2017
[9]

Music recommendation

Oscar Celma. Music recommendation. In Music recommendation and discovery, pages 43–85. Springer, 2010

work page 2010
[10]

From hits to niches?: or how popular artists can bias music recommendation and discovery

Òscar Celma and Pedro Cano. From hits to niches?: or how popular artists can bias music recommendation and discovery. In Proceedings of the 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition, page 5. ACM, 2008

work page 2008
[11]

Recsys challenge 2018: Automatic music playlist continuation

Ching-Wei Chen, Paul Lamere, Markus Schedl, and Hamed Zamani. Recsys challenge 2018: Automatic music playlist continuation. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 527–528. ACM, 2018

work page 2018
[12]

Interactive effects of personality and frequency of exposure on liking for music

Patrick G Hunter and E Glenn Schellenberg. Interactive effects of personality and frequency of exposure on liking for music. Personality and Individual Differences, 50(2):175–179, 2011

work page 2011
[13]

Matrix factorization techniques for recommender systems

Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009

work page 2009
[14]

Music recommendation and the long tail

Mark Levy and Klaas Bosteels. Music recommendation and the long tail. In 1st Workshop On Music Recommendation And Discovery (WOMRAD), ACM RecSys, 2010, Barcelona, Spain. Citeseer, 2010

work page 2010
[15]

Subjective complexity, familiarity, and liking for popular music

Adrian C North and David J Hargreaves. Subjective complexity, familiarity, and liking for popular music. Psychomusicology: A Journal of Research in Music Cognition, 14(1-2):77, 1995

work page 1995
[16]

BPR: Bayesian personalized ranking from implicit feedback

Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, pages 452–461, Arlington, Virginia, United States, 2009. AUAI Press

work page 2009
[17]

Item-based collaborative filtering recommendation algorithms

Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, WWW ’01, pages 285–295, New York, NY, USA, 2001. ACM

work page 2001
[18]

Current challenges and visions in music recommender systems research

Markus Schedl, Hamed Zamani, Ching-Wei Chen, Yashar Deldjoo, and Mehdi Elahi. Current challenges and visions in music recommender systems research. International Journal of Multimedia Information Retrieval, 7(2):95–116, 2018

work page 2018
[19]

Five approaches to collecting tags for music

Douglas Turnbull, Luke Barrington, and Gert RG Lanckriet. Five approaches to collecting tags for music. In ISMIR, volume 8, pages 225–230, 2008

work page 2008
[20]

Collaborative filtering for implicit feedback datasets

C. Volinsky, Y. Koren, and Y. Hu. Collaborative filtering for implicit feedback datasets. In ICDM 2008. Eighth IEEE International Conference on Data Mining, pages 263–272, Los Alamitos, CA, USA, Dec. 2008. IEEE Computer Society

work page 2008
[22]

Two-stage model for automatic playlist continuation at scale

Maksims Volkovs, Himanshu Rai, Zhaoyue Cheng, Ga Wu, Yichao Lu, and Scott Sanner. Two-stage model for automatic playlist continuation at scale. In Proceedings of the ACM Recommender Systems Challenge 2018, page 9. ACM, 2018

work page 2018
