Flatter is better: Percentile Transformations for Recommender Systems

Bamshad Mobasher; Masoud Mansoury; Robin Burke

arxiv: 1907.07766 · v1 · pith:VA367WLOnew · submitted 2019-07-10 · 💻 cs.IR · cs.LG

Pith reviewed 2026-05-24 23:15 UTC · model grok-4.3

classification 💻 cs.IR cs.LG

keywords: recommender systems, rating transformation, percentile, rating distribution, user bias, preprocessing, ranking performance

The pith

Converting ratings to percentiles before generating recommendations flattens distributions and improves performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that rating distributions lacking flatness correlate with weaker recommendation results. Users differ in how they use the rating scale and tend to give high scores, and standard normalization corrects only for central tendency. The paper introduces a preprocessing step that converts each user's ratings to percentile values, which adjusts for both central tendency and skew at once. Experiments across four datasets and multiple algorithms show improved ranking performance with the transformed ratings. The transformation is simple to apply before any existing recommendation method runs.

Core claim

Lack of flatness in rating distributions is negatively correlated with recommendation performance. Converting ratings into percentile values as a pre-processing step flattens the distribution, compensates for both skew and central tendency, and improves recommendation performance. A smoothed version of the transformation is also presented for users with narrow rating ranges.
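The claim depends on a measurable notion of flatness, which this summary does not define. As a rough sketch only, the block below uses the normalized Shannon entropy of a user's rating histogram as a stand-in: 1.0 means every scale value is used equally often, 0.0 means a single value is used. The function name `normalized_entropy` and the choice of entropy are assumptions made for illustration, not the paper's measure.

```python
import math
from collections import Counter

def normalized_entropy(values, num_bins):
    """Hypothetical flatness score in [0, 1] for ratings on a discrete scale.

    values   -- one user's ratings (or all ratings in a data set)
    num_bins -- number of distinct values on the rating scale (must be > 1)
    """
    counts = Counter(values)
    n = len(values)
    # Shannon entropy of the empirical rating histogram.
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Divide by the entropy of a uniform distribution over the scale.
    return entropy / math.log(num_bins)

flat_user = [1, 2, 3, 4, 5]      # uses every value once
peaked_user = [5, 5, 5, 5]       # uses a single value
skewed_user = [5, 5, 5, 4, 3]    # mostly high ratings

flat_score = normalized_entropy(flat_user, num_bins=5)
peaked_score = normalized_entropy(peaked_user, num_bins=5)
skewed_score = normalized_entropy(skewed_user, num_bins=5)
```

With such a score in hand, "lack of flatness" could be taken as one minus the score and compared against a per-user ranking metric.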

What carries the argument

Percentile transformation of ratings, which maps each user's scores to their rank order within that user's history to produce a flatter distribution.
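A minimal sketch of a per-user percentile transformation follows. It assumes the mid-rank convention (ratings below the target count fully, tied ratings count half); the paper may use a different percentile position, and the function name `percentile_transform` and the toy ratings are hypothetical.

```python
from collections import defaultdict

def percentile_transform(ratings):
    """Replace each rating with its percentile within the same user's history.

    ratings -- list of (user, item, rating) triples
    Returns a list of (user, item, percentile) triples in the same order.
    """
    # Group every user's ratings so each rating is compared only
    # against that user's own history.
    by_user = defaultdict(list)
    for user, _item, rating in ratings:
        by_user[user].append(rating)

    transformed = []
    for user, item, rating in ratings:
        history = by_user[user]
        below = sum(1 for r in history if r < rating)
        equal = sum(1 for r in history if r == rating)
        # Mid-rank percentile (an assumed convention).
        pct = 100.0 * (below + 0.5 * equal) / len(history)
        transformed.append((user, item, pct))
    return transformed

# Toy data: u1 rates high, u2 rates low.
ratings = [
    ("u1", "a", 5), ("u1", "b", 5), ("u1", "c", 4), ("u1", "d", 3),
    ("u2", "a", 2), ("u2", "b", 3), ("u2", "c", 1), ("u2", "d", 3),
]
result = percentile_transform(ratings)
```

In this toy example u1's top rating (5) and u2's top rating (3) both map to 75.0, so the two users' preferences become comparable even though they use different parts of the scale.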

If this is right

The transformation improves ranking performance when used with state-of-the-art recommendation algorithms.
It compensates for differences across user rating distributions more effectively than methods that adjust only central tendency.
A smoothed variant yields more intuitive outputs for users who rate within a narrow range.
Results hold across four real-world datasets.
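This page does not describe how the smoothed variant in the list above works. One plausible reading, sketched here purely as an assumption, is to pad each user's history with `k` pseudo-ratings at every value of the rating scale before computing the percentile. The function name `smoothed_percentile`, the parameter `k`, and the 1 to 5 scale are all hypothetical.

```python
def smoothed_percentile(history, rating, scale=(1, 2, 3, 4, 5), k=1):
    """Mid-rank percentile of `rating` after padding the user's history.

    history -- one user's ratings
    rating  -- the rating to transform
    scale   -- assumed rating scale values
    k       -- assumed number of pseudo-ratings added per scale value;
               k=0 reduces to the unsmoothed mid-rank percentile
    """
    # Pseudo-ratings keep the percentile tied to the scale itself,
    # not only to the few values the user happened to use.
    padded = list(history) + [v for v in scale for _ in range(k)]
    below = sum(1 for r in padded if r < rating)
    equal = sum(1 for r in padded if r == rating)
    return 100.0 * (below + 0.5 * equal) / len(padded)

always_five = [5, 5, 5, 5]
plain = smoothed_percentile(always_five, 5, k=0)
smoothed = smoothed_percentile(always_five, 5, k=1)
```

For a user who only ever gives 5, the unsmoothed mid-rank percentile is 50.0, which reads as a neutral score. Under the padding assumption it rises to about 72.2, closer to the intuition that a 5 is a high rating.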

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same flattening idea might apply to implicit signals such as click counts or dwell times by converting them to rank-based values.
Recommendation models that already include user bias terms may still gain from this preprocessing because it addresses distribution shape beyond mean shifts.
If flatness matters, then evaluation protocols that ignore rating-scale usage patterns could systematically underestimate algorithm quality on skewed data.

Load-bearing premise

The negative correlation between lack of flatness and performance stems from the shape of the distribution itself rather than other confounding factors, and the percentile step preserves enough information for algorithms to use.

What would settle it

Apply the percentile transform to a new dataset and measure whether ranking metrics fail to improve or whether the flatness-performance correlation disappears when other variables are controlled.

Figures

Figures reproduced from arXiv: 1907.07766 by Bamshad Mobasher, Masoud Mansoury, Robin Burke.

**Figure 2.** Raw and binned percentile distributions for BookCrossing data set.

**Figure 3.** Percentage of users who provided identical ratings.

Original abstract

It is well known that explicit user ratings in recommender systems are biased towards high ratings, and that users differ significantly in their usage of the rating scale. Implementers usually compensate for these issues through rating normalization or the inclusion of a user bias term in factorization models. However, these methods adjust only for the central tendency of users' distributions. In this work, we demonstrate that lack of flatness in rating distributions is negatively correlated with recommendation performance. We propose a rating transformation model that compensates for skew in the rating distribution as well as its central tendency by converting ratings into percentile values as a pre-processing step before recommendation generation. This transformation flattens the rating distribution, better compensates for differences in rating distributions, and improves recommendation performance. We also show a smoothed version of this transformation designed to yield more intuitive results for users with very narrow rating distributions. A comprehensive set of experiments show improved ranking performance for these percentile transformations with state-of-the-art recommendation algorithms in four real-world data sets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Percentile transform gives measurable ranking gains on four datasets but the link from flatness to performance is not isolated from activity level.

The main point is that mapping ratings to percentiles before running a recommender improves ranking metrics on four real datasets, yet the paper does not rule out that the observed correlation between peaked distributions and weak performance is driven by how many ratings a user supplied rather than by the shape itself.

The authors start from the observation that standard normalization only shifts the center of a user's ratings and leaves the skew untouched. They convert each user's ratings to percentile ranks, which both recenters and flattens the distribution, and they add a smoothed version for users whose ratings occupy only one or two values. They then plug the transformed ratings into several current algorithms and report better NDCG and similar measures across the four collections. That is the concrete contribution: a simple pre-processing step that is easy to implement and shows consistent lifts in the experiments they ran.

The experiments themselves are the strongest part of the work. Using multiple datasets and established baselines gives the claim some weight, and the idea of targeting flatness rather than just mean or variance is a clear incremental step beyond ordinary user bias terms.

The soft spot is the possible confound with user activity. Nothing in the abstract or the described method indicates that the authors stratified by rating count, ran a regression with activity controls, or checked partial correlations. Users who rate few items often produce peaked distributions and also yield noisier recommendations for other reasons, so the negative correlation could be an artifact of activity level. If that is the case, the performance gain from the percentile step might come from the rescaling or tie-breaking effect rather than from the flattening property.

The paper is aimed at people who build or tune recommender pipelines and want a lightweight way to reduce scale differences. It has enough empirical content and a reproducible-sounding method to justify sending it to referees, though the authors will almost certainly be asked to address the possible confound with user activity.

Referee Report

2 major / 2 minor

Summary. The paper claims that lack of flatness in user rating distributions is negatively correlated with recommendation performance, and proposes a percentile-based rating transformation (with a smoothed variant) as a pre-processing step that flattens distributions, compensates for both skew and central tendency, and improves ranking performance of state-of-the-art algorithms on four real-world datasets.

Significance. If the central empirical claim holds after proper controls, the work offers a lightweight, model-agnostic pre-processing technique that extends existing normalization practices and could be adopted broadly in production systems. The experiments across multiple datasets and algorithms provide a useful empirical demonstration, though the attribution to flatness specifically remains to be isolated.

major comments (2)

[experiments section] The reported negative correlation between lack of flatness and performance (abstract and experiments) does not include controls or stratification for user activity level (number of ratings per user) or other potential confounders. Users with few ratings tend to produce both peaked distributions and noisier recommendations; without partial correlation, regression controls, or activity-matched subsampling, the correlation cannot be attributed to distribution shape itself.
[experiments section] The performance gains from the percentile transformation are presented as resulting from flattening, but the manuscript does not isolate this mechanism from other effects such as global rescaling or tie resolution. An ablation comparing the percentile transform against a simple min-max or z-score normalization (which also alters central tendency but does not flatten) would be required to support the specific claim.
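The first major comment asks for a partial correlation that controls for user activity. A minimal sketch of the first-order partial correlation is given below, using only the standard library. The per-user numbers are synthetic values invented for illustration and carry no empirical meaning; the variable names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Synthetic per-user values (illustrative only, not from the paper).
peakedness = [0.9, 0.8, 0.6, 0.5, 0.3, 0.2]        # lack of flatness
ndcg = [0.10, 0.14, 0.15, 0.22, 0.24, 0.30]        # ranking quality
n_ratings = [5, 12, 9, 40, 35, 80]                 # user activity

raw_r = pearson(peakedness, ndcg)
controlled_r = partial_correlation(peakedness, ndcg, n_ratings)
```

If `controlled_r` stays clearly negative while `raw_r` is negative, the relationship is not explained by a linear effect of activity alone; if it shrinks toward zero, activity is a likely confounder.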

minor comments (2)

[abstract] The abstract refers to 'a comprehensive set of experiments', but the provided details lack explicit reporting of statistical significance tests, exact baseline implementations, and hyperparameter tuning protocols.
Notation for the smoothed percentile variant should be introduced with a clear equation or pseudocode to distinguish it from the basic percentile transform.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions we will incorporate.

Referee: [experiments section] The reported negative correlation between lack of flatness and performance (abstract and experiments) does not include controls or stratification for user activity level (number of ratings per user) or other potential confounders. Users with few ratings tend to produce both peaked distributions and noisier recommendations; without partial correlation, regression controls, or activity-matched subsampling, the correlation cannot be attributed to distribution shape itself.

Authors: We agree that user activity level is a plausible confounder. In the revised manuscript we will add partial correlation coefficients between lack of flatness and recommendation performance while controlling for the number of ratings per user. We will also report results on activity-matched subsamples to verify that the relationship persists after stratification. revision: yes
Referee: [experiments section] The performance gains from the percentile transformation are presented as resulting from flattening, but the manuscript does not isolate this mechanism from other effects such as global rescaling or tie resolution. An ablation comparing the percentile transform against a simple min-max or z-score normalization (which also alters central tendency but does not flatten) would be required to support the specific claim.

Authors: We accept that isolating the flattening mechanism requires additional controls. We will include a new ablation that applies min-max normalization and z-score normalization to the same four datasets and algorithms, allowing direct comparison of ranking metrics against the percentile and smoothed-percentile transforms. revision: yes
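The two baselines named in this exchange can be sketched per user as follows. Both are affine maps of the ratings, so they shift and rescale a user's distribution without changing its shape, which is why they serve as a control for the flattening effect. The function names and the handling of constant histories are assumptions for illustration.

```python
import statistics

def min_max(history):
    """Rescale one user's ratings linearly to [0, 1]."""
    lo, hi = min(history), max(history)
    if hi == lo:
        # Assumed fallback: a user with a single distinct value gets 0.5.
        return [0.5 for _ in history]
    return [(r - lo) / (hi - lo) for r in history]

def z_score(history):
    """Center one user's ratings on their mean and scale by their std. dev."""
    mean = statistics.fmean(history)
    sd = statistics.pstdev(history)  # population standard deviation
    if sd == 0:
        # Assumed fallback: no spread means every rating maps to 0.
        return [0.0 for _ in history]
    return [(r - mean) / sd for r in history]

history = [5, 5, 4, 3]
scaled = min_max(history)
standardized = z_score(history)
```

For this history, min-max gives 1.0, 1.0, 0.5, 0.0: the gaps between values keep their original proportions and the two tied 5s still sit at the top, so the skew toward high ratings survives the rescaling.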

Circularity Check

0 steps flagged

No circularity: empirical correlation and experimental validation

The paper's claims rest on direct empirical demonstration of a negative correlation between lack of flatness and recommendation performance, followed by experimental validation that the proposed percentile transformation improves ranking metrics on four real-world datasets using state-of-the-art algorithms. No load-bearing mathematical derivation, fitted parameter renamed as prediction, or self-citation chain is present; the transformation is introduced as a preprocessing heuristic whose benefits are shown through explicit before/after comparisons rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on one domain assumption about rating biases, and on the empirical correlation and performance improvement reported in the abstract.

axioms (1)

domain assumption: User rating distributions vary in central tendency and skew, affecting recommendation performance.
Stated in the abstract as well known and demonstrated.

