The Unreasonable Effectiveness of Data for Recommender Systems
Pith reviewed 2026-05-10 18:07 UTC · model grok-4.3
The pith
More training data continues to improve recommender performance with no saturation in typical cases
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper reports that NDCG@10 scores increase with larger training set sizes across nine sample points from 100k to 100M interactions, with no observable saturation point reached. After min-max normalization within each algorithm-dataset group, around 75 percent of the points at the largest completed sample size also record the group's highest performance. Late-stage slope analysis over the final 10-30 percent of each group shows the interquartile range fully non-negative with a median near 1.0, supporting an ongoing upward trend for traditional recommender systems on typical data.
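The normalization step and the "largest sample is also the best" statistic can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names, the group keys, and the NDCG@10 values are all hypothetical, and treating a completely flat curve as "not at its maximum" is an assumption made here.

```python
def min_max_normalize(scores):
    """Rescale one group's scores to [0, 1]; a flat group maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def share_largest_is_best(groups):
    """Fraction of groups whose largest completed sample has the top score."""
    hits = 0
    for curve in groups.values():
        # curve: list of (sample_size, ndcg) pairs for completed runs only
        ordered = sorted(curve)
        normalized = min_max_normalize([ndcg for _, ndcg in ordered])
        if normalized[-1] == 1.0:
            hits += 1
    return hits / len(groups)


# Hypothetical NDCG@10 values, invented for illustration only.
groups = {
    ("algo_a", "dataset_x"): [(100_000, 0.02), (1_000_000, 0.05), (10_000_000, 0.08)],
    ("algo_b", "dataset_x"): [(100_000, 0.10), (1_000_000, 0.14), (10_000_000, 0.13)],
}
print(share_largest_is_best(groups))  # 0.5
```

Normalizing within each group removes differences in absolute NDCG scale between algorithms and datasets, so only the shape of each curve is compared.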
What carries the argument
Absolute stratified user sampling to build training subsets of increasing size while holding test sets fixed, followed by NDCG@10 evaluation across multiple algorithms and toolkits
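The exact sampling procedure is not spelled out on this page, so the following is only one plausible reading: treat each user as a stratum, give every user a quota proportional to their share of interactions, and use largest-remainder rounding so the subset hits an absolute target count. The function name, the `(user, item)` tuple layout, and the rounding rule are assumptions; the fixed held-out test set is not modeled.

```python
import random
from collections import defaultdict


def stratified_user_sample(interactions, target, seed=0):
    """Draw exactly `target` interactions, keeping each user's share roughly proportional."""
    if target > len(interactions):
        raise ValueError("target exceeds available interactions")
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item in interactions:
        by_user[user].append((user, item))

    total = len(interactions)
    # Proportional quota per user, floored; leftovers go to the largest remainders.
    exact = {u: len(rows) * target / total for u, rows in by_user.items()}
    quota = {u: int(x) for u, x in exact.items()}
    leftover = target - sum(quota.values())
    by_remainder = sorted(exact, key=lambda u: exact[u] - quota[u], reverse=True)
    for u in by_remainder[:leftover]:
        quota[u] += 1

    sample = []
    for u, rows in by_user.items():
        # Sample without replacement inside each user's stratum.
        sample.extend(rng.sample(rows, quota[u]))
    return sample


# Hypothetical toy data: u1 has 6 interactions, u2 has 3, u3 has 1.
toy = [("u1", i) for i in range(6)] + [("u2", i) for i in range(3)] + [("u3", 0)]
subset = stratified_user_sample(toy, target=5)
print(len(subset))  # 5
```

Under this reading, low-activity users can drop out of small subsets entirely, which is one way the sampling could alter sparsity relative to the full dataset.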
If this is right
- Data volume remains a primary driver of better offline accuracy for most traditional recommender algorithms
- Saturation points appear rare on typical user-item interaction datasets
- Weaker scaling occurs mainly in atypical datasets and specific algorithmic cases such as RecBole BPR
- Continued investment in larger training sets is likely to produce gains rather than diminishing returns
Where Pith is reading between the lines
- This finding implies that data acquisition and storage costs may stay justified for longer in production recommender pipelines
- It raises the question of whether the same scaling pattern appears when measuring online user metrics instead of offline NDCG
- The result suggests testing whether controlling for catalog size or user activity distribution changes the observed trends
- Future experiments could check if very large industrial datasets exhibit the same lack of saturation
Load-bearing premise
Absolute stratified user sampling creates dataset sizes whose performance trends accurately reflect what would be seen when scaling the full original dataset without introducing selection bias
What would settle it
Training the same models on the complete original datasets and observing either a plateau or drop in NDCG@10 relative to the largest sampled subset
Original abstract
In recommender systems, collecting, storing, and processing large-scale interaction data is increasingly costly in terms of time, energy, and computation, yet it remains unclear when additional data stops providing meaningful gains. This paper investigates how offline recommendation performance evolves as the size of the training dataset increases and whether a saturation point can be observed. We implemented a reproducible Python evaluation workflow with two established toolkits, LensKit and RecBole, included 11 large public datasets with at least 7 million interactions, and evaluated 10 tool-algorithm combinations. Using absolute stratified user sampling, we trained models on nine sample sizes from 100,000 to 100,000,000 interactions and measured NDCG@10. Overall, raw NDCG usually increased with sample size, with no observable saturation point. To make result groups comparable, we applied min-max normalization within each group, revealing a clear positive trend in which around 75% of the points at the largest completed sample size also achieved the group's best observed performance. A late-stage slope analysis over the final 10-30% of each group further supported this upward trend: the interquartile range remained entirely non-negative with a median near 1.0. In summary, for traditional recommender systems on typical user-item interaction data, incorporating more training data remains primarily beneficial, while weaker scaling behavior is concentrated in atypical dataset cases and in the algorithmic outlier RecBole BPR under our setup.
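For readers unfamiliar with the metric, NDCG@10 for a single user can be sketched as follows. This assumes binary relevance and a ranked list without duplicates; it is a generic textbook form, not the LensKit or RecBole implementation, and the function name is hypothetical.

```python
import math


def ndcg_at_k(ranked_items, relevant_items, k=10):
    """NDCG@k with binary relevance for a single user."""
    relevant = set(relevant_items)
    # Discounted gain: a hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    # Ideal ordering places every relevant item at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


print(round(ndcg_at_k(["x", "a", "y"], ["a"]), 4))  # 0.6309
```

A dataset-level score would typically average this value over all test users.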
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that for traditional recommender systems on typical user-item interaction data, increasing training dataset size from 100k to 100M interactions generally improves NDCG@10 without observable saturation. Experiments use absolute stratified user sampling on 11 large public datasets, 10 algorithm-tool combinations from LensKit and RecBole, min-max normalization within groups, and late-stage slope analysis (final 10-30% of curves) showing non-negative interquartile ranges with median near 1.0. Weaker scaling is isolated to atypical datasets and the RecBole BPR outlier.
Significance. If the central empirical trends hold, the work provides useful large-scale evidence that data scaling remains effective for standard recommenders, supporting continued investment in data collection rather than assuming early diminishing returns. Strengths include the reproducible Python workflow, use of established public datasets and toolkits with described sampling/normalization, and direct measurement of performance curves across nine sizes.
major comments (2)
- [Methods] Methods section (sampling procedure): Absolute stratified user sampling to exact interaction counts (100k–100M) is presented as a proxy for natural scaling, but it can systematically alter sparsity patterns, per-user interaction distributions, item popularity skew, and new-vs-repeat interaction ratios relative to organic data growth. This directly affects whether the observed NDCG@10 trends and late-stage slopes generalize to the claim about 'typical user-item interaction data.' Validation against alternative sampling (e.g., random or temporal) or bias diagnostics is needed.
- [Results] Results section (normalization and slope analysis): Min-max normalization within each group followed by slope computation over the final 10–30% treats the constructed subsets as comparable; any sampling-induced bias propagates into the '75% achieve best performance' statistic and the entirely non-negative IQR claim. Sensitivity of the slope median (~1.0) to the exact percentage window or to unnormalized raw scores should be reported.
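The bias diagnostics requested in the first major comment could start from two simple statistics computed per sample size: matrix density (the complement of sparsity) and the Gini coefficient of item popularity. The sketch below is an assumption about what such diagnostics might look like; the function names and the `(user, item)` tuple layout are hypothetical.

```python
from collections import Counter


def density(interactions):
    """Share of the user-item matrix that is filled (1 - sparsity)."""
    pairs = set(interactions)
    if not pairs:
        return 0.0
    users = {u for u, _ in pairs}
    items = {i for _, i in pairs}
    return len(pairs) / (len(users) * len(items))


def gini(counts):
    """Gini coefficient of non-negative counts: 0 is uniform, larger is more concentrated."""
    values = sorted(counts)
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((idx + 1) * v for idx, v in enumerate(values))
    return 2 * weighted / (n * total) - (n + 1) / n


def popularity_gini(interactions):
    """Gini coefficient of per-item interaction counts."""
    return gini(Counter(item for _, item in interactions).values())


# Hypothetical toy interactions as (user, item) pairs.
toy_pairs = [("u1", "a"), ("u1", "b"), ("u2", "a")]
print(density(toy_pairs))  # 0.75
```

Comparing these values across the nine sample sizes, and against the full dataset, would indicate how far the subsets drift from the original distribution.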
minor comments (2)
- A table or appendix listing all 11 datasets with their original sizes, domains, and sampling details would improve reproducibility.
- Clarify whether the nine sample sizes are strictly nested (i.e., each larger sample contains the smaller ones) or independently drawn; this affects interpretation of the curves.
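To make the nested-versus-independent distinction concrete, one simple way to obtain strictly nested samples is to shuffle once and take prefixes of increasing length. This sketch ignores user stratification for brevity and is not claimed to be the paper's procedure; the function name and seed handling are hypothetical.

```python
import random


def nested_samples(interactions, sizes, seed=0):
    """Return one sample per size, each a prefix of the same shuffled order."""
    rng = random.Random(seed)
    shuffled = list(interactions)
    rng.shuffle(shuffled)
    # Sizes larger than the dataset are skipped, mirroring runs that cannot complete.
    return {size: shuffled[:size] for size in sorted(sizes) if size <= len(shuffled)}


samples = nested_samples(list(range(100)), sizes=[10, 50, 200])
print(set(samples[10]) <= set(samples[50]))  # True
```

With nested samples, each step of a curve only adds data; with independent draws, part of the point-to-point variation comes from resampling noise.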
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback on our manuscript. We have carefully considered the major comments regarding the sampling procedure and the normalization/slope analysis. Below, we provide point-by-point responses. We agree that additional justification and sensitivity checks would strengthen the paper and plan to incorporate them in the revision.
Referee: [Methods] Methods section (sampling procedure): Absolute stratified user sampling to exact interaction counts (100k–100M) is presented as a proxy for natural scaling, but it can systematically alter sparsity patterns, per-user interaction distributions, item popularity skew, and new-vs-repeat interaction ratios relative to organic data growth. This directly affects whether the observed NDCG@10 trends and late-stage slopes generalize to the claim about 'typical user-item interaction data.' Validation against alternative sampling (e.g., random or temporal) or bias diagnostics is needed.
Authors: We thank the referee for highlighting this important methodological consideration. We chose absolute stratified user sampling because it creates subsets with precisely controlled interaction counts while preserving the relative user activity levels of the original datasets, which cleanly isolates the effect of data volume. We acknowledge that it does not perfectly mimic organic growth, which could involve different dynamics in user and item distributions. However, we believe it provides a reasonable proxy for studying scaling in typical user-item data, as the datasets themselves are real-world collections. In the revised version, we will expand the Methods section with a more detailed justification of the sampling choice, along with basic bias diagnostics (e.g., changes in sparsity and popularity skew across sizes). Full validation against alternative sampling schemes such as temporal splits would require substantial additional experiments and is noted as a limitation for future work. We argue that the consistent trends across 11 diverse datasets support the generalizability of our findings despite the sampling method.
Revision: partial
Referee: [Results] Results section (normalization and slope analysis): Min-max normalization within each group followed by slope computation over the final 10–30% treats the constructed subsets as comparable; any sampling-induced bias propagates into the '75% achieve best performance' statistic and the entirely non-negative IQR claim. Sensitivity of the slope median (~1.0) to the exact percentage window or to unnormalized raw scores should be reported.
Authors: We appreciate this point on the robustness of our analysis. The min-max normalization was applied within each algorithm-dataset group to enable comparison of relative trends across different performance scales. The late-stage slope analysis was intended to focus on the behavior at larger data sizes. To address the concern, we will include in the revision a sensitivity analysis showing how the slope statistics (median and IQR) vary with different window sizes (e.g., final 10%, 20%, 30%) and will also report trends on unnormalized scores where possible. We note that the primary claim relies on the raw observations of increasing NDCG in most cases, with normalization used only for aggregation. This additional reporting should confirm that the positive trend is not an artifact of the specific analysis choices.
Revision: yes
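The promised window-sensitivity check could look roughly like the sketch below, which fits a least-squares slope to the last 10%, 20%, and 30% of each curve and summarizes the slopes by median and quartiles. Several choices here are assumptions rather than details from the paper: the x-axis scaling (which determines whether a slope near 1.0 is meaningful), the rule of always using at least two points, and all function names.

```python
import math
from statistics import median, quantiles


def late_slope(xs, ys, fraction):
    """Least-squares slope over the last `fraction` of points (at least two)."""
    k = max(2, math.ceil(fraction * len(xs)))
    xt, yt = xs[-k:], ys[-k:]
    mx, my = sum(xt) / k, sum(yt) / k
    sxx = sum((x - mx) ** 2 for x in xt)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xt, yt))
    return sxy / sxx  # requires distinct x values in the window


def window_sensitivity(curves, fractions=(0.1, 0.2, 0.3)):
    """Median and quartiles of late-stage slopes for each window size."""
    report = {}
    for f in fractions:
        slopes = [late_slope(xs, ys, f) for xs, ys in curves]
        q1, _, q3 = quantiles(slopes, n=4)  # needs at least two curves
        report[f] = {"median": median(slopes), "q1": q1, "q3": q3}
    return report


# Hypothetical normalized curves: (x positions, normalized NDCG values).
curves = [([0, 1, 2], [0.0, 0.5, 1.0]), ([0, 1, 2], [0.0, 1.0, 0.5])]
print(window_sensitivity(curves)[0.1]["median"])  # 0.0
```

With nine sample points per group, the 10% and 20% windows both collapse to the last two points under this rule, which is itself a reason to report results for several window sizes.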
Circularity Check
No circularity: purely empirical scaling measurements
The paper reports direct experimental results from training 10 algorithm-tool combinations on nine absolute-stratified subsamples (100k to 100M interactions) drawn from 11 public datasets, then measuring raw NDCG@10, applying within-group min-max normalization, and computing late-stage slopes. No equations, first-principles derivations, fitted parameters, or predictions are claimed; the central claim that more data remains beneficial is presented as an observation from these runs. The sampling procedure and normalization are methodological choices whose validity is external to the reported trends, with no reduction of any result to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: NDCG@10 is a suitable proxy for recommendation quality
- domain assumption: Absolute stratified user sampling produces representative subsets for scaling analysis
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Using absolute stratified user sampling, we trained models on nine sample sizes from 100,000 to 100,000,000 interactions and measured NDCG@10. ... late-stage slope analysis over the final 10-30% of each group further supported this upward trend
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Overall, raw NDCG usually increased with sample size, with no observable saturation point.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.