
arxiv: 2604.06420 · v2 · submitted 2026-04-07 · 💻 cs.IR · cs.LG

The Unreasonable Effectiveness of Data for Recommender Systems

Pith reviewed 2026-05-10 18:07 UTC · model grok-4.3

classification 💻 cs.IR cs.LG
keywords recommender systems · data scaling · NDCG@10 · training data size · saturation point · user sampling · offline evaluation · LensKit RecBole

The pith

More training data continues to improve recommender performance with no saturation in typical cases

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether recommender systems reach a point where adding more user-item interaction data stops yielding gains. It builds training samples ranging from 100,000 to 100 million interactions via absolute stratified user sampling on 11 large public datasets and runs 10 tool-algorithm combinations, measuring NDCG@10 for each. Raw scores usually rise with sample size, with no saturation point visible even at the largest scales tested. After min-max normalization within each group, roughly 75 percent of the points at the largest completed sample size also record their group's best score, and the interquartile range of slopes over the final portion of each curve stays non-negative. The pattern holds for standard setups; weaker trends are limited to atypical datasets and one algorithmic outlier, RecBole BPR.

Core claim

The paper reports that NDCG@10 usually increases with training set size across nine sample sizes from 100k to 100M interactions, with no observable saturation point. After min-max normalization within each algorithm-dataset group, around 75 percent of the points at the largest completed sample size also record the group's best performance. A late-stage slope analysis over the final 10-30 percent of each group shows an interquartile range that is entirely non-negative with a median near 1.0, supporting an ongoing upward trend for traditional recommender systems on typical data.
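
The within-group normalization and the "largest sample is also the best" statistic can be illustrated with a minimal sketch. It is not the paper's code: the function names, the group keys, and the example scores are hypothetical, and ties are assumed to count as a hit.

```python
from typing import Dict, List, Tuple

# One curve per (tool-algorithm, dataset) group: (sample_size, NDCG@10) pairs.
Curve = List[Tuple[int, float]]


def min_max_normalize(curve: Curve) -> Curve:
    """Rescale the NDCG@10 values of one group to the range [0, 1]."""
    scores = [score for _, score in curve]
    lo, hi = min(scores), max(scores)
    if hi == lo:  # flat curve: there is no spread to normalize
        return [(size, 0.0) for size, _ in curve]
    return [(size, (score - lo) / (hi - lo)) for size, score in curve]


def share_best_at_largest(groups: Dict[Tuple[str, str], Curve]) -> float:
    """Fraction of groups whose largest sample also holds the best score."""
    hits = 0
    for curve in groups.values():
        normalized = sorted(min_max_normalize(curve))  # sort by sample size
        largest_score = normalized[-1][1]
        # Assumption: a tie with the group maximum counts as a hit.
        if largest_score == max(score for _, score in normalized):
            hits += 1
    return hits / len(groups)


# Hypothetical scores, invented for illustration only.
example_groups = {
    ("ItemKNN", "dataset-A"): [(100_000, 0.05), (1_000_000, 0.10), (10_000_000, 0.15)],
    ("BPR", "dataset-B"): [(100_000, 0.08), (1_000_000, 0.12), (10_000_000, 0.10)],
}
print(share_best_at_largest(example_groups))  # prints 0.5
```

In the example, the first group peaks at its largest sample and the second peaks one step earlier, so half of the groups count toward the statistic.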

What carries the argument

Absolute stratified user sampling to build training subsets of increasing size while holding test sets fixed, followed by NDCG@10 evaluation across multiple algorithms and toolkits
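
The sampling procedure is not spelled out on this page, so the following is only one plausible reading of "absolute stratified user sampling", sketched under explicit assumptions: users are grouped into strata by activity level, whole users are drawn from the strata in turn, and drawing stops once an absolute target number of interactions is reached. The function name, the `n_strata` parameter, and the truncation of the last user are all assumptions, not the paper's method.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Interaction = Tuple[str, str]  # (user_id, item_id)


def stratified_user_sample(
    interactions: List[Interaction],
    target_interactions: int,
    n_strata: int = 4,  # hypothetical choice of activity buckets
    seed: int = 0,
) -> List[Interaction]:
    """Draw whole users, stratum by stratum, until the target count is met."""
    if not interactions or target_interactions <= 0:
        return []
    rng = random.Random(seed)
    by_user: Dict[str, List[Interaction]] = defaultdict(list)
    for user, item in interactions:
        by_user[user].append((user, item))

    # Rank users by activity and cut the ranking into contiguous strata.
    ranked = sorted(by_user, key=lambda u: (len(by_user[u]), u))
    size = -(-len(ranked) // n_strata)  # ceiling division
    strata = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    for stratum in strata:
        rng.shuffle(stratum)

    # Round-robin over strata so every activity level is drawn at the same rate.
    sample: List[Interaction] = []
    while len(sample) < target_interactions and any(strata):
        for stratum in strata:
            if stratum and len(sample) < target_interactions:
                sample.extend(by_user[stratum.pop()])
    # Assumption: the last user is truncated to hit the exact target size.
    return sample[:target_interactions]
```

A call such as `stratified_user_sample(interactions, 100_000)` would return at most 100,000 interactions. Whether the paper truncates the final user, draws nested samples, or balances strata differently is not stated here.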

If this is right

  • Data volume remains a primary driver of better offline accuracy for most traditional recommender algorithms
  • Saturation points appear rare on typical user-item interaction datasets
  • Weaker scaling occurs mainly in atypical datasets and specific algorithmic cases such as RecBole BPR
  • Continued investment in larger training sets is likely to produce gains rather than diminishing returns

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This finding implies that data acquisition and storage costs may stay justified for longer in production recommender pipelines
  • It raises the question of whether the same scaling pattern appears when measuring online user metrics instead of offline NDCG
  • The result suggests testing whether controlling for catalog size or user activity distribution changes the observed trends
  • Future experiments could check if very large industrial datasets exhibit the same lack of saturation

Load-bearing premise

Absolute stratified user sampling creates dataset sizes whose performance trends accurately reflect what would be seen when scaling the full original dataset without introducing selection bias

What would settle it

Training the same models on the complete original datasets and observing either a plateau or drop in NDCG@10 relative to the largest sampled subset

Figures

Figures reproduced from arXiv: 2604.06420 by Youssef Abdou.

Figure 2: NDCG@10 vs. Sample Size. Accompanying text (truncated in the source): "… sample size, with no visible point at which NDCG begins to diminish. Datasets that consistently show both a steep upward trend and a relatively high NDCG include MovieLens and Netflix, reaching approximately 0.25 at the full 32m and around 0.22 at 100m, respectively (both from RecBole’s Item KNN results). In contrast, Last.fm shows no convincing performance improvement with incre…"
Figure 4: Normalized NDCG@10 vs. Sample Size Scatter
Figure 5: Figure 4 distributions metadata
Figure 6: Normalized Values Late-Stage Slope Distribution
Original abstract

In recommender systems, collecting, storing, and processing large-scale interaction data is increasingly costly in terms of time, energy, and computation, yet it remains unclear when additional data stops providing meaningful gains. This paper investigates how offline recommendation performance evolves as the size of the training dataset increases and whether a saturation point can be observed. We implemented a reproducible Python evaluation workflow with two established toolkits, LensKit and RecBole, included 11 large public datasets with at least 7 million interactions, and evaluated 10 tool-algorithm combinations. Using absolute stratified user sampling, we trained models on nine sample sizes from 100,000 to 100,000,000 interactions and measured NDCG@10. Overall, raw NDCG usually increased with sample size, with no observable saturation point. To make result groups comparable, we applied min-max normalization within each group, revealing a clear positive trend in which around 75% of the points at the largest completed sample size also achieved the group's best observed performance. A late-stage slope analysis over the final 10-30% of each group further supported this upward trend: the interquartile range remained entirely non-negative with a median near 1.0. In summary, for traditional recommender systems on typical user-item interaction data, incorporating more training data remains primarily beneficial, while weaker scaling behavior is concentrated in atypical dataset cases and in the algorithmic outlier RecBole BPR under our setup.
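
For readers unfamiliar with the metric, NDCG@10 for a single user can be computed as in the sketch below. It uses the standard binary-relevance definition and is not taken from LensKit or RecBole, whose implementations may differ in details such as tie handling or how users without test items are treated. The function name is hypothetical.

```python
import math
from typing import Sequence, Set


def ndcg_at_k(ranked_items: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """NDCG@k with binary relevance for one user's ranked recommendation list."""
    # Discounted gain: a hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    # Ideal ranking: all relevant items (up to k of them) placed at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


print(ndcg_at_k(["a", "b", "c"], {"a"}))  # prints 1.0
```

The score reported for a model is typically the mean of this value over all test users, which is the quantity plotted against sample size.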

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that for traditional recommender systems on typical user-item interaction data, increasing training dataset size from 100k to 100M interactions generally improves NDCG@10 without observable saturation. Experiments use absolute stratified user sampling on 11 large public datasets, 10 algorithm-tool combinations from LensKit and RecBole, min-max normalization within groups, and late-stage slope analysis (final 10-30% of curves) showing non-negative interquartile ranges with median near 1.0. Weaker scaling is isolated to atypical datasets and the RecBole BPR outlier.

Significance. If the central empirical trends hold, the work provides useful large-scale evidence that data scaling remains effective for standard recommenders, supporting continued investment in data collection rather than assuming early diminishing returns. Strengths include the reproducible Python workflow, use of established public datasets and toolkits with described sampling/normalization, and direct measurement of performance curves across nine sizes.

major comments (2)
  1. [Methods] Methods section (sampling procedure): Absolute stratified user sampling to exact interaction counts (100k–100M) is presented as a proxy for natural scaling, but it can systematically alter sparsity patterns, per-user interaction distributions, item popularity skew, and new-vs-repeat interaction ratios relative to organic data growth. This directly affects whether the observed NDCG@10 trends and late-stage slopes generalize to the claim about 'typical user-item interaction data.' Validation against alternative sampling (e.g., random or temporal) or bias diagnostics is needed.
  2. [Results] Results section (normalization and slope analysis): Min-max normalization within each group followed by slope computation over the final 10–30% treats the constructed subsets as comparable; any sampling-induced bias propagates into the '75% achieve best performance' statistic and the entirely non-negative IQR claim. Sensitivity of the slope median (~1.0) to the exact percentage window or to unnormalized raw scores should be reported.
minor comments (2)
  1. Table or appendix listing all 11 datasets with their original sizes, domains, and sampling details would improve reproducibility.
  2. Clarify whether the nine sample sizes are strictly nested (i.e., each larger sample contains the smaller ones) or independently drawn; this affects interpretation of the curves.
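
The bias diagnostics requested in major comment 1 could take a form like the following sketch, which computes interaction density, mean user activity, and the Gini coefficient of item popularity for one sample. The choice of these three statistics and all names are assumptions made for illustration, not something the paper or the referee specifies.

```python
from collections import Counter
from typing import Dict, List, Tuple

Interaction = Tuple[str, str]  # (user_id, item_id)


def gini(counts: List[int]) -> float:
    """Gini coefficient of non-negative counts (0 = equal, near 1 = concentrated)."""
    values = sorted(counts)
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * v for i, v in enumerate(values))
    return (2 * weighted) / (n * total) - (n + 1) / n


def sample_diagnostics(interactions: List[Interaction]) -> Dict[str, float]:
    """Summary statistics to compare across sample sizes of one dataset."""
    if not interactions:
        raise ValueError("need at least one interaction")
    users = Counter(user for user, _ in interactions)
    items = Counter(item for _, item in interactions)
    return {
        "density": len(interactions) / (len(users) * len(items)),
        "mean_user_activity": len(interactions) / len(users),
        "item_popularity_gini": gini(list(items.values())),
    }
```

Reporting these values for each of the nine sample sizes would show whether density or popularity skew drifts as the samples grow, which is the concern the comment raises.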

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We have carefully considered the major comments regarding the sampling procedure and the normalization/slope analysis. Below, we provide point-by-point responses. We agree that additional justification and sensitivity checks would strengthen the paper and plan to incorporate them in the revision.

  1. Referee: [Methods] Methods section (sampling procedure): Absolute stratified user sampling to exact interaction counts (100k–100M) is presented as a proxy for natural scaling, but it can systematically alter sparsity patterns, per-user interaction distributions, item popularity skew, and new-vs-repeat interaction ratios relative to organic data growth. This directly affects whether the observed NDCG@10 trends and late-stage slopes generalize to the claim about 'typical user-item interaction data.' Validation against alternative sampling (e.g., random or temporal) or bias diagnostics is needed.

    Authors: We thank the referee for highlighting this important methodological consideration. Our choice of absolute stratified user sampling was motivated by the need to create subsets with precisely controlled interaction counts while preserving the relative user activity levels from the original datasets. This approach allows for a clean isolation of the effect of data volume. We acknowledge that it does not perfectly mimic organic growth, which could involve different dynamics in user and item distributions. However, we believe it provides a reasonable proxy for studying scaling in typical user-item data, as the datasets themselves are real-world collections. In the revised version, we will expand the Methods section to include a more detailed justification of the sampling choice, along with basic bias diagnostics (e.g., changes in sparsity and popularity skew across sizes). Full validation with alternative sampling schemes such as temporal splits would require substantial additional experiments and is noted as a limitation for future work. We argue that the consistent trends across 11 diverse datasets support the generalizability of our findings despite the sampling method. revision: partial

  2. Referee: [Results] Results section (normalization and slope analysis): Min-max normalization within each group followed by slope computation over the final 10–30% treats the constructed subsets as comparable; any sampling-induced bias propagates into the '75% achieve best performance' statistic and the entirely non-negative IQR claim. Sensitivity of the slope median (~1.0) to the exact percentage window or to unnormalized raw scores should be reported.

    Authors: We appreciate this point on the robustness of our analysis. The min-max normalization was applied within each algorithm-dataset group to enable comparison of relative trends across different performance scales. The late-stage slope analysis was intended to focus on the behavior at larger data sizes. To address the concern, we will include in the revision a sensitivity analysis showing how the slope statistics (median and IQR) vary with different window sizes (e.g., final 10%, 20%, 30%) and also report trends on unnormalized scores where possible. We note that the primary claim relies on the raw observations of increasing NDCG in most cases, with normalization used only for aggregation. This additional reporting will confirm that the positive trend is not an artifact of the specific analysis choices. revision: yes
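
The window-sensitivity check promised in this response might look like the sketch below. It assumes that both axes are min-max normalized, that the x-axis is the base-10 logarithm of the sample size, and that the "final 10-30%" is taken as the last few points of each curve with at least two points always kept. None of these details are confirmed by the page, and all names are hypothetical.

```python
import math
import statistics
from typing import Dict, List, Sequence, Tuple

Curve = List[Tuple[int, float]]  # (sample_size, NDCG@10) for one group


def normalize(values: List[float]) -> List[float]:
    """Min-max rescale to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def late_stage_slope(curve: Curve, tail_fraction: float) -> float:
    """Least-squares slope over the tail of a normalized curve."""
    if len(curve) < 2:
        raise ValueError("need at least two points")
    points = sorted(curve)
    xs = normalize([math.log10(size) for size, _ in points])
    ys = normalize([score for _, score in points])
    # Assumption: keep at least two points so that a slope is defined.
    k = max(2, math.ceil(tail_fraction * len(points)))
    tx, ty = xs[-k:], ys[-k:]
    mx, my = sum(tx) / k, sum(ty) / k
    sxx = sum((x - mx) ** 2 for x in tx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(tx, ty))
    return sxy / sxx if sxx > 0 else 0.0


def slope_sensitivity(
    groups: Sequence[Curve], fractions: Sequence[float] = (0.1, 0.2, 0.3)
) -> Dict[float, float]:
    """Median late-stage slope across groups for each tail fraction."""
    return {
        f: statistics.median(late_stage_slope(curve, f) for curve in groups)
        for f in fractions
    }
```

Under these assumptions, a curve that grows linearly in log sample size has a normalized slope of 1.0 in every window, which gives one way to read the reported median near 1.0. With only nine points per group, the 10%, 20%, and 30% windows can collapse to the same two or three points, so the sensitivity table may show little variation by construction.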

Circularity Check

0 steps flagged

No circularity: purely empirical scaling measurements

The paper reports direct experimental results from training 10 algorithm-tool combinations on nine absolute-stratified subsamples (100k to 100M interactions) drawn from 11 public datasets, then measuring raw NDCG@10, applying within-group min-max normalization, and computing late-stage slopes. No equations, first-principles derivations, fitted parameters, or predictions are claimed; the central claim that more data remains beneficial is presented as an observation from these runs. The sampling procedure and normalization are methodological choices whose validity is external to the reported trends, with no reduction of any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study uses standard evaluation practices without introducing new free parameters, axioms beyond domain norms, or invented entities.

axioms (2)
  • domain assumption NDCG@10 is a suitable proxy for recommendation quality
    Standard metric in the field but assumes ranking correlates with user utility.
  • domain assumption Absolute stratified user sampling produces representative subsets for scaling analysis
    Invoked to generate the nine dataset sizes from full data.

