The Unreasonable Effectiveness of Data for Recommender Systems

Youssef Abdou

arXiv: 2604.06420 · v2 · submitted 2026-04-07 · cs.IR · cs.LG

Pith reviewed 2026-05-10 18:07 UTC · model grok-4.3

keywords: recommender systems, data scaling, NDCG@10, training data size, saturation point, user sampling, offline evaluation, LensKit, RecBole

The pith

More training data continues to improve recommender performance with no saturation in typical cases

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether recommender systems reach a point where adding more user-item interaction data stops yielding gains. It creates training samples ranging from 100,000 to 100 million interactions via absolute stratified user sampling on 11 large public datasets and runs 10 tool-algorithm combinations to measure NDCG@10. Raw scores rise with sample size in most groups, with no plateau visible even at the largest scales tested. After normalization within each group, roughly 75 percent of the largest samples achieve the group's best result, and the slopes over the final portion of each curve are mostly non-negative. The pattern holds for standard setups, with weaker trends limited to atypical datasets and one algorithmic outlier.
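For readers unfamiliar with the metric, the following is a minimal sketch of NDCG@10 for a single user under binary relevance. It is not taken from the paper, LensKit, or RecBole; the function name `ndcg_at_k` and its arguments are illustrative, and the toolkits may differ in details such as graded relevance or how users without test items are handled.

```python
import math


def ndcg_at_k(ranked_items, relevant_items, k=10):
    """Binary-relevance NDCG@k for one user's ranked recommendation list.

    Assumes `ranked_items` contains no duplicates (hypothetical helper).
    """
    relevant = set(relevant_items)
    # DCG: a hit at 0-based rank i contributes 1 / log2(i + 2).
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    # Ideal DCG: all relevant items (at most k of them) ranked first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

A dataset-level score would typically be the mean of this value over all test users, which is the quantity the scaling curves track.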

Core claim

The paper reports that NDCG@10 scores increase with larger training set sizes across nine sample sizes from 100k to 100M interactions, with no observable saturation point. After min-max normalization within each algorithm-dataset group, around 75 percent of the points at the largest completed sample size also record the group's highest performance. Late-stage slope analysis over the final 10-30 percent of each group shows an interquartile range that is entirely non-negative with a median near 1.0, which supports a continuing upward trend for traditional recommender systems on typical data.
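The within-group normalization and the "largest sample is also the best" statistic can be sketched as follows. This is an illustration under assumptions, not the author's code: the column names (`tool`, `algorithm`, `dataset`, `sample_size`, `ndcg10`) and both function names are hypothetical, and a flat group is mapped to 0.0 as an arbitrary convention.

```python
import pandas as pd

GROUP_KEYS = ["tool", "algorithm", "dataset"]  # assumed grouping columns


def normalize_within_groups(results):
    """Min-max scale NDCG@10 to [0, 1] inside each group."""
    out = results.copy()
    grouped = out.groupby(GROUP_KEYS)["ndcg10"]
    lo = grouped.transform("min")
    hi = grouped.transform("max")
    span = hi - lo
    # A flat group (max == min) has no trend; map it to 0.0 instead of dividing by zero.
    out["ndcg10_norm"] = ((out["ndcg10"] - lo) / span.where(span > 0)).fillna(0.0)
    return out


def share_largest_is_best(normalized):
    """Fraction of groups whose largest completed sample has normalized score 1.0."""
    idx = normalized.groupby(GROUP_KEYS)["sample_size"].idxmax()
    largest = normalized.loc[idx]
    return float((largest["ndcg10_norm"] == 1.0).mean())
```

Because each group is rescaled by its own minimum and maximum, the normalized values describe the shape of a curve but not its absolute level, which is why the raw NDCG@10 curves remain the primary evidence.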

What carries the argument

Absolute stratified user sampling to build training subsets of increasing size while holding test sets fixed, followed by NDCG@10 evaluation across multiple algorithms and toolkits
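The summary here does not spell out the sampling procedure, so the sketch below shows only one plausible reading of "absolute stratified user sampling": users are binned into activity strata, and whole users are drawn evenly across strata until an absolute interaction target is reached. The function name, the `user_id` column, the quartile strata, and the round-robin draw are all assumptions rather than the paper's method.

```python
import numpy as np
import pandas as pd


def sample_users_to_target(interactions, target, n_strata=4, seed=0):
    """Keep whole users, spread across activity strata, until about `target` interactions."""
    rng = np.random.default_rng(seed)
    counts = interactions.groupby("user_id").size()
    # Stratify users by activity; ranking first avoids duplicate quantile edges.
    strata = pd.qcut(counts.rank(method="first"), q=n_strata, labels=False)
    # Shuffle users inside each stratum, then visit the strata round-robin.
    queues = [
        rng.permutation(counts.index[(strata == s).to_numpy()].to_numpy())
        for s in range(n_strata)
    ]
    chosen, total, position = [], 0, 0
    longest = max(len(q) for q in queues)
    while total < target and position < longest:
        for queue in queues:
            if position < len(queue) and total < target:
                user = queue[position]
                chosen.append(user)
                total += int(counts.loc[user])
        position += 1
    return interactions[interactions["user_id"].isin(chosen)]
```

This version stops at the first user who reaches or exceeds the target, so it can overshoot by up to one user's history. Hitting an exact interaction count would need an extra trimming step, which is left out here.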

If this is right

Data volume remains a primary driver of better offline accuracy for most traditional recommender algorithms
Saturation points appear rare on typical user-item interaction datasets
Weaker scaling occurs mainly in atypical datasets and specific algorithmic cases such as RecBole BPR
Continued investment in larger training sets is likely to produce gains rather than diminishing returns

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This finding implies that data acquisition and storage costs may stay justified for longer in production recommender pipelines
It raises the question of whether the same scaling pattern appears when measuring online user metrics instead of offline NDCG
The result suggests testing whether controlling for catalog size or user activity distribution changes the observed trends
Future experiments could check if very large industrial datasets exhibit the same lack of saturation

Load-bearing premise

Absolute stratified user sampling creates dataset sizes whose performance trends accurately reflect what would be seen when scaling the full original dataset without introducing selection bias

What would settle it

Training the same models on the complete original datasets and observing either a plateau or drop in NDCG@10 relative to the largest sampled subset

Figures

Figures reproduced from arXiv: 2604.06420 by Youssef Abdou.

**Figure 2.** NDCG@10 vs. Sample Size. … sample size, with no visible point at which NDCG begins to diminish. Datasets that consistently show both a steep upward trend and a relatively high NDCG include MovieLens and Netflix, reaching approximately 0.25 at the full 32m and around 0.22 at 100m, respectively (both from RecBole's Item KNN results). In contrast, Last.fm shows no convincing performance improvement with incre…

**Figure 4.** Normalized NDCG@10 vs. Sample Size Scatter

**Figure 5.** Figure 4 distributions metadata

**Figure 6.** Normalized Values Late-Stage Slope Distribution

Original abstract

In recommender systems, collecting, storing, and processing large-scale interaction data is increasingly costly in terms of time, energy, and computation, yet it remains unclear when additional data stops providing meaningful gains. This paper investigates how offline recommendation performance evolves as the size of the training dataset increases and whether a saturation point can be observed. We implemented a reproducible Python evaluation workflow with two established toolkits, LensKit and RecBole, included 11 large public datasets with at least 7 million interactions, and evaluated 10 tool-algorithm combinations. Using absolute stratified user sampling, we trained models on nine sample sizes from 100,000 to 100,000,000 interactions and measured NDCG@10. Overall, raw NDCG usually increased with sample size, with no observable saturation point. To make result groups comparable, we applied min-max normalization within each group, revealing a clear positive trend in which around 75% of the points at the largest completed sample size also achieved the group's best observed performance. A late-stage slope analysis over the final 10-30% of each group further supported this upward trend: the interquartile range remained entirely non-negative with a median near 1.0. In summary, for traditional recommender systems on typical user-item interaction data, incorporating more training data remains primarily beneficial, while weaker scaling behavior is concentrated in atypical dataset cases and in the algorithmic outlier RecBole BPR under our setup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper shows empirically that NDCG@10 keeps rising with more training data for most recsys setups on public datasets, without saturation, but the stratified sampling step is the part that needs closer checking.

This paper finds that adding more training interactions usually improves NDCG@10 for standard recommender algorithms on these 11 public datasets, with no saturation even at the largest sizes tested. The author ran the same experiment with LensKit and RecBole across ten tool-algorithm combinations and nine sample sizes from 100k to 100M interactions, then used min-max normalization within groups plus a late-stage slope check to make the trends comparable. Around 75 percent of the biggest samples hit the best performance in their group, and the slope interquartile range stayed non-negative. That is the main new observation: a larger-scale confirmation that data scaling stays beneficial for typical cases, with weaker results isolated to a few atypical datasets and the RecBole BPR outlier.

The work is straightforward to reproduce in principle because it sticks to established toolkits and public data, and it describes its sampling and normalization steps clearly. It extends earlier, smaller scaling studies by covering more datasets and algorithm-toolkit pairs at once.

The soft spot is the absolute stratified user sampling used to create the size variants. Selecting users to hit exact interaction counts can still change per-user activity levels, item skew, or sparsity patterns relative to natural growth, which weakens how directly the trends apply to real data collection decisions. The normalization and slope analysis are reasonable but inherit any bias from the constructed subsets. Everything stays within offline metrics, so the paper does not address online costs or production A/B outcomes.

This is useful for practitioners and researchers who need practical guidance on whether to keep investing in larger interaction logs rather than new modeling tricks. It engages honestly with the scaling question and uses reproducible methods, so it deserves a serious referee, even though the sampling justification will probably need tightening in revision.

Referee Report

2 major / 2 minor

Summary. The paper claims that for traditional recommender systems on typical user-item interaction data, increasing training dataset size from 100k to 100M interactions generally improves NDCG@10 without observable saturation. Experiments use absolute stratified user sampling on 11 large public datasets, 10 algorithm-tool combinations from LensKit and RecBole, min-max normalization within groups, and late-stage slope analysis (final 10-30% of curves) showing non-negative interquartile ranges with median near 1.0. Weaker scaling is isolated to atypical datasets and the RecBole BPR outlier.

Significance. If the central empirical trends hold, the work provides useful large-scale evidence that data scaling remains effective for standard recommenders, supporting continued investment in data collection rather than assuming early diminishing returns. Strengths include the reproducible Python workflow, use of established public datasets and toolkits with described sampling/normalization, and direct measurement of performance curves across nine sizes.

major comments (2)

[Methods] Methods section (sampling procedure): Absolute stratified user sampling to exact interaction counts (100k–100M) is presented as a proxy for natural scaling, but it can systematically alter sparsity patterns, per-user interaction distributions, item popularity skew, and new-vs-repeat interaction ratios relative to organic data growth. This directly affects whether the observed NDCG@10 trends and late-stage slopes generalize to the claim about 'typical user-item interaction data.' Validation against alternative sampling (e.g., random or temporal) or bias diagnostics is needed.
[Results] Results section (normalization and slope analysis): Min-max normalization within each group followed by slope computation over the final 10–30% treats the constructed subsets as comparable; any sampling-induced bias propagates into the '75% achieve best performance' statistic and the entirely non-negative IQR claim. Sensitivity of the slope median (~1.0) to the exact percentage window or to unnormalized raw scores should be reported.

minor comments (2)

A table or appendix listing all 11 datasets with their original sizes, domains, and sampling details would improve reproducibility.
Clarify whether the nine sample sizes are strictly nested (i.e., each larger sample contains the smaller ones) or independently drawn; this affects interpretation of the curves.
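The distinction raised in the second minor comment can be made concrete with a small sketch. It is illustrative only: the function names are hypothetical, and for brevity the sizes count users rather than interactions. Nested samples are prefixes of one shuffled order, so every larger sample contains the smaller ones, while independent samples are drawn separately and share users only by chance.

```python
import numpy as np


def nested_user_samples(user_ids, sizes, seed=0):
    """One user subset per size, each a prefix of the same shuffled order."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(np.asarray(user_ids))
    return {n: order[:n] for n in sorted(sizes)}


def independent_user_samples(user_ids, sizes, seed=0):
    """One user subset per size, each drawn separately without replacement."""
    rng = np.random.default_rng(seed)
    pool = np.asarray(user_ids)
    return {n: rng.choice(pool, size=n, replace=False) for n in sorted(sizes)}
```

With nested samples, the difference between adjacent points on a curve reflects only the added users. With independent draws, it also includes sampling noise from replacing the earlier users.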

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We have carefully considered the major comments regarding the sampling procedure and the normalization/slope analysis. Below, we provide point-by-point responses. We agree that additional justification and sensitivity checks would strengthen the paper and plan to incorporate them in the revision.

Referee: [Methods] Methods section (sampling procedure): Absolute stratified user sampling to exact interaction counts (100k–100M) is presented as a proxy for natural scaling, but it can systematically alter sparsity patterns, per-user interaction distributions, item popularity skew, and new-vs-repeat interaction ratios relative to organic data growth. This directly affects whether the observed NDCG@10 trends and late-stage slopes generalize to the claim about 'typical user-item interaction data.' Validation against alternative sampling (e.g., random or temporal) or bias diagnostics is needed.

Authors: We thank the referee for highlighting this important methodological consideration. Our choice of absolute stratified user sampling was motivated by the need to create subsets with precisely controlled interaction counts while preserving the relative user activity levels from the original datasets. This approach allows for a clean isolation of the effect of data volume. We acknowledge that it does not perfectly mimic organic growth, which could involve different dynamics in user and item distributions. However, we believe it provides a reasonable proxy for studying scaling in typical user-item data, as the datasets themselves are real-world collections. In the revised version, we will expand the Methods section to include a more detailed justification of the sampling choice, along with basic bias diagnostics (e.g., changes in sparsity and popularity skew across sizes). Full validation with alternative sampling schemes such as temporal splits would require substantial additional experiments and is noted as a limitation for future work. We argue that the consistent trends across 11 diverse datasets support the generalizability of our findings despite the sampling method.
revision: partial
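The bias diagnostics mentioned in this response could look roughly like the sketch below, which computes matrix density, a Gini coefficient of item popularity, and median user activity for one subset. The choice of statistics, the function names, and the `user_id`/`item_id` columns are assumptions, since the response does not specify which diagnostics will be reported.

```python
import numpy as np
import pandas as pd


def density(interactions):
    """Share of the user-item matrix that is filled (1 - sparsity)."""
    pairs = interactions[["user_id", "item_id"]].drop_duplicates()
    return len(pairs) / (pairs["user_id"].nunique() * pairs["item_id"].nunique())


def gini(counts):
    """Gini coefficient of non-negative counts: 0 is perfectly even, near 1 is concentrated."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(2 * (ranks * x).sum() / (n * x.sum()) - (n + 1) / n)


def diagnostics(interactions):
    """Summary statistics to compare across sample sizes and against the full dataset."""
    item_popularity = interactions.groupby("item_id").size()
    user_activity = interactions.groupby("user_id").size()
    return {
        "density": density(interactions),
        "item_popularity_gini": gini(item_popularity),
        "median_user_activity": float(user_activity.median()),
    }
```

Running `diagnostics` on each sampled subset and on the full dataset would show whether these properties drift with sample size, which is the concern the referee raises.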
Referee: [Results] Results section (normalization and slope analysis): Min-max normalization within each group followed by slope computation over the final 10–30% treats the constructed subsets as comparable; any sampling-induced bias propagates into the '75% achieve best performance' statistic and the entirely non-negative IQR claim. Sensitivity of the slope median (~1.0) to the exact percentage window or to unnormalized raw scores should be reported.

Authors: We appreciate this point on the robustness of our analysis. The min-max normalization was applied within each algorithm-dataset group to enable comparison of relative trends across different performance scales. The late-stage slope analysis was intended to focus on the behavior at larger data sizes. To address the concern, we will include in the revision a sensitivity analysis showing how the slope statistics (median and IQR) vary with different window sizes (e.g., final 10%, 20%, 30%) and also report trends on unnormalized scores where possible. We note that the primary claim relies on the raw observations of increasing NDCG in most cases, with normalization used only for aggregation. This additional reporting will confirm that the positive trend is not an artifact of the specific analysis choices.
revision: yes
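A window-sensitivity check of the kind promised here might be sketched as follows. The paper's exact slope definition is not given in this summary, so the least-squares fit, the rounding rule for the window, and the function names are assumptions. The x values are taken to be whatever scaled sample-size axis the analysis uses.

```python
import numpy as np


def late_stage_slope(x, y, tail_fraction):
    """Least-squares slope over the final `tail_fraction` of points (at least two)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_tail = max(2, int(np.ceil(tail_fraction * len(x))))
    slope, _intercept = np.polyfit(x[-n_tail:], y[-n_tail:], deg=1)
    return float(slope)


def slope_summary(groups, tail_fraction):
    """Median and quartiles of late-stage slopes across (x, y) groups."""
    slopes = [late_stage_slope(x, y, tail_fraction) for x, y in groups]
    q1, median, q3 = np.percentile(slopes, [25, 50, 75])
    return {"q1": float(q1), "median": float(median), "q3": float(q3)}
```

Under this rounding rule, a group with nine points uses its last two points for both the 10% and 20% windows and its last three for the 30% window, so the three windows are not independent checks.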

Circularity Check

0 steps flagged

No circularity: purely empirical scaling measurements

The paper reports direct experimental results from training 10 algorithm-tool combinations on nine absolute-stratified subsamples (100k to 100M interactions) drawn from 11 public datasets, then measuring raw NDCG@10, applying within-group min-max normalization, and computing late-stage slopes. No equations, first-principles derivations, fitted parameters, or predictions are claimed; the central claim that more data remains beneficial is presented as an observation from these runs. The sampling procedure and normalization are methodological choices whose validity is external to the reported trends, with no reduction of any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study uses standard evaluation practices without introducing new free parameters, axioms beyond domain norms, or invented entities.

axioms (2)

domain assumption: NDCG@10 is a suitable proxy for recommendation quality. It is a standard metric in the field, but it assumes ranking correlates with user utility.
domain assumption: Absolute stratified user sampling produces representative subsets for scaling analysis. It is invoked to generate the nine dataset sizes from the full data.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "Using absolute stratified user sampling, we trained models on nine sample sizes from 100,000 to 100,000,000 interactions and measured NDCG@10. ... late-stage slope analysis over the final 10-30% of each group further supported this upward trend"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "Overall, raw NDCG usually increased with sample size, with no observable saturation point."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
