
arxiv: 2604.07869 · v1 · submitted 2026-04-09 · 💻 cs.IR · cs.LG

Ensembles at Any Cost? Accuracy-Energy Trade-offs in Recommender Systems

Pith reviewed 2026-05-10 18:29 UTC · model grok-4.3

classification 💻 cs.IR cs.LG
keywords recommender systems · ensemble methods · energy efficiency · accuracy trade-offs · carbon emissions · machine learning sustainability · MovieLens

The pith

Ensemble methods improve recommender accuracy by 0.3–5.7% but raise energy use by 19–2,549%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether ensemble methods justify their accuracy gains in recommender systems when whole-system energy consumption is taken into account. It runs 93 controlled experiments across two pipelines and four datasets, comparing four ensemble strategies to strong single-model baselines while measuring energy with a smart plug and converting results to CO2 equivalents. Across settings the ensembles deliver only modest accuracy lifts yet multiply energy draw and emissions, sometimes by more than twenty times. Selective strategies that combine only the best models prove far more efficient than exhaustive averaging of all predictors. A sympathetic reader would care because recommender systems run at massive scale, where even small per-query energy differences affect operating costs and environmental impact.

Core claim

Across explicit-rating and implicit-feedback tasks, ensemble methods raise accuracy by 0.3% to 5.7% while increasing energy consumption by 19% to 2,549%. Selective Top Performers ensembles show the smallest overheads and exhaustive averaging the largest. On the Anime dataset, a Surprise ensemble draws roughly twenty times the energy of the single-model baseline, and LensKit ensembles fail because of memory limits.

What carries the argument

Direct comparison of Average, Weighted, Stacking/Rank Fusion, and Top Performers ensembles against optimized single models (SVD++, etc.) using RMSE and NDCG@10, with whole-system energy captured by EMERS smart-plug measurements and converted to CO2 equivalents.
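To make the compared strategies concrete, here is a minimal sketch of three of them (Average, Weighted, Top Performers) applied to rating predictions and scored with RMSE. It is not the paper's implementation: the function names, the inverse-RMSE weighting rule, and all toy numbers are assumptions for illustration, and Stacking/Rank Fusion is left out.

```python
import math

def rmse(preds, truth):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, truth)) / len(truth))

def average_ensemble(model_preds):
    """Unweighted mean of every model's prediction, item by item."""
    return [sum(col) / len(col) for col in zip(*model_preds)]

def weighted_ensemble(model_preds, weights):
    """Weighted mean of the models' predictions, item by item."""
    total = sum(weights)
    return [sum(w * p for w, p in zip(weights, col)) / total
            for col in zip(*model_preds)]

def top_performers(model_preds, val_rmse, k):
    """Average only the k models with the lowest validation RMSE."""
    best = sorted(range(len(model_preds)), key=lambda i: val_rmse[i])[:k]
    return average_ensemble([model_preds[i] for i in best])

# Toy data (hypothetical): three models, three test ratings.
truth = [4.0, 3.0, 5.0]
model_preds = [
    [4.2, 2.8, 4.9],
    [3.5, 3.5, 4.0],
    [4.1, 3.2, 5.0],
]
val_rmse = [0.90, 1.10, 0.88]          # hypothetical validation scores
weights = [1.0 / r for r in val_rmse]  # assumed weighting rule: inverse RMSE

candidates = {
    "average": average_ensemble(model_preds),
    "weighted": weighted_ensemble(model_preds, weights),
    "top-2": top_performers(model_preds, val_rmse, k=2),
}
for name, preds in candidates.items():
    print(name, round(rmse(preds, truth), 4))
```

The only structural difference between the exhaustive and selective variants in this sketch is how many models contribute to the final prediction, which is the axis along which the paper reports the largest energy differences.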

If this is right

  • Selective ensembles such as Top Performers deliver accuracy gains at far lower energy overhead than exhaustive averaging.
  • On MovieLens 1M a Top Performers ensemble improves RMSE by 0.96% at only 18.8% extra energy.
  • On Anime an ensemble improves RMSE by 1.2% yet consumes 2,005% more energy and raises emissions from 2.6 to 53.8 mg CO2 equivalents.
  • Some LensKit ensemble configurations fail on larger datasets due to memory limits.
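The overhead percentages above can be rechecked from the rounded figures quoted in the abstract. The sketch below does that arithmetic and also computes an accuracy gain per percentage point of extra energy. That ratio and the helper names are illustrative assumptions, not metrics defined in the paper, and RMSE and NDCG@10 gains are not strictly comparable with each other.

```python
def percent_increase(baseline, value):
    """Relative increase of value over baseline, in percent."""
    return (value - baseline) / baseline * 100.0

def gain_per_overhead(accuracy_gain_pct, energy_overhead_pct):
    """Accuracy gain (percent) bought per percent of extra energy (illustrative ratio)."""
    return accuracy_gain_pct / energy_overhead_pct

# Rounded Anime figures as quoted in the abstract.
energy_overhead = percent_increase(0.01, 0.21)  # Wh, single model vs. ensemble
co2_overhead = percent_increase(2.6, 53.8)      # mg CO2 equivalents
print(f"energy overhead from rounded Wh values: {energy_overhead:.0f}%")
print(f"emissions overhead from rounded mg values: {co2_overhead:.0f}%")

# (accuracy gain %, energy overhead %) pairs as reported in the abstract.
reported = {
    "MovieLens 1M, Top Performers (RMSE)": (0.96, 18.8),
    "MovieLens 100K, averaging (NDCG@10)": (5.7, 103.0),
    "Anime, Top Performers (RMSE)": (1.2, 2005.0),
}
for setting, (gain, overhead) in reported.items():
    print(f"{setting}: {gain_per_overhead(gain, overhead):.4f}")
```

The rounded inputs give roughly 2,000% for energy and 1,970% for emissions, close to but not exactly the reported 2,005%. The gap is consistent with the paper computing its percentage from unrounded measurements, though the summary here does not confirm that.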

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of production recommenders may need to treat energy as a first-class constraint rather than an afterthought.
  • Carbon accounting for recommender pipelines could be extended to include training as well as inference stages.
  • Hardware-aware or approximate ensemble techniques might narrow the energy gap without sacrificing accuracy.

Load-bearing premise

The smart-plug readings of whole-system energy accurately isolate the computational cost of the recommender pipelines without confounding effects from hardware variability or measurement noise.
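One way to probe this premise is to subtract the machine's idle draw from the whole-system reading, so that only energy above idle is attributed to the pipeline. The sketch below assumes power samples arrive as (seconds, watts) pairs and that an idle wattage has been measured separately. Both the sample format and the function names are hypothetical, and nothing here describes how EMERS itself processes readings.

```python
def energy_wh(samples):
    """Trapezoidal integral of (seconds, watts) samples, in watt-hours."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += (p0 + p1) / 2.0 * (t1 - t0)
    return joules / 3600.0

def net_energy_wh(samples, idle_watts):
    """Whole-system energy minus what an idle machine would draw over the same span."""
    duration_s = samples[-1][0] - samples[0][0]
    return energy_wh(samples) - idle_watts * duration_s / 3600.0

# Toy trace (hypothetical): 3 seconds, idle at 40 W, busy at 60 W.
trace = [(0, 40.0), (1, 60.0), (2, 60.0), (3, 40.0)]
gross = energy_wh(trace)
net = net_energy_wh(trace, idle_watts=40.0)
print(f"gross: {gross:.5f} Wh, net of idle: {net:.5f} Wh")
```

In this toy trace most of the gross energy is idle draw, which illustrates why percentage overheads computed on very small whole-system readings can be sensitive to the baseline.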

What would settle it

A controlled run in which an ensemble achieves the reported accuracy gains while consuming no more energy than the strongest single model baseline would falsify the central trade-off claim.

Original abstract

Ensemble methods are frequently used in recommender systems to improve accuracy by combining multiple models. Recent work reports sizable performance gains, but most studies still optimize primarily for accuracy and robustness rather than for energy efficiency. This paper measures accuracy-energy trade-offs of ensemble techniques relative to strong single models. We run 93 controlled experiments in two pipelines: (1) explicit rating prediction with Surprise (RMSE) and (2) implicit feedback ranking with LensKit (NDCG@10). We evaluate four datasets ranging from 100,000 to 7.8 million interactions (MovieLens 100K, MovieLens 1M, ModCloth, Anime). We compare four ensemble strategies (Average, Weighted, Stacking or Rank Fusion, Top Performers) against baselines and optimized single models. Whole-system energy is measured with EMERS using a smart plug and converted to CO2 equivalents. Across settings, ensembles improve accuracy by 0.3% to 5.7% while increasing energy by 19% to 2,549%. On MovieLens 1M, a Top Performers ensemble improves RMSE by 0.96% at an 18.8% energy overhead over SVD++. On MovieLens 100K, an averaging ensemble improves NDCG@10 by 5.7% with 103% additional energy. On Anime, a Surprise Top Performers ensemble improves RMSE by 1.2% but consumes 2,005% more energy (0.21 vs. 0.01 Wh), increasing emissions from 2.6 to 53.8 mg CO2 equivalents, and LensKit ensembles fail due to memory limits. Overall, selective ensembles are more energy efficient than exhaustive averaging,

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports results from 93 controlled experiments comparing four ensemble strategies (Average, Weighted, Stacking/Rank Fusion, Top Performers) against single models in two recommender pipelines: explicit rating prediction using Surprise (RMSE) and implicit feedback ranking using LensKit (NDCG@10). Experiments span four datasets (MovieLens 100K, MovieLens 1M, ModCloth, Anime) with whole-system energy measured via EMERS smart plug and converted to CO2 equivalents. The central claim is that ensembles deliver accuracy gains of 0.3%–5.7% at energy overheads of 19%–2,549%, with selective ensembles (e.g., Top Performers) more efficient than exhaustive averaging; specific examples include a 0.96% RMSE improvement at 18.8% energy cost on MovieLens 1M and a 5.7% NDCG@10 gain at 103% extra energy on MovieLens 100K.

Significance. If the energy measurements can be validated as isolating marginal pipeline costs, the work would be significant for quantifying sustainability trade-offs in recommender systems, an area where accuracy has historically dominated. The broad experimental scope across datasets, libraries, and ensemble types provides practical evidence that modest accuracy gains can incur large energy penalties, supporting calls for energy-aware design in IR. Direct measurement rather than proxies is a strength, though only if methodological controls are strengthened.

major comments (3)
  1. [Energy measurement methodology] Energy measurement methodology (as described for the Anime dataset): the 0.01 Wh baseline rising to 0.21 Wh (2,005% increase) and similar large percentages (up to 2,549%) rest on whole-system smart-plug readings for short runs. No repeated trials, variance estimates, or cross-validation against CPU-level counters (RAPL, perf) are reported, so fixed overheads (interpreter startup, data loading, OS activity) likely dominate and prevent confident attribution of the reported overheads to ensemble computation itself. This directly undermines the accuracy-energy trade-off numbers that form the paper's central claim.
  2. [Results section] Results section: accuracy improvements (0.3%–5.7%) and energy differences are presented without error bars, confidence intervals, or statistical tests (e.g., paired t-tests or Wilcoxon signed-rank). With 93 experiments but no assessment of variability or significance, it is impossible to determine whether reported gains exceed measurement noise, especially given the small absolute energy values involved.
  3. [Experimental setup] Experimental setup: insufficient detail is given on hyperparameter tuning for baselines and ensembles, exact run durations, hardware configuration, and controls for background processes. This prevents verification that the 93 experiments are truly controlled and reproducible, which is load-bearing for claims of consistent trade-offs across pipelines and datasets.
minor comments (2)
  1. [Abstract] The abstract claims selective ensembles are more energy efficient than exhaustive averaging, but the supporting quantitative comparisons are scattered; a consolidated table or figure summarizing overheads by ensemble type across all settings would improve clarity.
  2. [Results] LensKit ensembles failing on Anime due to memory limits is noted but not explored further (e.g., via scaling experiments or alternative implementations); this limits the generalizability of the implicit-feedback results.
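The first two major comments ask for repeated trials, variance estimates, and paired significance tests. A minimal sketch of that analysis, using only the Python standard library, might look like the following. The readings are placeholder numbers in arbitrary units, not measurements from the paper, and the function names are assumptions.

```python
import math
import statistics

def summarize(values):
    """Mean and sample standard deviation of repeated measurements."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(a, b):
    """t statistic for paired samples a and b (degrees of freedom: n - 1)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Placeholder readings for five paired runs (arbitrary units, NOT from the paper).
ensemble_runs = [1.30, 1.25, 1.40, 1.35, 1.28]
single_runs = [1.00, 1.05, 0.98, 1.02, 0.97]

for label, runs in [("ensemble", ensemble_runs), ("single", single_runs)]:
    mean, sd = summarize(runs)
    print(f"{label}: mean={mean:.3f}, sd={sd:.3f}")

t = paired_t_statistic(ensemble_runs, single_runs)
print(f"paired t statistic (df={len(single_runs) - 1}): {t:.2f}")
```

With five paired runs there are four degrees of freedom, for which the two-sided 5% critical value of the t distribution is about 2.776. A statistic below that threshold would mean the difference cannot be distinguished from run-to-run noise at that level.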

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review, which highlights important methodological improvements needed to strengthen the paper's claims on accuracy-energy trade-offs. We address each major comment below and describe the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: Energy measurement methodology (as described for the Anime dataset): the 0.01 Wh baseline rising to 0.21 Wh (2,005% increase) and similar large percentages (up to 2,549%) rest on whole-system smart-plug readings for short runs. No repeated trials, variance estimates, or cross-validation against CPU-level counters (RAPL, perf) are reported, so fixed overheads (interpreter startup, data loading, OS activity) likely dominate and prevent confident attribution of the reported overheads to ensemble computation itself. This directly undermines the accuracy-energy trade-off numbers that form the paper's central claim.

    Authors: We agree that the energy measurements, based on single whole-system smart-plug readings for short runs, are vulnerable to fixed overheads and lack reported variance, which reduces confidence in attributing exact percentage increases solely to ensemble computation. In the revised manuscript, we will add repeated trials (minimum of 5 runs per configuration) and report means with standard deviations for all energy values. We will also expand the methodology discussion to explicitly address the limitations of whole-system measurement, including the influence of non-computational overheads, and qualify the trade-off claims accordingly. Cross-validation against RAPL or perf was not performed in the original experiments; we will note this as a limitation and suggest it for future work, while maintaining that relative efficiency differences between selective and exhaustive ensembles remain directionally informative. revision: partial

  2. Referee: Results section: accuracy improvements (0.3%–5.7%) and energy differences are presented without error bars, confidence intervals, or statistical tests (e.g., paired t-tests or Wilcoxon signed-rank). With 93 experiments but no assessment of variability or significance, it is impossible to determine whether reported gains exceed measurement noise, especially given the small absolute energy values involved.

    Authors: We acknowledge that the lack of error bars and statistical tests makes it difficult to assess whether the reported accuracy gains and energy differences exceed variability. In the revision, we will incorporate error bars based on the repeated measurements described above for both accuracy and energy metrics. We will also add statistical significance testing (e.g., paired t-tests for accuracy comparisons and appropriate tests for energy) to evaluate whether observed improvements are statistically meaningful beyond noise. revision: yes

  3. Referee: Experimental setup: insufficient detail is given on hyperparameter tuning for baselines and ensembles, exact run durations, hardware configuration, and controls for background processes. This prevents verification that the 93 experiments are truly controlled and reproducible, which is load-bearing for claims of consistent trade-offs across pipelines and datasets.

    Authors: We agree that additional detail is required for full reproducibility and verification of experimental controls. The revised manuscript will expand the experimental setup section to specify hyperparameter tuning procedures (including search ranges and selection criteria for baselines and ensembles), exact run durations, complete hardware specifications (CPU model, memory, OS), and controls for background processes (e.g., isolated execution environments and minimized system activity). We will also release the full experimental code, configurations, and data processing scripts to support independent verification. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or fitted predictions

Full rationale

The paper consists entirely of direct empirical measurements from 93 controlled experiments across two recommender pipelines (Surprise and LensKit) and four datasets. It reports observed accuracy improvements (0.3%–5.7%) and energy increases (19%–2,549%) from ensemble strategies versus single models, using whole-system smart-plug readings converted to CO2 equivalents. No equations, derivations, first-principles results, fitted parameters, or predictions are claimed that could reduce to self-definitional inputs, fitted subsets, or self-citation chains. The central claims rest on raw experimental data rather than any mathematical reduction or ansatz. This is the expected outcome for a measurement-focused study; the skeptic's concerns address measurement validity (e.g., overhead dominance on short runs) but do not indicate circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical measurement study with no free parameters, axioms, or invented entities beyond standard experimental assumptions in machine learning evaluation.

pith-pipeline@v0.9.0 · 5623 in / 1064 out tokens · 54304 ms · 2026-05-10T18:29:47.557577+00:00 · methodology

