Canonical reference

Energy and Policy Considerations for Deep Learning in NLP

Emma Strubell, Ananya Ganesh, Andrew McCallum · 2019 · cs.CL · DOI 10.18653/v1/p19-1355 · arXiv 1906.02243

Canonical reference. 80% of citing Pith papers cite this work as background.

47 Pith papers citing it

1,853 external citations · Pith

Background 80% of classified citations

abstract

Recent progress in hardware and methodology for training neural networks has ushered in a new generation of large networks trained on abundant data. These models have obtained notable gains in accuracy across many NLP tasks. However, these accuracy improvements depend on the availability of exceptionally large computational resources that necessitate similarly substantial energy consumption. As a result these models are costly to train and develop, both financially, due to the cost of hardware and electricity or cloud compute time, and environmentally, due to the carbon footprint required to fuel modern tensor processing hardware. In this paper we bring this issue to the attention of NLP researchers by quantifying the approximate financial and environmental costs of training a variety of recently successful neural network models for NLP. Based on these findings, we propose actionable recommendations to reduce costs and improve equity in NLP research and practice.

citation-role summary

background 5

citation-polarity summary

background 4 unclear 1

representative citing papers

EURO-5K: When Does Domain Pretraining Matter? Benchmarking Transformers for EU Reporting Obligation Extraction

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

Introduces EURO-5K dataset from 136 EU acts and benchmarks full fine-tuning vs QLoRA for BERT and LLM models on reporting obligation extraction, reporting 0.89 F1 with limited gains from legal pretraining except under parameter-efficient adaptation.

Wake Vision: A Tailored Dataset and Benchmark Suite for TinyML Computer Vision Applications

cs.CV · 2024-05-01 · unverdicted · novelty 7.0

Wake Vision pipeline produces a 6M-image person detection dataset for TinyML with 2.2% label error, improving model accuracy up to 6.6% over prior VWW benchmark across architectures and subsets.

The Curse of Recursion: Training on Generated Data Makes Models Forget

cs.LG · 2023-05-27 · conditional · novelty 7.0

Use of model-generated content in training causes irreversible loss of distribution tails, termed model collapse, in VAEs, GMMs, and LLMs.

Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA

cs.CL · 2021-10-04 · unverdicted · novelty 7.0

Proposes a textbook-based true/false QA task where PTLMs score ~50% closed-book even after pre-training on the text and ~60% open-book with retrieval.

The Hidden Cost of Thinking: Energy Use and Environmental Impact of LMs Beyond Pretraining

cs.CY · 2026-05-01 · unverdicted · novelty 7.0

Full development of 7B and 32B Olmo 3 models used 12.3 GWh datacenter energy and emitted 4,251 tCO2eq, with development overheads accounting for 82% of compute and reasoning models costing 17x more to post-train than instruction-tuned ones.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

cs.LG · 2021-01-11 · accept · novelty 7.0

Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

Rethinking Attention with Performers

cs.LG · 2020-09-30 · unverdicted · novelty 7.0

Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and protein tasks.

Ravines in quantum cost landscapes: opportunities for improved VQA predictions

quant-ph · 2026-07-01 · unverdicted · novelty 6.0

NEB-adapted ravine ensembles for QNNs classifying concentratable entanglement outperform naive methods when local-prediction variability is high and reduce costs, with ravines persisting under depth and qubit scaling.

MÖVE: A Holistic LLM Benchmark for the German Public Sector

cs.CL · 2026-06-11 · unverdicted · novelty 6.0

MÖVE presents a new German-language benchmark evaluating 39 LLMs on performance and governance criteria using ten public-administration datasets.

Accounting for AI Inference in Corporate GHG Inventories: A Four-Tier Methodology for Scope 3 Category 1 Reporting

cs.CY · 2026-06-09 · unverdicted · novelty 6.0

A four-tier framework for AI inference GHG emissions in Scope 3 reporting, progressing from direct physical estimation using GPU benchmarks to EEIO spend-based methods, with a case showing low total emissions.

MAAM: Anchor-Preserving Compression and Contextual Calibration for Chinese Discriminatory Language Detection

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

MAAM is a model-agnostic framework that improves Chinese discriminatory language detection via anchor preservation and C-I-S contextual calibration, supported by a new 8,120-sample ChLGBT dataset.

EnergyLens: Predictive Energy-Aware Exploration for Multi-GPU LLM Inference Optimization

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

EnergyLens predicts multi-GPU LLM inference energy consumption with 9-13% MAPE and identifies configurations with up to 52x energy efficiency differences.

Minimum-Action Learning: Energy-Constrained Symbolic Model Selection for Physical Law Identification from Noisy Data

cs.LG · 2026-03-16 · unverdicted · novelty 6.0

MAL recovers correct symbolic force laws like Kepler gravity from noisy data by minimizing trajectory reconstruction, sparsity, and energy violation, reaching 100% identification via energy criterion on benchmarks.

GPT-NeoX-20B: An Open-Source Autoregressive Language Model

cs.CL · 2022-04-14 · accept · novelty 6.0

GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.

Deduplicating Training Data Makes Language Models Better

cs.CL · 2021-07-14 · unverdicted · novelty 6.0

Deduplicating training datasets reduces language model verbatim memorization by 10x, improves training efficiency, and enables more accurate evaluation by cutting train-test overlap.

Query-efficient model evaluation using cached responses

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

DKPS-based methods predict new model benchmark scores using cached responses, matching baseline mean absolute error with substantially fewer queries and an offline query selection approach.

On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference

cs.CR · 2026-05-06 · conditional · novelty 6.0

An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.

A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

Extraction of linearized models from pre-trained networks via knowledge distillation

cs.LG · 2026-04-08 · unverdicted · novelty 6.0

Koopman theory plus knowledge distillation yields linearized models from pre-trained nets that outperform standard least-squares Koopman approximations on MNIST and Fashion-MNIST in accuracy and stability.

ST-MoE: Designing Stable and Transferable Sparse Expert Models

cs.CL · 2022-02-17 · unverdicted · novelty 6.0

ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.

Ethical and social risks of harm from Language Models

cs.CL · 2021-12-08 · accept · novelty 6.0

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.

AI Native Games: A Survey and Roadmap

cs.AI · 2026-07-01 · unverdicted · novelty 5.0

The paper proposes a counterfactual definition of AI-native games, screens 53 examples, introduces a G/N taxonomy, and outlines a research roadmap for the field.

citing papers explorer

Showing 39 of 39 citing papers after filters.

EURO-5K: When Does Domain Pretraining Matter? Benchmarking Transformers for EU Reporting Obligation Extraction cs.CL · 2026-06-02 · unverdicted · none · ref 95 · internal anchor
Introduces EURO-5K dataset from 136 EU acts and benchmarks full fine-tuning vs QLoRA for BERT and LLM models on reporting obligation extraction, reporting 0.89 F1 with limited gains from legal pretraining except under parameter-efficient adaptation.
Wake Vision: A Tailored Dataset and Benchmark Suite for TinyML Computer Vision Applications cs.CV · 2024-05-01 · unverdicted · none · ref 18 · internal anchor
Wake Vision pipeline produces a 6M-image person detection dataset for TinyML with 2.2% label error, improving model accuracy up to 6.6% over prior VWW benchmark across architectures and subsets.
Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA cs.CL · 2021-10-04 · unverdicted · none · ref 32 · internal anchor
Proposes a textbook-based true/false QA task where PTLMs score ~50% closed-book even after pre-training on the text and ~60% open-book with retrieval.
The Hidden Cost of Thinking: Energy Use and Environmental Impact of LMs Beyond Pretraining cs.CY · 2026-05-01 · unverdicted · none · ref 12
Full development of 7B and 32B Olmo 3 models used 12.3 GWh datacenter energy and emitted 4,251 tCO2eq, with development overheads accounting for 82% of compute and reasoning models costing 17x more to post-train than instruction-tuned ones.
OPT: Open Pre-trained Transformer Language Models cs.CL · 2022-05-02 · unverdicted · none · ref 290
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Flamingo: a Visual Language Model for Few-Shot Learning cs.CV · 2022-04-29 · unverdicted · none · ref 106
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
Rethinking Attention with Performers cs.LG · 2020-09-30 · unverdicted · none · ref 149
Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and protein tasks.
Ravines in quantum cost landscapes: opportunities for improved VQA predictions quant-ph · 2026-07-01 · unverdicted · none · ref 109 · internal anchor
NEB-adapted ravine ensembles for QNNs classifying concentratable entanglement outperform naive methods when local-prediction variability is high and reduce costs, with ravines persisting under depth and qubit scaling.
MÖVE: A Holistic LLM Benchmark for the German Public Sector cs.CL · 2026-06-11 · unverdicted · none · ref 113 · internal anchor
MÖVE presents a new German-language benchmark evaluating 39 LLMs on performance and governance criteria using ten public-administration datasets.
Accounting for AI Inference in Corporate GHG Inventories: A Four-Tier Methodology for Scope 3 Category 1 Reporting cs.CY · 2026-06-09 · unverdicted · none · ref 6 · internal anchor
A four-tier framework for AI inference GHG emissions in Scope 3 reporting, progressing from direct physical estimation using GPU benchmarks to EEIO spend-based methods, with a case showing low total emissions.
MAAM: Anchor-Preserving Compression and Contextual Calibration for Chinese Discriminatory Language Detection cs.CL · 2026-06-08 · unverdicted · none · ref 27 · internal anchor
MAAM is a model-agnostic framework that improves Chinese discriminatory language detection via anchor preservation and C-I-S contextual calibration, supported by a new 8,120-sample ChLGBT dataset.
EnergyLens: Predictive Energy-Aware Exploration for Multi-GPU LLM Inference Optimization cs.LG · 2026-05-14 · unverdicted · none · ref 13 · internal anchor
EnergyLens predicts multi-GPU LLM inference energy consumption with 9-13% MAPE and identifies configurations with up to 52x energy efficiency differences.
Minimum-Action Learning: Energy-Constrained Symbolic Model Selection for Physical Law Identification from Noisy Data cs.LG · 2026-03-16 · unverdicted · none · ref 31 · internal anchor
MAL recovers correct symbolic force laws like Kepler gravity from noisy data by minimizing trajectory reconstruction, sparsity, and energy violation, reaching 100% identification via energy criterion on benchmarks.
Deduplicating Training Data Makes Language Models Better cs.CL · 2021-07-14 · unverdicted · none · ref 39 · internal anchor
Deduplicating training datasets reduces language model verbatim memorization by 10x, improves training efficiency, and enables more accurate evaluation by cutting train-test overlap.
Query-efficient model evaluation using cached responses cs.LG · 2026-05-08 · unverdicted · none · ref 59
DKPS-based methods predict new model benchmark scores using cached responses, matching baseline mean absolute error with substantially fewer queries and an offline query selection approach.
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management cs.LG · 2026-05-04 · unverdicted · none · ref 276
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
Extraction of linearized models from pre-trained networks via knowledge distillation cs.LG · 2026-04-08 · unverdicted · none · ref 2
Koopman theory plus knowledge distillation yields linearized models from pre-trained nets that outperform standard least-squares Koopman approximations on MNIST and Fashion-MNIST in accuracy and stability.
ST-MoE: Designing Stable and Transferable Sparse Expert Models cs.CL · 2022-02-17 · unverdicted · none · ref 71
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.
AI Native Games: A Survey and Roadmap cs.AI · 2026-07-01 · unverdicted · none · ref 81 · internal anchor
The paper proposes a counterfactual definition of AI-native games, screens 53 examples, introduces a G/N taxonomy, and outlines a research roadmap for the field.
Mapping the Schedule x Bit-Width Boundary in Sub-100M Quantisation-Aware Training cs.LG · 2026-05-25 · unverdicted · none · ref 24 · internal anchor
Factorial experiments with over 1300 runs falsify the hypothesis that INT6 QAT needs a different LR schedule from higher precision and identify a 50M-parameter boundary for INT4 schedule sensitivity.
SAS: Semantic-aware Sampling for Generative Dataset Distillation cs.CV · 2026-05-18 · unverdicted · none · ref 3 · internal anchor
SAS adds semantic scoring with CLIP and a two-stage filter-then-diversity selection process to make generative dataset distillation produce more class-discriminative and diverse compact datasets.
Soft Learning cs.LG · 2026-05-16 · unverdicted · none · ref 2 · internal anchor
Soft Learning optimally combines heterogeneous ML specialists via cross-validated non-negative least squares, achieving top performance on 70% of 37 datasets with formal guarantees and 72-435x CPU speedups over deep networks.
The Cognitive Kardashev Scale: Quantifying the Material Envelope of Civilisational Computation physics.soc-ph · 2026-05-11 · unverdicted · none · ref 19 · internal anchor
Defines a Cognitive Kardashev Scale using total power, cognitive fraction f, compute efficiency η, and brain reference to place current humanity at K ≈ 0.73 and estimate Type I/II capacities.
AI of the People, by the People, for the People: A Social Choice Approach to Collective Control of Artificial Intelligence cs.CY · 2026-04-14 · unverdicted · none · ref 96 · internal anchor
Proposes applying social choice theory as a modeling language and axiomatic tool for incorporating collective input across the ML development pipeline.
Physics Priors Offer Useful Accuracy-Carbon Trade-Offs in Spatio-Temporal Forecasting cs.LG · 2025-09-29 · unverdicted · none · ref 41 · internal anchor
Stronger physics priors in neural networks for spatio-temporal shear flow forecasting yield substantially lower training carbon footprints than weak or no priors, though inference savings are less consistent.
Quantifying the Carbon Emissions of Machine Learning cs.CY · 2019-10-21 · unverdicted · none · ref 1 · internal anchor
Presents a calculator tool for estimating carbon emissions from ML model training along with mitigation actions.
Convolutional Dictionary Learning in Hierarchical Networks cs.LG · 2019-07-23 · unverdicted · none · ref 11 · internal anchor
A hierarchical convolutional dictionary learning model for piecewise smooth signals using recursive scale-detail filtering and sparse coding, learned by alternating minimization and demonstrated on MNIST.
Position: LLM Inference Should Be Evaluated as Energy-to-Token Production cs.CE · 2026-05-12 · unverdicted · none · ref 14
LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
GreenDyGNN: Runtime-Adaptive Energy-Efficient Communication for Distributed GNN Training cs.DC · 2026-04-25 · unverdicted · none · ref 18
GreenDyGNN applies Double-DQN to adapt cache management in distributed GNN training, cutting energy by up to 43% under congestion versus static policies.
Assessing the Carbon Emissions and Energy Consumption of U.S. Hyperscale Data Centers cs.AI · 2026-06-03 · unverdicted · none · ref 12 · internal anchor
US hyperscale data centers consumed 68-99 TWh electricity and emitted 37-54 Mt CO2, representing 1.8% of US electricity use with average carbon intensity 48% above the national grid average.
Evaluation of ML Resource Utilization Requires Model Life Cycle Assessment cs.LG · 2026-05-31 · unverdicted · none · ref 122 · internal anchor
The paper calls for life cycle assessment to capture embodied hardware costs and full pipeline operational costs in AI development and deployment.
MedicalRec: Medical recommender system for image classification without retraining cs.LG · 2026-05-23 · unverdicted · none · ref 12 · internal anchor
A transformer recommender system trained on a new benchmark of over 5,000 model performances from medical imaging papers achieves up to 75.5% HitRate@100.
Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility cs.LG · 2026-05-07 · unverdicted · none · ref 202 · 2 links
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
Modest, artistic, and radical solutions to the environmental impact of image-generating machine learning cs.CY · 2026-06-18 · unverdicted · none · ref 44 · internal anchor
Surveys energy footprints of image ML and proposes modest technical solutions including tiny models, low-precision hardware, and true-cost accounting driven by critiques of shareholder efficiency metrics.
Assessing the Energy and Carbon Emissions of Neural Speaker Verification Model in Training and Inference cs.SD · 2026-06-06 · unverdicted · none · ref 14 · internal anchor
Empirical study finds diminishing accuracy returns against steep energy growth for deeper and wider ResNet speaker verification models on VoxCeleb2.
Evaluating LLM Usage for Efficient and Explainable Numerical and Classified Implicit Sentiment Analysis of Product Desirability cs.CL · 2026-06-04 · unverdicted · none · ref 58 · internal anchor
LLMs achieve Pearson correlations up to 0.97 and 94% classification accuracy on product desirability sentiment from qualitative data, outperforming lexicon and transformer baselines while providing confidence ratings and rationales.
AgriIR: A Scalable Framework for Domain-Specific Knowledge Retrieval cs.IR · 2026-03-17 · unverdicted · none · ref 38 · internal anchor
AgriIR is a configurable RAG framework using modular stages and 1B-parameter models to deliver grounded, citable answers for Indian agricultural information access.
Spiking Neural Network Architecture Search: A Survey cs.NE · 2025-10-16 · unverdicted · none · ref 112 · internal anchor
A survey of Spiking Neural Network architecture search techniques viewed through a hardware/software co-design lens.
Quantum-inspired tensor networks in machine learning models cs.LG · 2026-04-15 · unverdicted · none · ref 116
Tensor networks developed for quantum states are reviewed as tools for machine learning models, with assessment of their potential computational, explanatory, and privacy advantages alongside remaining challenges.
