Introduces EURO-5K dataset from 136 EU acts and benchmarks full fine-tuning vs QLoRA for BERT and LLM models on reporting obligation extraction, reporting 0.89 F1 with limited gains from legal pretraining except under parameter-efficient adaptation.
hub Canonical reference
Energy and Policy Considerations for Deep Learning in NLP
Canonical reference. 80% of citing Pith papers cite this work as background.
abstract
Recent progress in hardware and methodology for training neural networks has ushered in a new generation of large networks trained on abundant data. These models have obtained notable gains in accuracy across many NLP tasks. However, these accuracy improvements depend on the availability of exceptionally large computational resources that necessitate similarly substantial energy consumption. As a result these models are costly to train and develop, both financially, due to the cost of hardware and electricity or cloud compute time, and environmentally, due to the carbon footprint required to fuel modern tensor processing hardware. In this paper we bring this issue to the attention of NLP researchers by quantifying the approximate financial and environmental costs of training a variety of recently successful neural network models for NLP. Based on these findings, we propose actionable recommendations to reduce costs and improve equity in NLP research and practice.
hub tools
citation-role summary
citation-polarity summary
roles
background 5representative citing papers
Wake Vision pipeline produces a 6M-image person detection dataset for TinyML with 2.2% label error, improving model accuracy up to 6.6% over prior VWW benchmark across architectures and subsets.
Use of model-generated content in training causes irreversible loss of distribution tails, termed model collapse, in VAEs, GMMs, and LLMs.
Proposes a textbook-based true/false QA task where PTLMs score ~50% closed-book even after pre-training on the text and ~60% open-book with retrieval.
Full development of 7B and 32B Olmo 3 models used 12.3 GWh datacenter energy and emitted 4,251 tCO2eq, with development overheads accounting for 82% of compute and reasoning models costing 17x more to post-train than instruction-tuned ones.
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and protein tasks.
MÖVE presents a new German-language benchmark evaluating 39 LLMs on performance and governance criteria using ten public-administration datasets.
A four-tier framework for AI inference GHG emissions in Scope 3 reporting, progressing from direct physical estimation using GPU benchmarks to EEIO spend-based methods, with a case showing low total emissions.
EnergyLens predicts multi-GPU LLM inference energy consumption with 9-13% MAPE and identifies configurations with up to 52x energy efficiency differences.
MAL recovers correct symbolic force laws like Kepler gravity from noisy data by minimizing trajectory reconstruction, sparsity, and energy violation, reaching 100% identification via energy criterion on benchmarks.
GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.
Deduplicating training datasets reduces language model verbatim memorization by 10x, improves training efficiency, and enables more accurate evaluation by cutting train-test overlap.
An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
Koopman theory plus knowledge distillation yields linearized models from pre-trained nets that outperform standard least-squares Koopman approximations on MNIST and Fashion-MNIST in accuracy and stability.
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
SAS adds semantic scoring with CLIP and a two-stage filter-then-diversity selection process to make generative dataset distillation produce more class-discriminative and diverse compact datasets.
Soft Learning optimally combines heterogeneous ML specialists via cross-validated non-negative least squares, achieving top performance on 70% of 37 datasets with formal guarantees and 72-435x CPU speedups over deep networks.
Defines a Cognitive Kardashev Scale using total power, cognitive fraction f, compute efficiency η, and brain reference to place current humanity at K ≈ 0.73 and estimate Type I/II capacities.
Proposes applying social choice theory as a modeling language and axiomatic tool for incorporating collective input across the ML development pipeline.
citing papers explorer
-
EURO-5K: When Does Domain Pretraining Matter? Benchmarking Transformers for EU Reporting Obligation Extraction
Introduces EURO-5K dataset from 136 EU acts and benchmarks full fine-tuning vs QLoRA for BERT and LLM models on reporting obligation extraction, reporting 0.89 F1 with limited gains from legal pretraining except under parameter-efficient adaptation.
-
Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA
Proposes a textbook-based true/false QA task where PTLMs score ~50% closed-book even after pre-training on the text and ~60% open-book with retrieval.
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
M\"OVE: A Holistic LLM Benchmark for the German Public Sector
MÖVE presents a new German-language benchmark evaluating 39 LLMs on performance and governance criteria using ten public-administration datasets.
-
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.
-
Deduplicating Training Data Makes Language Models Better
Deduplicating training datasets reduces language model verbatim memorization by 10x, improves training efficiency, and enables more accurate evaluation by cutting train-test overlap.
-
ST-MoE: Designing Stable and Transferable Sparse Expert Models
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.
-
Ethical and social risks of harm from Language Models
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
-
How do datasets, developers, and models affect biases in a low-resourced language?: The Case of the Bengali Language
Bengali sentiment analysis models exhibit persistent identity-based biases across datasets and developer backgrounds despite similar semantic content.
-
Evaluating LLM Usage for Efficient and Explainable Numerical and Classified Implicit Sentiment Analysis of Product Desirability
LLMs achieve Pearson correlations up to 0.97 and 94% classification accuracy on product desirability sentiment from qualitative data, outperforming lexicon and transformer baselines while providing confidence ratings and rationales.