Leveraging LLMs to Automate Energy-Aware Refactoring of Parallel Scientific Codes

Matthew T. Dearing; Valerie Taylor; Xingfu Wu; Yiheng Tao; Zhiling Lan

arXiv: 2505.02184 · v3 · submitted 2025-05-04 · 💻 cs.AI · cs.DC · cs.PL · cs.SE

Pith reviewed 2026-05-22 17:17 UTC · model grok-4.3

keywords: LLM, energy efficiency, parallel scientific codes, GPU refactoring, power profiling, automated optimization, code generation

The pith

LLMs guided by runtime power data can refactor parallel scientific codes to cut energy use by about one third on GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can automatically produce energy-efficient versions of parallel scientific codes instead of only functionally correct ones. It builds a multi-stage system that measures actual power use during execution, feeds that data back into prompts for the model, lets the model try fixes, and screens results with another model acting as judge. If the approach works, it would let developers obtain lower-power code for large GPU systems without hand-tuning each benchmark. Tests across twenty-two scientific workloads on two different GPU types produced average savings of 34 to 36 percent in the cases that passed all checks.

Core claim

The paper presents LASSI-EE, an automated LLM-based refactoring framework that generates energy-efficient parallel codes through a multi-stage, iterative approach integrating runtime power profiling, energy-aware prompting, self-correcting feedback loops, and an LLM-as-a-Judge agent for screening generated code. We evaluate LASSI-EE using twenty-two representative scientific benchmarks and applications on NVIDIA A100 and AMD MI100 GPUs. The results indicate an average energy reduction of 36% for MI100 and 34% for A100, across trials that produced passing energy-reducing refactorings.

What carries the argument

LASSI-EE, a multi-stage iterative framework that combines runtime power profiling, energy-aware prompting, self-correcting loops, and LLM-as-a-Judge screening to produce energy-reducing code changes.
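The stages named above can be pictured as a single control loop. What follows is a minimal Python sketch under stated assumptions, not the authors' implementation: the callables `generate`, `judge`, and `build_and_measure`, the prompt wording, the iteration cap, and the choice to run the judge before execution are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    source: str
    energy_joules: float


def refine(
    original_source: str,
    baseline_energy: float,
    generate: Callable[[str], str],                        # hypothetical: prompt -> refactored source
    judge: Callable[[str, str], bool],                     # hypothetical: (original, candidate) -> accept?
    build_and_measure: Callable[[str], Optional[float]],   # hypothetical: source -> joules, None on failure
    max_iterations: int = 5,                               # assumed cap, not taken from the paper
) -> Optional[Candidate]:
    """Return the lowest-energy accepted candidate, or None if none beat the baseline."""
    best: Optional[Candidate] = None
    feedback = f"Baseline energy: {baseline_energy:.1f} J."
    for _ in range(max_iterations):
        # Energy-aware prompt: measured data from earlier runs is fed back to the model.
        prompt = f"{feedback}\nRefactor this code to reduce energy:\n{original_source}"
        candidate = generate(prompt)

        # Screening step: a second model decides whether the candidate is worth running.
        if not judge(original_source, candidate):
            feedback = "The previous candidate was rejected by the judge."
            continue

        # Self-correction: build or run failures become feedback for the next attempt.
        energy = build_and_measure(candidate)
        if energy is None:
            feedback = "The previous candidate failed to build or run."
            continue

        feedback = (
            f"The previous candidate used {energy:.1f} J "
            f"(baseline {baseline_energy:.1f} J)."
        )
        if energy < baseline_energy and (best is None or energy < best.energy_joules):
            best = Candidate(candidate, energy)
    return best
```

In this sketch a candidate counts only if it is accepted by the judge, runs successfully, and measures below the baseline; everything else is turned into feedback text for the next prompt.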

If this is right

Energy consumption of large-scale scientific applications on GPUs can be lowered automatically using empirical execution feedback.
LLMs become practical tools for optimizing parallel codes for power efficiency in addition to correctness and speed.
Refactoring tasks that previously required manual expert effort can be handled through iterative prompting and verification loops.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same feedback-driven loop could be applied to other hardware platforms if comparable power measurement tools exist.
Repeated application across successive code versions might produce cumulative energy improvements over a project's lifetime.

Load-bearing premise

LLM-generated refactorings preserve functional correctness for the original scientific results while delivering the measured energy savings.

What would settle it

A benchmark run where the refactored code passes screening and executes but returns different scientific output from the original or shows no energy reduction.
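The counterexample described above can be stated as a simple predicate. The sketch below is an illustration only: it assumes each program's scientific output can be flattened into a sequence of floats, and the tolerance values are arbitrary placeholders rather than thresholds reported by the paper.

```python
import math
from typing import Sequence


def outputs_match(
    reference: Sequence[float],
    candidate: Sequence[float],
    rel_tol: float = 1e-6,   # assumed tolerance, not from the paper
    abs_tol: float = 1e-9,   # assumed tolerance, not from the paper
) -> bool:
    """Element-wise comparison of two numeric outputs within a tolerance."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


def is_counterexample(
    reference_output: Sequence[float],
    candidate_output: Sequence[float],
    baseline_energy: float,
    candidate_energy: float,
) -> bool:
    """True when a screened, executed candidate differs in output or saves no energy."""
    wrong_output = not outputs_match(reference_output, candidate_output)
    no_saving = candidate_energy >= baseline_energy
    return wrong_output or no_saving
```

A tolerance-based comparison is used because refactorings that reorder floating-point operations can legitimately change the last few bits of a result; an exact equality check would flag those as failures.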

Original abstract

Large language models (LLMs) are increasingly used for generating parallel scientific codes, with a primary focus on generating functionally correct code. Recent work has focused on generating performant code, with an emphasis on its execution time. However, energy efficiency is now recognized as a critical objective, given the significant power demands of large-scale compute systems. This paper addresses the research question of whether LLMs can generate energy-efficient parallel scientific codes when guided by empirical execution feedback. To answer this question, we propose LASSI-EE, an automated LLM-based refactoring framework that generates energy-efficient parallel codes through a multi-stage, iterative approach integrating runtime power profiling, energy-aware prompting, self-correcting feedback loops, and an LLM-as-a-Judge agent for screening generated code. We evaluate LASSI-EE using twenty-two representative scientific benchmarks and applications on NVIDIA A100 and AMD MI100 GPUs. The results indicate an average energy reduction of 36% for MI100 and 34% for A100, across trials that produced passing energy-reducing refactorings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LASSI-EE, a multi-stage LLM-based framework for automated refactoring of parallel scientific codes to reduce energy consumption. It integrates runtime power profiling, energy-aware prompting, self-correcting feedback loops, and an LLM-as-Judge agent. Evaluated on 22 benchmarks and applications on NVIDIA A100 and AMD MI100 GPUs, it reports average energy reductions of 36% (MI100) and 34% (A100) across trials that produced passing energy-reducing refactorings.

Significance. If the central results hold after addressing the missing success-rate and verification details, the work would be significant for demonstrating practical LLM-driven energy optimization in HPC, moving beyond time-focused code generation. The empirical feedback loop and GPU-specific evaluation on representative scientific workloads are strengths; the approach could influence energy-aware refactoring tools if the pass rates and correctness guarantees are quantified.

major comments (3)

[Abstract / Evaluation] The headline claims of 36% (MI100) and 34% (A100) average energy reduction are explicitly conditioned on 'trials that produced passing energy-reducing refactorings' without reporting the overall success rate, the distribution of savings among passers, or any failure-mode analysis. This directly affects interpretability of the central claim, as the averages cannot be read as expected performance of LASSI-EE without knowing what fraction of attempts succeeded or whether passers are biased toward easier benchmarks.
[Methodology / Evaluation] Functional correctness of the refactored scientific codes is asserted via self-correcting loops and LLM-as-Judge screening, yet no details are provided on verification methods for numerical/scientific results (e.g., tolerance checks against reference outputs, number of test cases, or error rates). This is load-bearing because undetected semantic errors could invalidate the energy savings.
[Evaluation] The abstract and results lack variance across trials, statistical significance tests, or per-benchmark breakdowns for the energy reductions. Without these, it is unclear whether the reported averages are robust or driven by a few outliers.
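The first major comment turns on the gap between an average over passing trials and an average over all trials. The short Python sketch below makes that arithmetic concrete. The trial values are invented for illustration and are not results from the paper, and counting a failed trial as zero saving (the user keeps the original code) is an assumption of this sketch.

```python
from typing import List, Optional, Tuple


def summarize(reductions: List[Optional[float]]) -> Tuple[float, Optional[float], float]:
    """Return (success rate, mean over passing trials, mean over all trials).

    Each entry is a fractional energy reduction, or None for a trial that
    produced no passing energy-reducing refactoring.
    """
    if not reductions:
        raise ValueError("at least one trial is required")
    passing = [r for r in reductions if r is not None]
    success_rate = len(passing) / len(reductions)
    conditional_mean = sum(passing) / len(passing) if passing else None
    # Assumption: a failed trial contributes zero saving.
    overall_mean = sum(passing) / len(reductions)
    return success_rate, conditional_mean, overall_mean


# Hypothetical data: 3 passing trials out of 10.
trials = [0.40, None, 0.30, None, None, 0.35, None, None, None, None]
rate, conditional, overall = summarize(trials)
print(f"success rate {rate:.0%}, passing-only mean {conditional:.1%}, all-trials mean {overall:.1%}")
```

With these made-up numbers the passing-only mean is about 35 percent while the all-trials mean is about 10.5 percent, which is why the success rate is needed to interpret a conditional average.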

minor comments (2)

[Evaluation] Clarify the exact number of trials per benchmark and the definition of 'passing' (e.g., energy reduction threshold and correctness criteria) in the evaluation description.
[Related Work] The paper could strengthen the related-work discussion by explicitly contrasting LASSI-EE against prior LLM code-generation efforts focused solely on performance rather than energy.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and agree that several clarifications and additions will strengthen the manuscript. Revisions will be made to improve interpretability and rigor of the evaluation.

Point-by-point responses

Referee: The headline claims of 36% (MI100) and 34% (A100) average energy reduction are explicitly conditioned on 'trials that produced passing energy-reducing refactorings' without reporting the overall success rate, the distribution of savings among passers, or any failure-mode analysis. This directly affects interpretability of the central claim, as the averages cannot be read as expected performance of LASSI-EE without knowing what fraction of attempts succeeded or whether passers are biased toward easier benchmarks.

Authors: We acknowledge the referee's point on interpretability. The reported averages are conditioned on successful trials, as stated in the abstract and Evaluation section. To address this, we will revise the manuscript to report the overall success rate across all attempts (including the fraction of trials that produced passing refactorings), a distribution or histogram of energy savings among successful cases, and a concise failure-mode analysis. These details are available from our experimental logs and will be added to the Evaluation section and abstract where appropriate. revision: yes
Referee: Functional correctness of the refactored scientific codes is asserted via self-correcting loops and LLM-as-Judge screening, yet no details are provided on verification methods for numerical/scientific results (e.g., tolerance checks against reference outputs, number of test cases, or error rates). This is load-bearing because undetected semantic errors could invalidate the energy savings.

Authors: We agree that explicit verification details for numerical correctness are essential for scientific codes. The Methodology section describes the self-correcting feedback loops and LLM-as-Judge screening, but we will expand the Evaluation section to include specifics on numerical verification: tolerance thresholds used for output comparisons against reference implementations, the number and types of test cases per benchmark, and observed error rates or failure counts during screening. This will clarify how semantic correctness was ensured beyond the LLM-based checks. revision: yes
Referee: The abstract and results lack variance across trials, statistical significance tests, or per-benchmark breakdowns for the energy reductions. Without these, it is unclear whether the reported averages are robust or driven by a few outliers.

Authors: We appreciate this observation regarding statistical robustness. The current Evaluation section presents aggregate averages but does not include variance, per-benchmark breakdowns, or significance tests. We will revise to add: (1) per-benchmark energy reduction tables or figures with individual results, (2) measures of variance (e.g., standard deviation or interquartile range across trials), and (3) statistical significance tests (such as paired t-tests comparing original vs. refactored energy) to confirm the averages are not outlier-driven. These additions will be included in the revised Evaluation section. revision: yes
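The paired t-test mentioned in the last response compares each benchmark's original energy with its refactored energy. A minimal standard-library sketch of the test statistic follows; the energy values are hypothetical and the function name `paired_t_statistic` is introduced here for illustration only.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(
    original: Sequence[float], refactored: Sequence[float]
) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired energy measurements."""
    if len(original) != len(refactored):
        raise ValueError("paired samples must have equal length")
    n = len(original)
    if n < 2:
        raise ValueError("at least two pairs are required")
    # Per-pair energy saving; positive means the refactored code used less energy.
    diffs = [o - r for o, r in zip(original, refactored)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical energy measurements in joules, one pair per benchmark.
original_j = [100.0, 120.0, 90.0, 110.0]
refactored_j = [70.0, 80.0, 60.0, 75.0]
t_stat, dof = paired_t_statistic(original_j, refactored_j)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

Converting the statistic to a p-value needs the Student t distribution, which the Python standard library does not provide; `scipy.stats.ttest_rel` is one widely used option that returns both values.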

Circularity Check

0 steps flagged

No significant circularity; empirical measurements are externally grounded

Full rationale

The paper proposes an LLM-driven refactoring framework (LASSI-EE) and evaluates it via direct runtime power profiling on physical NVIDIA A100 and AMD MI100 GPUs across 22 benchmarks. Reported energy reductions (36% MI100, 34% A100) are obtained from hardware measurements conditioned on passing trials, not from any fitted parameters, self-referential definitions, or equations that reduce to the inputs by construction. No derivation chain, uniqueness theorems, or ansatzes are present; the central claims rest on external, falsifiable execution data rather than on load-bearing self-citation or the renaming of known results.
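Because the reductions come from hardware measurements, the quantity being compared is energy derived from sampled power. The sketch below shows one common way to turn timestamped power samples into joules and then into a percentage reduction. It assumes the samples have already been collected by some profiling tool and uses trapezoidal integration; how LASSI-EE samples and integrates power is not described in the material summarized here.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate power (watts) over time (seconds) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    if len(timestamps_s) < 2:
        raise ValueError("at least two samples are required")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        # Area of the trapezoid between two consecutive samples.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def percent_reduction(baseline_j: float, refactored_j: float) -> float:
    """Energy saved by the refactored code, as a percentage of the baseline."""
    if baseline_j <= 0:
        raise ValueError("baseline energy must be positive")
    return 100.0 * (baseline_j - refactored_j) / baseline_j
```

Since energy is power integrated over runtime, a refactoring can lower energy by drawing less power, by finishing sooner, or both, so comparing joules rather than watts is what makes the reduction meaningful.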

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper introduces a new empirical framework but does not specify free parameters, axioms, or invented entities beyond the proposed system itself; assessment is limited by abstract-only access.

invented entities (1)

LASSI-EE framework · no independent evidence
purpose: Automated LLM-based energy-aware refactoring of parallel codes
Proposed in this work as the core contribution.

pith-pipeline@v0.9.0 · 5729 in / 1111 out tokens · 33320 ms · 2026-05-22T17:17:57.907430+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: LASSI-EE ... multi-stage, iterative approach integrating runtime power profiling, energy-aware prompting, self-correcting feedback loops, and an LLM-as-a-Judge agent

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: energy-reduction@k ... expected energy reduction when generating k code candidates

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SysLLMatic: Large Language Models are Software System Optimizers
cs.SE · 2025-06 · unverdicted · novelty 6.0

SysLLMatic integrates LLMs with performance diagnostics and a 43-pattern catalog to optimize complex software, reporting 1.54x latency and 1.24x energy gains over compilers on large Java systems where prior LLM method...

Sustainable Code Generation Using Large Language Models: A Systematic Literature Review
cs.SE · 2026-03 · unverdicted · novelty 3.0

A systematic review finds research on the sustainability of LLM-generated code to be limited, fragmented, and without accepted frameworks for measurement or benchmarking.