Dissecting Transformers: A CLEAR Perspective towards Green AI

Divyansh Pandey; Hemang Jain; Karthik Vaidhyanathan; Shailender Goyal

arxiv: 2510.02810 · v2 · submitted 2025-10-03 · 💻 cs.LG · cs.AI · cs.SE

Pith reviewed 2026-05-18 10:14 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.SE

keywords energy efficiency · transformer models · attention mechanism · green AI · LLM inference · component-level measurement · sustainability · energy consumption

The pith

A repetition technique measures energy use of individual Transformer parts and shows attention costs more energy per FLOP than the model as a whole.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CLEAR to solve the mismatch between fast component runs and slow energy sensors by repeating executions and averaging the results. It applies the method to fifteen models and shows that attention layers draw more energy relative to their floating-point operations than the rest of the model combined. This means total FLOP counts hide real differences in power draw across components. The measurements stay stable with under ten percent variance and account for most of the model's total energy.

Core claim

Using CLEAR, the authors demonstrate that Attention consumes significantly more Energy per FLOP as compared to the entire model, indicating that FLOPs alone fail to capture true component-level energy cost. CLEAR enables reliable fine-grained energy measurements and provides a strong formal foundation for predictive modelling of energy consumption.

What carries the argument

CLEAR (Component-Level Energy Assessment via Repetitions), a protocol that repeats short component executions to align their microsecond timing with millisecond-scale energy sensors.
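
The repetition idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `read_energy_joules` is a hypothetical callable standing in for whatever cumulative energy counter the hardware exposes, and `estimate_component_energy`, `n_repeats`, and `FakeCounter` are names chosen here for illustration. On real accelerators, work is often dispatched asynchronously, so a synchronization step before each counter read would likely be needed; that detail is omitted.

```python
from typing import Callable


def estimate_component_energy(
    component: Callable[[], None],
    read_energy_joules: Callable[[], float],
    n_repeats: int,
) -> float:
    """Estimate per-execution energy by amortizing one sensor window over many runs.

    `read_energy_joules` is a hypothetical cumulative energy counter
    (an assumption for this sketch, not an API named in the paper).
    """
    if n_repeats <= 0:
        raise ValueError("n_repeats must be positive")
    e_start = read_energy_joules()
    for _ in range(n_repeats):
        component()  # back-to-back executions of the same component
    e_end = read_energy_joules()
    # Total accumulated energy divided by the number of repetitions.
    return (e_end - e_start) / n_repeats


class FakeCounter:
    """Toy cumulative counter used only to exercise the sketch."""

    def __init__(self, joules_per_call: float) -> None:
        self.total = 0.0
        self.joules_per_call = joules_per_call

    def run_component(self) -> None:
        self.total += self.joules_per_call

    def read(self) -> float:
        return self.total


counter = FakeCounter(joules_per_call=0.002)
estimate = estimate_component_energy(
    counter.run_component, counter.read, n_repeats=1000
)
```

With the toy counter, the estimate recovers the per-call energy up to floating-point error. Whether the same holds on real hardware is exactly the premise examined below.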

If this is right

Attention mechanisms use more energy per computation than other Transformer blocks.
FLOP totals are insufficient for comparing energy efficiency of different model parts.
Design choices such as number of attention heads or KV-cache size directly affect measured energy draw.
Component-level data can support models that predict energy use before running inference.
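
To make the energy-per-FLOP comparison concrete, here is a small sketch of the ratio involved. The function names are chosen here for illustration, and the numbers are placeholders invented for the example; they are not measurements from the paper.

```python
def energy_per_flop(energy_joules: float, flops: float) -> float:
    """Energy cost of one floating-point operation, in joules."""
    if flops <= 0:
        raise ValueError("flops must be positive")
    return energy_joules / flops


def relative_energy_intensity(
    component_energy_joules: float,
    component_flops: float,
    model_energy_joules: float,
    model_flops: float,
) -> float:
    """Ratio of a component's energy per FLOP to the whole model's."""
    component = energy_per_flop(component_energy_joules, component_flops)
    model = energy_per_flop(model_energy_joules, model_flops)
    return component / model


# Placeholder numbers, invented for illustration only (not from the paper).
ratio = relative_energy_intensity(
    component_energy_joules=0.3,
    component_flops=1e11,
    model_energy_joules=1.0,
    model_flops=1e12,
)
```

A ratio above 1 means the component draws more energy per FLOP than the model as a whole, which is the pattern the paper reports for attention. In that case a FLOP-proportional estimate would understate the component's share of energy.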

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Reducing energy in attention could produce larger savings than equal effort on other layers.
Hardware accelerators might benefit from special paths for high-energy components like attention.
The same repetition approach could be tested on non-Transformer architectures to check for similar energy imbalances.

Load-bearing premise

Repeating component executions at microsecond scale sufficiently bridges the temporal mismatch with millisecond-scale energy sensors without introducing measurement artifacts or bias.

What would settle it

A single-run energy trace of an attention block that yields a materially different energy-per-FLOP ratio from the repeated-execution average.

Original abstract

The rapid adoption of Large Language Models (LLMs) has raised significant environmental concerns. Unlike the one-time cost of training, LLM inference occurs continuously and dominates the AI energy footprint. Yet most sustainability studies report only coarse model-level metrics, treating energy efficiency as an afterthought rather than a primary objective. To address this limitation, we propose Component-Level Energy Assessment via Repetitions (CLEAR) to overcome the temporal mismatch between microsecond-scale component execution and millisecond-scale (ms) monitoring by energy sensors. Using CLEAR, we evaluate 15 models spanning four architecture types, keeping component-wise energy variance below 9.5% while capturing over 90% of total energy as individual components. We present the first comprehensive, fine-grained energy analysis of Transformer components across key parameters such as batch size, attention heads, hidden dimension, KV cache, and attention variants. Our findings reveal that Attention consumes significantly more Energy per FLOP as compared to the entire model, indicating that FLOPs alone fail to capture true component-level energy cost. CLEAR enables reliable fine-grained energy measurements and provides a strong formal foundation for predictive modelling of energy consumption.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CLEAR uses repetition to measure component energy in Transformers and finds attention costs more per FLOP than the model average, but the repetition step itself may bias those numbers.

The two things to know are that the authors repeat short component executions until millisecond-scale energy sensors can read them, and that this yields the result that attention uses more energy per FLOP than the model as a whole across the 15 models they tested. They also report keeping component variance under 9.5 percent while capturing over 90 percent of total energy. That consistency is the main practical output so far.

They vary batch size, heads, hidden size, KV cache, and attention variants, which gives a broader picture than the usual whole-model numbers. The data across four architecture families is the clearest part of the work and could feed into better predictive models for inference energy.

The soft spot sits in the repetition assumption. Running the same microsecond operation many times can change cache residency, GPU frequency scaling, or thermal state compared with a normal single forward pass. Their variance and capture-rate checks are internal to the repeated runs and do not test whether the measured energy per FLOP still holds in ordinary inference. Without a direct comparison or overhead model, the claim that FLOPs alone miss the true cost rests on an unverified step.

This paper is aimed at people who need concrete component-level energy numbers for green-AI work rather than theoretical derivations. A reader looking for data points to build or check energy predictors will find usable material here. I would send it to peer review because the measurement approach is concrete and the results are specific enough to be worth detailed checking, even with the open question on repetition bias.

Referee Report

1 major / 2 minor

Summary. The paper introduces CLEAR (Component-Level Energy Assessment via Repetitions), a method to isolate and measure energy use of individual Transformer components (e.g., attention) by repeating their microsecond-scale executions until the accumulated energy is detectable by millisecond-scale sensors. Experiments on 15 models across four architectures report component energy variance below 9.5% and capture of over 90% of total model energy. The central empirical finding is that attention exhibits substantially higher energy per FLOP than the model average, implying that aggregate FLOPs are an inadequate proxy for component-level energy costs. The work positions CLEAR as enabling reliable fine-grained analysis and a foundation for predictive energy-consumption models.

Significance. If the per-component measurements prove representative of single-pass inference, the results supply concrete, quantitative evidence that attention mechanisms are disproportionately energy-intensive on a per-FLOP basis. This has direct implications for Green AI, as it suggests targeted optimizations (e.g., attention variants or KV-cache strategies) could yield larger efficiency gains than FLOPs-based estimates would predict. The multi-model, multi-parameter evaluation (batch size, heads, hidden dimension, etc.) and the reported consistency metrics constitute a useful empirical dataset for the community.

major comments (1)

§3 (CLEAR procedure): The central claim that attention consumes significantly more energy per FLOP rests on the untested assumption that repeated component executions preserve the same power-draw behavior (cache residency, DVFS scaling, thermal headroom) as a single forward pass. The manuscript provides only internal consistency checks (<9.5% variance, >90% capture); it does not include validation experiments that compare repeated-run energy profiles against single-pass or higher-resolution measurements. This assumption is load-bearing for the headline result and for any downstream predictive modeling.

minor comments (2)

Abstract and §4: The statements 'variance below 9.5 percent' and 'over 90 percent energy captured' are presented without the precise definitions of the variance metric, the aggregation method, or the exact capture-rate formula; these should be stated explicitly.
§5 (results): No comparison is made to alternative energy-profiling tools or to published per-component breakdowns from other groups; adding such baselines would help readers gauge absolute accuracy.
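
To illustrate why the first minor comment matters, here is one plausible reading of the two metrics. Both definitions are assumptions made for this sketch, since the manuscript's exact formulas are not given here: the variance figure is read as a coefficient of variation (sample standard deviation over mean) across repeated trials, and the capture rate as the sum of component energies divided by the separately measured whole-model energy. The component names and values in the usage lines are placeholders, not data from the paper.

```python
import statistics
from typing import Mapping, Sequence


def coefficient_of_variation(samples: Sequence[float]) -> float:
    """Sample standard deviation divided by the mean (assumed variance metric)."""
    mean = statistics.fmean(samples)
    if mean == 0:
        raise ValueError("mean of samples must be non-zero")
    # statistics.stdev needs at least two samples and raises otherwise.
    return statistics.stdev(samples) / mean


def capture_rate(
    component_energies: Mapping[str, float], total_model_energy: float
) -> float:
    """Fraction of whole-model energy accounted for by the summed components
    (assumed capture-rate formula)."""
    if total_model_energy <= 0:
        raise ValueError("total_model_energy must be positive")
    return sum(component_energies.values()) / total_model_energy


# Placeholder values, invented for illustration only.
cv = coefficient_of_variation([1.0, 1.0, 1.0])
captured = capture_rate({"attention": 0.5, "mlp": 0.25}, total_model_energy=1.0)
```

Other readings are possible, for example variance across models rather than across trials, which is why an explicit definition in the manuscript would help.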

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comment on the CLEAR procedure raises an important methodological point, and we address it directly below while clarifying the rationale and limitations of our approach.

Referee: §3 (CLEAR procedure): The central claim that attention consumes significantly more energy per FLOP rests on the untested assumption that repeated component executions preserve the same power-draw behavior (cache residency, DVFS scaling, thermal headroom) as a single forward pass. The manuscript provides only internal consistency checks (<9.5% variance, >90% capture); it does not include validation experiments that compare repeated-run energy profiles against single-pass or higher-resolution measurements. This assumption is load-bearing for the headline result and for any downstream predictive modeling.

Authors: We appreciate this observation and agree that direct validation against single-pass or higher-resolution measurements would provide additional confidence. CLEAR was developed precisely because millisecond-scale sensors cannot resolve the microsecond-scale execution of individual components during a single forward pass; repetition is required to accumulate measurable energy. The reported variance below 9.5% and capture of over 90% of total model energy serve as internal evidence that power-draw characteristics remain stable under repetition. We acknowledge that these checks do not constitute a head-to-head comparison with single-pass profiles. In the revised manuscript we will expand Section 3 with an explicit discussion of the assumption, including monitoring of CPU frequency, cache behavior, and temperature across repetition counts, and we will add a limitations paragraph addressing potential discrepancies in DVFS and thermal headroom. This constitutes a partial revision focused on strengthening the presentation and discussion rather than new empirical validation experiments.

revision: partial
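
One way the proposed check across repetition counts could look is sketched below. This is an illustration under assumptions, not the authors' planned analysis: `estimate_at` is a hypothetical callable that returns the per-execution energy estimate for a given repetition count, and the toy model in the usage lines (a fixed overhead amortized over `n` runs) is invented for the example.

```python
from typing import Callable, Dict, Sequence


def repetition_drift(
    estimate_at: Callable[[int], float],
    repeat_counts: Sequence[int],
) -> Dict[int, float]:
    """Relative deviation of each per-execution estimate from the estimate
    obtained at the largest repetition count."""
    if not repeat_counts:
        raise ValueError("repeat_counts must not be empty")
    reference = estimate_at(max(repeat_counts))
    if reference == 0:
        raise ValueError("reference estimate must be non-zero")
    return {n: (estimate_at(n) - reference) / reference for n in repeat_counts}


def toy_estimate(n: int) -> float:
    """Toy model (an assumption): 2.0 units of true per-execution energy
    plus a fixed overhead of 8.0 units spread over n runs."""
    return 2.0 + 8.0 / n


drift = repetition_drift(toy_estimate, [1, 2, 4, 8])
```

A drift curve that flattens as the count grows suggests fixed overheads have been amortized. It does not show that repeated execution matches a single forward pass, so it complements rather than replaces the comparison the referee asks for.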

Circularity Check

0 steps flagged

No circularity in empirical measurement framework

The paper introduces CLEAR as a practical empirical technique to resolve the temporal mismatch between microsecond-scale component runs and millisecond-scale energy sensors via repetition. All reported findings (component energy variances below 9.5%, >90% capture rate, and the observation that attention uses more energy per FLOP than the model average) are direct outputs of these measurements across 15 models and parameter sweeps. No equations, predictions, or first-principles derivations are presented that reduce by construction to fitted parameters, self-citations, or ansatzes; the central claims remain independent experimental results rather than tautological restatements of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that hardware energy sensors can be made accurate via repetition; no free parameters or new invented entities are stated in the abstract.

axioms (1)

domain assumption: Microsecond-scale component executions can be accurately captured by millisecond-scale energy sensors through sufficient repetitions without altering true power characteristics.
Invoked to justify CLEAR as a solution to the temporal mismatch described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: CLEAR employs an amplification strategy... repeatedly execute each component back-to-back... $E^{\mathrm{tot}}_c = \mathrm{MeasureEnergy}\left(\sum_{i=1}^{N} c(a_c)\right)$ ... $\hat{E}_c = \frac{E_{\mathrm{end}} - E_{\mathrm{start}}}{N}$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Attention consumes significantly more Energy per FLOP as compared to the entire model

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.