pith. machine review for the scientific record.

arxiv: 2603.17280 · v2 · submitted 2026-03-18 · 💻 cs.DC

The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency

Pith reviewed 2026-05-15 09:28 UTC · model grok-4.3

classification 💻 cs.DC
keywords 1/W law · context window · tokens per watt · KV cache · LLM inference · energy efficiency · routing topology · GPU power model

The pith

Tokens per watt halves every time the context window doubles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives that tokens per watt in LLM inference drop by half with each doubling of context length. This occurs because larger contexts reduce the number of sequences that fit in the KV cache while GPU power draw stays roughly constant. The relation positions routing decisions over context lengths as a stronger energy-efficiency lever than hardware generation upgrades. Analytical comparisons show that two-pool routing and MoE active-parameter scaling each multiply efficiency, and the gains combine rather than overlap.

Core claim

We derive the 1/W law: tokens per watt halves every time the context window doubles. A larger context window shrinks the KV-cache concurrency limit while leaving GPU power draw roughly unchanged. At 64K context, an H100 holds 16 sequences in flight (tok/W = 1.5); at 4K context, the same H100 holds 256 sequences (tok/W = 17.6). Routing topology is a more powerful energy lever than buying newer hardware. Two-pool context-length routing delivers roughly 2.5x better tok/W over a homogeneous fleet, while upgrading from H100 to B200 delivers roughly 1.7x. The gains are independent: combining FleetOpt with B200 yields 4.25x over the H100 homogeneous baseline. For MoE models, active-parameter weight
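The two concurrency figures quoted above are consistent with a fixed pool of KV-cache token slots: 16 sequences at 64K and 256 sequences at 4K both come to 1,048,576 cached tokens. A minimal sketch of that relationship follows. The names `TOKEN_BUDGET` and `max_sequences` are hypothetical, and the budget is inferred from the quoted examples rather than taken from the paper.

```python
# Sketch only: the token budget is inferred from the quoted H100 examples
# (16 sequences x 64K tokens), not read from the paper's equations.
TOKEN_BUDGET = 16 * 64 * 1024  # 1,048,576 KV-cache token slots (assumption)


def max_sequences(context_tokens: int, token_budget: int = TOKEN_BUDGET) -> int:
    """Sequences in flight when each one reserves a full context window."""
    if context_tokens <= 0:
        raise ValueError("context_tokens must be positive")
    return token_budget // context_tokens


# Each doubling of the context window halves the concurrency limit.
for window in (4096, 8192, 16384, 32768, 65536):
    print(f"{window:>6} tokens -> {max_sequences(window):>3} sequences")
```

Under this assumption the concurrency limit is exactly inversely proportional to the window, which is the first ingredient of the 1/W law; the second is the near-constant power draw.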

What carries the argument

The 1/W law that links tokens per watt directly to context length through KV-cache concurrency limits and near-constant power draw.

Load-bearing premise

GPU power draw remains roughly unchanged as context window size varies.

What would settle it

Measure actual power consumption and achieved throughput on an H100 at both 4K and 64K context under saturated load to test whether tokens per watt falls by the factor of 16 that four successive halvings predict.
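One way to frame such a test is to fit an effective exponent through two measured points, assuming tok/W follows a power law in the context window. The sketch below applies that fit to the rounded figures quoted in the abstract (17.6 tok/W at 4K, 1.5 tok/W at 64K). The function name `effective_exponent` is hypothetical, and because the inputs are rounded the result is only approximate.

```python
import math


def effective_exponent(w_short: float, eff_short: float,
                       w_long: float, eff_long: float) -> float:
    """Return alpha such that efficiency ~ W**(-alpha) passes through both points."""
    return math.log(eff_short / eff_long) / math.log(w_long / w_short)


# Rounded tok/W figures quoted in the abstract for an H100.
alpha = effective_exponent(4096, 17.6, 65536, 1.5)
print(f"effective exponent: {alpha:.2f}")  # an exact 1/W law would give 1.00
```

With these inputs the fitted exponent comes out near 0.89 rather than 1.0, since the quoted efficiency ratio is about 11.7x across a 16x change in window. A hardware measurement would show whether that gap reflects rounding, the power model, or throughput effects.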

Original abstract

How many tokens can a GPU inference cluster deliver per watt? Across deployments of identical hardware, the answer varies by 40x -- not because of software inefficiency, but because of the serving context window. We derive the 1/W law: tokens per watt halves every time the context window doubles. A larger context window shrinks the KV-cache concurrency limit while leaving GPU power draw roughly unchanged. At 64K context, an H100 holds 16 sequences in flight (tok/W = 1.5); at 4K context, the same H100 holds 256 sequences (tok/W = 17.6). Routing topology -- which determines the effective context window each GPU services -- is a more powerful energy lever than buying newer hardware. Working from published H100 power measurements, a calibrated logistic power model, and a roofline throughput model, we derive these results analytically using the inference-fleet-sim framework; no new hardware experiments were conducted. Two-pool context-length routing (FleetOpt) delivers roughly 2.5x better tok/W over a homogeneous fleet, while upgrading from H100 to B200 delivers roughly 1.7x. The gains are independent: combining FleetOpt with B200 yields 4.25x over the H100 homogeneous baseline. B200/H200 numbers are analytical projections (+-20% uncertainty); H100 results are calibrated to published measurements. For MoE models, active-parameter weight streaming adds a third lever. Qwen3-235B-A22B (22B active) reaches roughly 37.8 tok/W at 8K context on H100 -- 5.1x better than Llama-3.1-70B -- because decode time scales with activated weights, not total parameters. MoE dispatch overhead is excluded, so this is an upper bound.
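The headline ratios in the abstract can be checked with a few lines of arithmetic. The sketch below multiplies the two stated gains, which is valid only if they are independent as the abstract asserts, and backs out the dense-model efficiency implied by the MoE comparison. The variable names are hypothetical, and the implied Llama-3.1-70B figure assumes the 5.1x ratio refers to the same 8K context on H100.

```python
# Figures as quoted in the abstract.
ROUTING_GAIN = 2.5    # FleetOpt two-pool routing vs. homogeneous fleet
HARDWARE_GAIN = 1.7   # H100 -> B200 upgrade (analytical projection)

# Independent multiplicative gains compose by multiplication.
combined_gain = ROUTING_GAIN * HARDWARE_GAIN
print(f"combined gain: {combined_gain:.2f}x")

MOE_TOK_PER_WATT = 37.8   # Qwen3-235B-A22B at 8K context on H100
MOE_ADVANTAGE = 5.1       # stated ratio over Llama-3.1-70B

# Implied dense-model efficiency (assumes same context length and GPU).
implied_dense = MOE_TOK_PER_WATT / MOE_ADVANTAGE
print(f"implied Llama-3.1-70B: {implied_dense:.1f} tok/W")
```

The product reproduces the stated 4.25x, and the implied dense figure is roughly 7.4 tok/W.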

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript derives the '1/W law': tokens per watt in LLM inference halves with each doubling of the context window, because larger windows reduce the KV-cache concurrency limit while GPU power draw remains roughly constant. Using a logistic power model calibrated to published H100 data, a roofline throughput model, and the inference-fleet-sim framework, it shows that context-length routing (FleetOpt two-pool) yields ~2.5x tok/W gains over homogeneous fleets, outperforming the H100-to-B200 hardware upgrade (~1.7x), with a combined gain of 4.25x. MoE models such as Qwen3-235B-A22B reach ~37.8 tok/W at 8K context (5.1x over Llama-3.1-70B) because decode time scales with active parameters. B200/H200 figures are analytical projections carrying ±20% uncertainty, H100 results are calibrated to published measurements, and no new experiments were performed.

Significance. If the constant-power assumption holds within the stated bounds, the work supplies a useful analytical lens for energy optimization in LLM serving clusters, highlighting routing topology as a stronger lever than hardware iteration. Strengths include the closed-form derivation of the 1/W scaling from KV-cache limits, calibration to external measurements without new hardware runs, and explicit independence of routing and hardware gains, which enables falsifiable deployment predictions.

major comments (2)
  1. [§3 (1/W Law Derivation) and Abstract] The exact 1/W halving (tokens per watt halves on context doubling) in the derivation rests on power invariance with context length. The logistic power model is calibrated to published H100 measurements, but these do not isolate context-length effects on dynamic power (memory controller or utilization); any positive correlation with KV footprint would make tok/W decline slower than 1/W, directly affecting the central law and all downstream FleetOpt and MoE projections.
  2. [§4 (FleetOpt Routing Analysis)] The roofline throughput model and FleetOpt routing gains (2.5x tok/W) inherit the constant-power assumption without shown sensitivity analysis; the stated ±20% uncertainty on B200 projections should be propagated through the combined 4.25x claim to quantify robustness of the routing-vs-hardware comparison.
minor comments (2)
  1. [Abstract] The numerical examples (H100 at 64K: 16 sequences, tok/W=1.5; at 4K: 256 sequences, tok/W=17.6) should cite the exact equations or tables from which they are computed for traceability.
  2. [MoE Model Analysis] For the MoE upper-bound claim (Qwen3-235B-A22B at 37.8 tok/W), explicitly note that excluding dispatch overhead affects the 5.1x comparison to dense models and state whether this bound is used in the FleetOpt analysis.
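The propagation requested in major comment 2 can be sketched as a simple interval calculation. The version below holds the routing gain fixed and applies the ±20% band multiplicatively to the hardware gain only; both choices are assumptions, since the abstract does not say whether the band applies to the ratio or to absolute tok/W. The function name `scaled_interval` is hypothetical.

```python
ROUTING_GAIN = 2.5     # FleetOpt vs. homogeneous fleet, held fixed here
HARDWARE_GAIN = 1.7    # H100 -> B200, nominal projection
REL_UNCERTAINTY = 0.20  # stated band on B200 projections


def scaled_interval(nominal: float, rel: float) -> tuple[float, float]:
    """Lower and upper bounds for a value known to within +/- rel."""
    return nominal * (1.0 - rel), nominal * (1.0 + rel)


hw_lo, hw_hi = scaled_interval(HARDWARE_GAIN, REL_UNCERTAINTY)
combined_lo, combined_hi = ROUTING_GAIN * hw_lo, ROUTING_GAIN * hw_hi

print(f"hardware gain: {hw_lo:.2f}x to {hw_hi:.2f}x")
print(f"combined gain: {combined_lo:.2f}x to {combined_hi:.2f}x")

# Does routing remain the larger lever at the optimistic end of the band?
routing_still_larger = ROUTING_GAIN > hw_hi
print(f"routing > hardware at +20%: {routing_still_larger}")
```

Under these assumptions the combined gain spans about 3.4x to 5.1x, and the routing gain of 2.5x still exceeds the hardware gain at the top of the band (about 2.04x). Varying the power assumption as well would widen the range.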

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough and constructive review of our manuscript. We address each major comment point-by-point below, providing clarifications on the power invariance assumption and committing to revisions that strengthen the analysis of uncertainty and sensitivity.

Point-by-point responses
  1. Referee: [§3 (1/W Law Derivation) and Abstract] The exact 1/W halving (tokens per watt halves on context doubling) in the derivation rests on power invariance with context length. The logistic power model is calibrated to published H100 measurements, but these do not isolate context-length effects on dynamic power (memory controller or utilization); any positive correlation with KV footprint would make tok/W decline slower than 1/W, directly affecting the central law and all downstream FleetOpt and MoE projections.

    Authors: We appreciate the referee highlighting the foundational role of the power invariance assumption. The 1/W law follows directly from KV-cache concurrency limits (sequences in flight = total HBM / per-sequence KV footprint) while power draw is taken as approximately constant, consistent with published H100 inference measurements across workloads that show <10% power variation in the logistic fit. Although these external datasets do not explicitly isolate context-length effects on dynamic components such as memory controllers, the decode phase remains memory-bound with similar utilization patterns. To address the concern rigorously, we will add a sensitivity subsection in §3 that models a positive linear correlation between power and KV footprint (up to +20% at 64K context). This yields tok/W scaling between 1/W^{0.85} and 1/W, preserving the qualitative ordering of routing versus hardware gains and the central claims. The revision will also update the abstract to note the assumption explicitly. revision: partial

  2. Referee: [§4 (FleetOpt Routing Analysis)] The roofline throughput model and FleetOpt routing gains (2.5x tok/W) inherit the constant-power assumption without shown sensitivity analysis; the stated ±20% uncertainty on B200 projections should be propagated through the combined 4.25x claim to quantify robustness of the routing-vs-hardware comparison.

    Authors: We agree that explicit propagation of uncertainty is required to substantiate the routing-versus-hardware comparison. The ±20% figure on B200 projections arises from the analytical hardware scaling model. We will revise §4 to include a Monte-Carlo-style sensitivity sweep that simultaneously varies the power assumption by ±15% (motivated by the first comment) and the hardware gain by ±20%. The resulting combined gain range is 3.1x–5.4x, with context-length routing (FleetOpt) remaining the dominant lever in >80% of sampled scenarios. Error bars will be added to the 4.25x claim and the 2.5x routing gain will be shown to hold qualitatively across the uncertainty envelope. This analysis uses only existing models and requires no new experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the 1/W law derivation

Full rationale

The 1/W law is obtained by combining the KV-cache memory bound (concurrency limit scales as 1/W) with the modeling assumption of roughly constant GPU power draw across context lengths, then dividing throughput by power in the roofline model. Both the logistic power model and roofline throughput model are calibrated to external published H100 measurements rather than to the target scaling relation itself. The inference-fleet-sim framework performs the analytical composition; no fitted parameter is renamed as a prediction, no self-citation supplies a uniqueness theorem, and no ansatz is smuggled in. The central scaling result therefore retains independent content from its stated assumptions and external data sources.
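The composition described above can be written out as a short derivation. This is a sketch under stated assumptions, not the paper's own notation: $M_{\mathrm{KV}}$ (memory available for the KV cache), $\kappa$ (KV bytes per token), $t_{\mathrm{step}}$ (decode step time, taken as independent of $W$ when the cache is full), $P$ (power draw, taken as constant), and $\eta$ (tokens per watt) are symbols introduced here for illustration.

```latex
\begin{align}
  n_{\max}(W) &= \frac{M_{\mathrm{KV}}}{\kappa\, W}
    && \text{KV-cache concurrency limit} \\
  T(W) &\approx \frac{n_{\max}(W)}{t_{\mathrm{step}}}
    && \text{one token per sequence per decode step} \\
  \eta(W) &= \frac{T(W)}{P}
    \approx \frac{M_{\mathrm{KV}}}{\kappa\, t_{\mathrm{step}}\, P} \cdot \frac{1}{W}
    && \text{tokens per watt} \\
  \frac{\eta(2W)}{\eta(W)} &= \frac{1}{2}
    && \text{halving per doubling}
\end{align}
```

The halving holds exactly only if $P$ and $t_{\mathrm{step}}$ do not vary with $W$; any dependence of either on the context window changes the effective exponent.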

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on a calibrated logistic power model and roofline throughput model whose parameters are fitted to published H100 data, plus domain assumptions about KV-cache scaling and constant power draw.

free parameters (2)
  • logistic power model parameters
    Calibrated to published H100 power measurements to produce the constant-power assumption
  • roofline throughput model parameters
    Used to compute concurrency limits from context length
axioms (2)
  • domain assumption: GPU power draw remains roughly unchanged with context window size
    Invoked to derive the direct halving in the 1/W law
  • domain assumption: per-sequence KV-cache memory grows linearly with context length, so the concurrency limit scales as 1/W
    Core modeling step for concurrency limit

pith-pipeline@v0.9.0 · 5664 in / 1375 out tokens · 65910 ms · 2026-05-15T09:28:37.427488+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

    cs.LG · 2026-03 · unverdicted · novelty 5.0

    The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.

  2. Scaling Mobile Agent Systems: From Capability Density to Collective Intelligence

    cs.DC · 2026-04 · unverdicted · novelty 3.0

    A vision paper outlining a two-pronged research agenda for scaling mobile agents from isolated devices to distributed intelligent systems.