Recognition: 2 theorem links
· Lean TheoremThe 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency
Pith reviewed 2026-05-15 09:28 UTC · model grok-4.3
The pith
Tokens per watt halves every time the context window doubles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We derive the 1/W law: tokens per watt halves every time the context window doubles. A larger context window shrinks the KV-cache concurrency limit while leaving GPU power draw roughly unchanged. At 64K context, an H100 holds 16 sequences in flight (tok/W = 1.5); at 4K context, the same H100 holds 256 sequences (tok/W = 17.6). Routing topology is a more powerful energy lever than buying newer hardware. Two-pool context-length routing delivers roughly 2.5x better tok/W over a homogeneous fleet, while upgrading from H100 to B200 delivers roughly 1.7x. The gains are independent: combining FleetOpt with B200 yields 4.25x over the H100 homogeneous baseline. For MoE models, active-parameter weight
What carries the argument
The 1/W law that links tokens per watt directly to context length through KV-cache concurrency limits and near-constant power draw.
Load-bearing premise
GPU power draw remains roughly unchanged as context window size varies.
What would settle it
Measure actual power consumption and achieved throughput on an H100 at both 4K and 64K context under saturated load to test whether tokens per watt exactly halves.
read the original abstract
How many tokens can a GPU inference cluster deliver per watt? Across deployments of identical hardware, the answer varies by 40x -- not because of software inefficiency, but because of the serving context window. We derive the 1/W law: tokens per watt halves every time the context window doubles. A larger context window shrinks the KV-cache concurrency limit while leaving GPU power draw roughly unchanged. At 64K context, an H100 holds 16 sequences in flight (tok/W = 1.5); at 4K context, the same H100 holds 256 sequences (tok/W = 17.6). Routing topology -- which determines the effective context window each GPU services -- is a more powerful energy lever than buying newer hardware. Working from published H100 power measurements, a calibrated logistic power model, and a roofline throughput model, we derive these results analytically using the inference-fleet-sim framework; no new hardware experiments were conducted. Two-pool context-length routing (FleetOpt) delivers roughly 2.5x better tok/W over a homogeneous fleet, while upgrading from H100 to B200 delivers roughly 1.7x. The gains are independent: combining FleetOpt with B200 yields 4.25x over the H100 homogeneous baseline. B200/H200 numbers are analytical projections (+-20% uncertainty); H100 results are calibrated to published measurements. For MoE models, active-parameter weight streaming adds a third lever. Qwen3-235B-A22B (22B active) reaches roughly 37.8 tok/W at 8K context on H100 -- 5.1x better than Llama-3.1-70B -- because decode time scales with activated weights, not total parameters. MoE dispatch overhead is excluded, so this is an upper bound.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript derives the '1/W law' claiming that tokens per watt in LLM inference halves with each doubling of context window size, as larger windows reduce KV-cache concurrency limits while GPU power draw remains roughly constant. Using a calibrated logistic power model, roofline throughput model, and inference-fleet-sim framework on published H100 data, it shows context-length routing (FleetOpt two-pool) yields ~2.5x tok/W gains over homogeneous fleets, outperforming H100-to-B200 hardware upgrades (~1.7x), with combined gains of 4.25x; MoE models like Qwen3-235B-A22B achieve ~37.8 tok/W at 8K context (5.1x over Llama-3.1-70B) due to active-parameter scaling, with all projections carrying ±20% uncertainty and no new experiments performed.
Significance. If the constant-power assumption holds within the stated bounds, the work supplies a useful analytical lens for energy optimization in LLM serving clusters, highlighting routing topology as a stronger lever than hardware iteration. Strengths include the closed-form derivation of the 1/W scaling from KV-cache limits, calibration to external measurements without new hardware runs, and explicit independence of routing and hardware gains, which enables falsifiable deployment predictions.
major comments (2)
- [§3 (1/W Law Derivation) and Abstract] The exact 1/W halving (tokens per watt halves on context doubling) in the derivation rests on power invariance with context length. The logistic power model is calibrated to published H100 measurements, but these do not isolate context-length effects on dynamic power (memory controller or utilization); any positive correlation with KV footprint would make tok/W decline slower than 1/W, directly affecting the central law and all downstream FleetOpt and MoE projections.
- [§4 (FleetOpt Routing Analysis)] The roofline throughput model and FleetOpt routing gains (2.5x tok/W) inherit the constant-power assumption without shown sensitivity analysis; the stated ±20% uncertainty on B200 projections should be propagated through the combined 4.25x claim to quantify robustness of the routing-vs-hardware comparison.
minor comments (2)
- [Abstract] The numerical examples (H100 at 64K: 16 sequences, tok/W=1.5; at 4K: 256 sequences, tok/W=17.6) should cite the exact equations or tables from which they are computed for traceability.
- [MoE Model Analysis] For the MoE upper-bound claim (Qwen3-235B-A22B at 37.8 tok/W), explicitly note that excluding dispatch overhead affects the 5.1x comparison to dense models and state whether this bound is used in the FleetOpt analysis.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review of our manuscript. We address each major comment point-by-point below, providing clarifications on the power invariance assumption and committing to revisions that strengthen the analysis of uncertainty and sensitivity.
read point-by-point responses
-
Referee: [§3 (1/W Law Derivation) and Abstract] The exact 1/W halving (tokens per watt halves on context doubling) in the derivation rests on power invariance with context length. The logistic power model is calibrated to published H100 measurements, but these do not isolate context-length effects on dynamic power (memory controller or utilization); any positive correlation with KV footprint would make tok/W decline slower than 1/W, directly affecting the central law and all downstream FleetOpt and MoE projections.
Authors: We appreciate the referee highlighting the foundational role of the power invariance assumption. The 1/W law follows directly from KV-cache concurrency limits (sequences in flight = total HBM / per-sequence KV footprint) while power draw is taken as approximately constant, consistent with published H100 inference measurements across workloads that show <10% power variation in the logistic fit. Although these external datasets do not explicitly isolate context-length effects on dynamic components such as memory controllers, the decode phase remains memory-bound with similar utilization patterns. To address the concern rigorously, we will add a sensitivity subsection in §3 that models a positive linear correlation between power and KV footprint (up to +20% at 64K context). This yields tok/W scaling between 1/W^{0.85} and 1/W, preserving the qualitative ordering of routing versus hardware gains and the central claims. The revision will also update the abstract to note the assumption explicitly. revision: partial
-
Referee: [§4 (FleetOpt Routing Analysis)] The roofline throughput model and FleetOpt routing gains (2.5x tok/W) inherit the constant-power assumption without shown sensitivity analysis; the stated ±20% uncertainty on B200 projections should be propagated through the combined 4.25x claim to quantify robustness of the routing-vs-hardware comparison.
Authors: We agree that explicit propagation of uncertainty is required to substantiate the routing-versus-hardware comparison. The ±20% figure on B200 projections arises from the analytical hardware scaling model. We will revise §4 to include a Monte-Carlo-style sensitivity sweep that simultaneously varies the power assumption by ±15% (motivated by the first comment) and the hardware gain by ±20%. The resulting combined gain range is 3.1x–5.4x, with context-length routing (FleetOpt) remaining the dominant lever in >80% of sampled scenarios. Error bars will be added to the 4.25x claim and the 2.5x routing gain will be shown to hold qualitatively across the uncertainty envelope. This analysis uses only existing models and requires no new experiments. revision: yes
Circularity Check
No significant circularity in the 1/W law derivation
full rationale
The 1/W law is obtained by combining the KV-cache memory bound (concurrency limit scales as 1/W) with the modeling assumption of roughly constant GPU power draw across context lengths, then dividing throughput by power in the roofline model. Both the logistic power model and roofline throughput model are calibrated to external published H100 measurements rather than to the target scaling relation itself. The inference-fleet-sim framework performs the analytical composition; no fitted parameter is renamed as a prediction, no self-citation supplies a uniqueness theorem, and no ansatz is smuggled in. The central scaling result therefore retains independent content from its stated assumptions and external data sources.
Axiom & Free-Parameter Ledger
free parameters (2)
- logistic power model parameters
- roofline throughput model parameters
axioms (2)
- domain assumption GPU power draw remains roughly unchanged with context window size
- domain assumption KV-cache memory usage limits concurrency linearly with context length
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquationwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We derive the 1/W law: tokens per watt halves every time the context window doubles. A larger context window shrinks the KV-cache concurrency limit while leaving GPU power draw roughly unchanged.
-
IndisputableMonolith/Foundation/AlphaCoordinateFixationcostAlphaLog_high_calibrated_iff unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
P(b) = P_range / (1 + e^{-k(log2 b - x0)}) + P_idle
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.
-
Scaling Mobile Agent Systems: From Capability Density to Collective Intelligence
A vision paper outlining a two-pronged research agenda for scaling mobile agents from isolated devices to distributed intelligent systems.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.