Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

Adrian Gamarra Lafuente; Avanika Narayan; Azalia Mirhoseini; Ben Athiwaratkun; Christopher Ré; Etash Kumar Guha; Hakki Orhun Akengin; Herumb Shandilya; John Hennessy; Jon Saad-Falcon

arxiv: 2511.07885 · v4 · pith:ACE6XRRHnew · submitted 2025-11-11 · 💻 cs.DC · cs.AI · cs.CL · cs.LG

Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya, Adrian Gamarra Lafuente, Medhya Goel, Rebecca Joseph, Shlok Natarajan, Etash Kumar Guha, Shang Zhu, Ben Athiwaratkun, John Hennessy, Azalia Mirhoseini, Christopher Ré

Pith reviewed 2026-05-22 12:16 UTC · model grok-4.3

keywords: local LLM inference · intelligence per watt · energy efficiency · edge AI · distributed computing · power-aware AI · model serving · query redistribution

The pith

Local models answer 88.7 percent of real-world queries, but local accelerators still deliver at least 1.4 times lower intelligence per watt than cloud accelerators running the same models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether small local language models can take over a meaningful share of everyday AI queries from big centralized servers. It introduces intelligence per watt as the single number that combines how often a model gets the answer right with how much power it draws on a power-constrained device such as a laptop. Between 2023 and 2025 the metric rose 5.3-fold, and the fraction of queries that local hardware can handle rose from 23 percent to 71 percent. The authors also find that local accelerators run identical models with at least 1.4 times lower intelligence per watt than cloud hardware, which leaves clear headroom for local accelerator optimization before load can spread away from data centers efficiently.

Core claim

Local language models with at most 20 billion active parameters successfully answer 88.7 percent of one million real-world single-turn chat and reasoning queries, where success is measured as win rate against frontier models. Dividing this accuracy by measured power draw gives intelligence per watt (IPW), which improves 5.3 times between 2023 and 2025. Over the same period the share of queries that can be served locally rises from 23.2 percent to 71.3 percent. Local accelerators record at least 1.4 times lower IPW than cloud accelerators running the same models.

What carries the argument

Intelligence per watt (IPW), the ratio of task accuracy to power consumed, used to rank model-accelerator pairs and to track how much real-world demand can shift to local hardware.
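
A minimal sketch of the metric, assuming a simple per-configuration record. The `Measurement` layout, the function names, and the numbers in the usage lines are illustrative placeholders, not the paper's profiling harness or its measurements. Dividing accuracy by mean power gives IPW; dividing by mean energy per query gives the intelligence-per-joule variant that appears in the figure captions.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Measurement:
    """One model-accelerator configuration (hypothetical record layout)."""
    name: str
    accuracy: float       # fraction of queries judged correct, in [0, 1]
    mean_power_w: float   # average power draw during inference, in watts
    mean_energy_j: float  # average energy per query, in joules


def intelligence_per_watt(m: Measurement) -> float:
    """Task accuracy per unit of power (accuracy / watts)."""
    if m.mean_power_w <= 0:
        raise ValueError("mean_power_w must be positive")
    return m.accuracy / m.mean_power_w


def intelligence_per_joule(m: Measurement) -> float:
    """Task accuracy per unit of energy (accuracy / joules per query)."""
    if m.mean_energy_j <= 0:
        raise ValueError("mean_energy_j must be positive")
    return m.accuracy / m.mean_energy_j


def rank_by_ipw(measurements: Iterable[Measurement]) -> List[Measurement]:
    """Order configurations from highest to lowest IPW."""
    return sorted(measurements, key=intelligence_per_watt, reverse=True)


# Placeholder values for illustration only; not measurements from the paper.
configs = [
    Measurement("config-a", accuracy=0.60, mean_power_w=40.0, mean_energy_j=400.0),
    Measurement("config-b", accuracy=0.60, mean_power_w=25.0, mean_energy_j=100.0),
]
ranking = [m.name for m in rank_by_ipw(configs)]
```

Because accuracy is the numerator, two configurations running the same model differ in IPW only through their power draw, which is why the same-model local-versus-cloud comparison isolates the accelerator.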

If this is right

Local accelerators still trail cloud accelerators by at least 1.4 times in IPW on identical models, so there is headroom for local accelerator optimization to make on-device serving more efficient.
The 71.3 percent locally serviceable coverage means a large slice of daily traffic can move away from centralized infrastructure while remaining competitive with frontier models under the win-rate measure.
Continued 5x-scale IPW growth would make local inference the default choice for most chat and reasoning tasks within a few more hardware generations.
Domain variation in accuracy shows that routing decisions can be made per query type rather than blanket adoption of local or cloud paths.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid systems could automatically send only the hardest queries to the cloud while keeping the rest local, cutting both latency and energy for the average user.
Because local processing keeps data on the device, the shift measured by IPW would also improve privacy for the subset of queries that never leave the laptop.
Manufacturers could publish IPW ratings for new chips the way they publish battery life, giving consumers a direct way to compare AI performance across devices.
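
The hybrid idea in the first point can be explored with a back-of-the-envelope energy model. The sketch below assumes a router that keeps a query local when it predicts the local model can handle it, and assumes that a query wrongly kept local is retried in the cloud. The retry policy, the per-query energy figures, and all names are hypothetical and are not taken from the paper.

```python
def expected_energy_per_query(
    local_share: float,
    router_accuracy: float,
    local_joules: float,
    cloud_joules: float,
) -> float:
    """Expected joules per query under an imperfect local/cloud router.

    local_share: fraction of queries a local model can actually serve.
    router_accuracy: probability the router classifies a query correctly.
    """
    for name, value in (("local_share", local_share),
                        ("router_accuracy", router_accuracy)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    wrong = 1.0 - router_accuracy
    # Locally serviceable queries: a correct decision stays local,
    # a mistake sends the query to the cloud unnecessarily.
    serviceable = router_accuracy * local_joules + wrong * cloud_joules
    # Queries that need the cloud: a mistake pays for a failed local
    # attempt plus the cloud retry (assumed policy).
    not_serviceable = (router_accuracy * cloud_joules
                       + wrong * (local_joules + cloud_joules))
    return local_share * serviceable + (1.0 - local_share) * not_serviceable


def energy_savings(
    local_share: float,
    router_accuracy: float,
    local_joules: float,
    cloud_joules: float,
) -> float:
    """Fractional saving relative to sending every query to the cloud."""
    routed = expected_energy_per_query(
        local_share, router_accuracy, local_joules, cloud_joules
    )
    return 1.0 - routed / cloud_joules


# Placeholder inputs: half the queries are locally serviceable, the router
# is perfect, and a local query costs one tenth of a cloud query.
savings = energy_savings(0.5, 1.0, 10.0, 100.0)
```

With these placeholder inputs the model gives a 45 percent saving; lowering `router_accuracy` shrinks the saving because misrouted queries either go to the cloud needlessly or pay for both attempts.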

Load-bearing premise

That the one million single-turn queries are representative of actual user demand, and that win rate against frontier models is a reliable stand-in for the correctness of local-model answers.

What would settle it

The claim would be undermined if a fresh sample of several hundred thousand queries, drawn from the same distribution but collected after 2025, showed local-model win rates falling below 70 percent or IPW gains stalling.

Figures

Figures reproduced from arXiv: 2511.07885 by Adrian Gamarra Lafuente, Avanika Narayan, Azalia Mirhoseini, Ben Athiwaratkun, Christopher R\'e, Etash Kumar Guha, Hakki Orhun Akengin, Herumb Shandilya, John Hennessy, Jon Saad-Falcon, J. Wes Griffin, Medhya Goel, Rebecca Joseph, Shang Zhu, Shlok Natarajan.

**Figure 1.** Intelligence per Watt: A Study of Local Intelligence Efficiency. We present the first systematic study of local AI inference efficiency across models, hardware, and real-world workloads. (Left) Intelligence efficiency is defined as task accuracy per unit of power, capturing both capabilities delivered and energy consumed. (Left-Middle) We conduct comprehensive performance profiling across 20+ state-of-the-…

**Figure 2.** Local Models Rival Cloud Models Across Diverse Benchmarks: Individual model performance scales with size, ranging from 31.5–69.4% for IBM GRANITE4-H-SMALL, 30.0–83.6% for GEMMA3-12B, 51.5–80.4% for GPT-OSS-120B, and 66.5–89.5% for GEMINI 2.5 PRO. Local routing (best local LM per query) achieves 97.8%, 88.3%, 77.0%, and 92.4% on WILDCHAT, NATURALREASONING, SUPERGPQA, and MMLU PRO respectively, surpassing c…

**Figure 3.** Rapid Improvement of Local LMs across Chat and Reasoning Queries: We evaluate the performance of SOTA local models released between April 2024 and August 2025 on WILDCHAT and NATURALREASONING. On WILDCHAT (left), local models show a win/tie rate of 78.2% against QWEN3-235B as of August 2025, compared to just 28.0% in April 2024, a 2.8× improvement in 16 months. On NATURALREASONING (right), local models achi…

**Figure 4.** Increasing GPU Memory of Consumer Accelerators: Memory capacity (GB) for local accelerators. Over the past decade, local hardware has significantly closed the memory gap with cloud-grade accelerators, particularly since 2020, driven by advances in high bandwidth memory (HBM) components and unified memory architectures.

**Figure 5.** Increase in Intelligence per Joule for Local LMs and Accelerators: Efficiency improved 18.0× over 16 months, decomposed into 3.1× from better local LMs and 5.9× from better local accelerators.

…tency across QWEN3 and GPT-OSS model variants. Here the efficiency gaps widen substantially: the B200 achieves 1.6× to 2.3× higher intelligence per joule than the M4 MAX, while the SN40L achieves 6.5× to 7.4× higher …
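
If the two factors in the Figure 5 caption are read as a multiplicative decomposition (an assumption about how they combine, which the caption does not state explicitly), their product approximately reproduces the reported total:

```latex
\[
\underbrace{3.1}_{\text{better local LMs}}
\times
\underbrace{5.9}_{\text{better local accelerators}}
= 18.29 \approx 18.0
\]
```

The small gap is within what rounding each factor to one decimal place would produce.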

**Figure 6.** Energy, Compute, and Capital Gains from Model Routing. Cumulative resource consumption over 24 hours and 80.2M LLM queries (Wang et al., 2025). Using our local-cloud router between 4 small LMs on Apple M4 Max and QWEN3-235B on an H200 yields substantial savings at various routing accuracies. For the 80% accurate router, we observe: 64.3% in energy savings, 61.8% in compute, and 59.0% in cost compared to na…

**Figure 7.** Local Win/Tie-Rate vs. Cloud LMs by Domain. Stacked bars show the fraction of single-turn chat and reasoning queries handled by local LMs (< 20B active parameters; blue) versus those routed to frontier models in the cloud (red), computed per economic index domain (Appel et al., 2025).

**Figure 8.** Chat Task Performance by Difficulty Level and Year. Model success rates across four difficulty levels and three model generations (2023, 2024, 2025). The data reveals dramatic progress across all difficulty levels, with 2023 models achieving 28.79% overall success rising to 98.12% by 2025. Notably, Levels 1-3 approach near-perfect performance (98-99%), while Level 4 shows the largest relative improvement (…

**Figure 9.** Reasoning Task Performance by Difficulty Level and Year. Model success rates across five difficulty levels and three model generations. The benchmark shows a three-tier saturation pattern: near-complete (98-99% on Levels 1-2), approaching saturation (85-92% on Levels 3-4), and wide-open frontier (51% on Level 5).

**Figure 10.** Perplexity and Accuracy per Joule Trends across WILDCHAT and NATURALREASONING.

… achieving 98-99% success on levels 1-3 and 92.6% on level 4. Absolute improvements range from +55.4 percentage points (pp) for level 1 to +76.4 pp for level 3, indicating relatively uniform capability gains. For reasoning tasks (see …

**Figure 11.** Minimal Accuracy Degradation Shifting from FP16 to FP4 for Open-Source Local Models: Evaluation across three reasoning datasets (N = 10,000 each) shows 2-3% accuracy loss per precision step, demonstrating that FP8/FP4 quantization enables efficient deployment with acceptable performance tradeoffs.

QWEN3-235B-A22B achieves 59.2% accuracy covering $9.3T in relevant GDP (31.9% of total U.S. GDP), while …

**Figure 12.** Open-Source Local LMs Performance vs. U.S. GDP - WildChat and Natural Reasoning: Model accuracy on WildChat and Natural Reasoning benchmarks plotted against relevant GDP in trillions of dollars. Both benchmarks show continued performance improvements as training compute scales across models from Qwen3B-4B to Qwen3B-A22B-235B. For our calculations, we compute the weighted sum of an LM’s accuracy on each U.…

**Figure 13.** Open-Source Local LMs Performance vs. U.S. GDP - SuperGPQA and MMLU Pro: Model accuracy on SuperGPQA and MMLU Pro benchmarks plotted against relevant GDP in trillions of dollars. Both benchmarks show continued performance improvements as training compute scales across models from Qwen3B-4B to Qwen3B-A22B-235B. For our calculations, we compute the weighted sum of an LM’s accuracy on each U.S. Labor categor…

Original abstract

Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure. Demand growth strains this paradigm faster than providers can scale. Two advances create an opportunity to rethink it: small, local LMs (<=20B active parameters) now achieve competitive performance to frontier models on many tasks, and local accelerators (e.g., Apple M4 Max) can host these models at interactive latencies. This raises the question: can local inference viably redistribute demand from centralized infrastructure? This requires measuring both whether local LMs can accurately answer real-world queries and whether they can do so efficiently on power-constrained devices (e.g., laptops). We propose intelligence per watt (IPW), task accuracy per unit of power, as a unified metric for the capability and efficiency of local inference across model-accelerator configurations. We evaluate 20+ state-of-the-art local LMs, 8 hardware accelerators (local and cloud), and 1M real-world single-turn chat and reasoning queries. For each query, we measure accuracy (local LM win rate against frontier models), energy, latency, and power. We find three key results. First, local LMs successfully answer 88.7% of these queries, with accuracy varying by domain. Second, longitudinal analysis from 2023-2025 shows IPW improved 5.3x, driven by both algorithmic and accelerator advances, with locally-serviceable query coverage rising from 23.2% to 71.3%. Third, local accelerators achieve at least 1.4x lower IPW than cloud accelerators running identical models, revealing significant headroom for local accelerator optimization. These findings demonstrate that local inference can meaningfully redistribute demand from centralized infrastructure for a substantial subset of queries, with IPW serving as the critical metric for tracking this transition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper supplies real hardware measurements and a 5.3x longitudinal IPW trend on a million queries, but the 88.7% success rate rests on an uncalibrated win-rate proxy that does not establish objective accuracy.

The paper's useful part is the scale of the experiment and the before-and-after numbers. They ran a million single-turn queries across more than twenty local models and eight accelerators, then tracked IPW from 2023 to 2025. That shows a 5.3 times efficiency gain and coverage rising from 23% to 71%, with local accelerators still trailing cloud ones by at least 1.4 times in IPW on the same models. Those direct comparisons give engineers concrete data on where local inference already works and where accelerators still lag.

Referee Report

2 major / 0 minor

Summary. The paper introduces Intelligence per Watt (IPW) as a metric combining task accuracy and power consumption to assess local LLMs (≤20B active parameters) on accelerators. Using 1M single-turn chat and reasoning queries, it evaluates 20+ local models across 8 hardware platforms and reports three main findings: local models achieve an 88.7% success rate (defined via win rate against frontier models); IPW improved 5.3× from 2023–2025, with locally serviceable coverage rising from 23.2% to 71.3%; and local accelerators show at least 1.4× lower IPW than cloud accelerators for identical models. The paper takes these findings as support for redistributing demand away from centralized infrastructure.

Significance. If the empirical measurements are robust, the work provides a practical, unified metric for tracking local inference efficiency and quantifies the scale at which local models could offload cloud demand. The direct measurement approach on real queries and fixed model-hardware pairs, without reliance on fitted parameters or circular definitions, strengthens the contribution as a benchmark for sustainable distributed AI systems.

major comments (2)

[Abstract] Abstract: The headline claim that local LMs 'successfully answer 88.7%' of queries (and the derived 71.3% coverage) is defined solely via win rate against frontier models. This proxy does not establish objective accuracy, as both models may err on the same query, LLM judges can exhibit style/length biases, and many single-turn queries lack verifiable ground truth. Without calibration against human judgment or objective answers, the accuracy figures and all downstream IPW and redistribution conclusions rest on an unvalidated assumption.
[Abstract] Abstract and evaluation description: Concrete percentages (88.7%, 71.3%, 5.3×, 1.4×) are reported without error bars, query sampling methodology, domain stratification details, or statistical tests for significance. This makes it impossible to assess whether the observed trends and coverage claims are robust to sampling variation or measurement noise.
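
To make the first comment concrete, here is a minimal sketch of how a win/tie rate can be tallied from pairwise judge verdicts. The five verdict tokens are the ones listed in the judge prompt excerpted in the reference graph further down this page. Treating Assistant A as the local model, scanning raw judge output for the token, and the function names are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Tuple

# Verdict tokens from the judge prompt; longer alternatives are listed first.
VERDICT_RE = re.compile(r"\[\[(A>>B|A>B|A=B|B>A|B>>A)\]\]")

# Assumption: Assistant A is the local model, Assistant B the frontier model.
LOCAL_WIN_OR_TIE = {"A>>B", "A>B", "A=B"}


def extract_verdict(judge_output: str) -> Optional[str]:
    """Return the first verdict token found in the judge output, if any."""
    match = VERDICT_RE.search(judge_output)
    return match.group(1) if match else None


def win_tie_rate(judge_outputs: Iterable[str]) -> Tuple[float, Counter]:
    """Fraction of parseable verdicts in which the local model wins or ties."""
    parsed = [v for v in map(extract_verdict, judge_outputs) if v is not None]
    if not parsed:
        raise ValueError("no parseable verdicts")
    counts = Counter(parsed)
    rate = sum(counts[v] for v in LOCAL_WIN_OR_TIE) / len(parsed)
    return rate, counts


# Made-up judge outputs for illustration.
outputs = ["Verdict: [[A>B]]", "[[B>>A]]", "Both fine. [[A=B]]", "no verdict here"]
rate, counts = win_tie_rate(outputs)
```

The tally records only which answer the judge preferred, so a query on which both models are wrong can still count as a local tie or win. That is the gap between win rate and objective accuracy that the comment raises.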

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our use of win-rate proxies and the need for greater statistical transparency. We address each major comment below and will incorporate revisions to clarify limitations and add methodological details.

Referee: [Abstract] Abstract: The headline claim that local LMs 'successfully answer 88.7%' of queries (and the derived 71.3% coverage) is defined solely via win rate against frontier models. This proxy does not establish objective accuracy, as both models may err on the same query, LLM judges can exhibit style/length biases, and many single-turn queries lack verifiable ground truth. Without calibration against human judgment or objective answers, the accuracy figures and all downstream IPW and redistribution conclusions rest on an unvalidated assumption.

Authors: We agree that win rate against frontier models via LLM judges is a proxy metric rather than objective accuracy, and that it does not rule out joint errors or judge biases. This is a standard evaluation approach for open-ended real-world queries where ground truth is often absent, but we acknowledge the limitation. In revision we will rephrase the abstract and results to state that local models achieve competitive performance as measured by win rate, add explicit discussion of LLM-judge biases and lack of human calibration, and note that a subset of reasoning queries admit objective verification. We will also qualify the 71.3% coverage claim as the fraction of queries where local models are preferred or tied under this proxy. These changes temper absolute claims while preserving the relative efficiency comparisons that are the paper's core contribution. revision: partial
Referee: [Abstract] Abstract and evaluation description: Concrete percentages (88.7%, 71.3%, 5.3×, 1.4×) are reported without error bars, query sampling methodology, domain stratification details, or statistical tests for significance. This makes it impossible to assess whether the observed trends and coverage claims are robust to sampling variation or measurement noise.

Authors: We agree that the current presentation lacks error bars, detailed sampling description, and significance testing, which weakens assessment of robustness. The 1M queries were drawn from production logs with explicit stratification across chat and reasoning domains, but these details and variance estimates were omitted from the abstract and main text. In the revised manuscript we will add bootstrap-derived error bars to all headline percentages and improvement factors, expand the evaluation section with the full sampling methodology and domain stratification procedure, and include statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) for the reported trends and the 1.4× local-vs-cloud comparison. revision: yes
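
A minimal sketch of the kind of bootstrap error bar the rebuttal promises, assuming each query contributes a binary outcome (1 if the local model wins or ties, 0 otherwise). The percentile method, the resample count, and the function name are illustrative choices, not the authors' procedure.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    outcomes: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Point estimate and percentile bootstrap interval for a success rate."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    # Resample the queries with replacement and recompute the rate each time.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return sum(outcomes) / n, means[lo_idx], means[hi_idx]


# Made-up outcomes: 80 wins or ties out of 100 queries.
sample = [1] * 80 + [0] * 20
point, lower, upper = bootstrap_ci(sample)
```

Resampling whole queries treats them as independent draws. If the queries are stratified by domain, resampling within each stratum would be the more faithful variant.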

Circularity Check

0 steps flagged

No significant circularity; results from direct empirical measurements

The paper's central claims rest on empirical measurements: win-rate accuracy of local LMs against frontier models on 1M single-turn queries, plus direct energy/power/latency readings across model-hardware pairs. IPW is defined explicitly as accuracy per watt and computed from these observed values. Longitudinal IPW trends (5.3x improvement, coverage from 23.2% to 71.3%) are reported from data collected over 2023-2025 rather than from any fitted parameter or self-referential equation. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation chain. The win-rate proxy raises a separate validity question but does not create circularity under the specified patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical measurement study; the central claims rest on experimental data collection rather than mathematical axioms, free parameters, or newly postulated entities.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We propose intelligence per watt (IPW), task accuracy per unit of power, as a unified metric...

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: local LMs successfully answer 88.7% of these queries... IPW improved 5.3x

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU
cs.DC 2026-05 conditional novelty 7.0

LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browse...

The xPU-athalon: Quantifying the Competition of AI Acceleration
cs.AR 2026-04 unverdicted novelty 6.0

Quantitative benchmarks across recent AI accelerators reveal that optimal hardware choice varies with workload parameters and that several platforms incur substantially higher idle power than GPUs.

AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices
cs.LG 2026-05 unverdicted novelty 4.0

AgentStop uses execution signals to early-terminate failing local LLM agent trajectories, cutting energy use 15-20% with minimal utility loss.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 3 Pith papers · 5 internal anchors

[1]

Constitutional AI: Harmlessness from AI Feedback

URL https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf. Appel, R., McCrory, P., Tamkin, A., Stern, M., McCain, M., and Neylon, T. Anthropic economic index report: Uneven geographic and enterprise ai adoption, 2025. Apple. Apple m4 max - tech specs. Apple Support / Press Releases, 2024. URL https://support.apple. c...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.5281/zenodo.14212766 2025
[2]

GPT-4 Technical Report

URL https://resources.nvidia.com/en-us-data-center-overview-mc/en-us-data-center-overview/grace-hopper-superchip-datasheet-partner. Accessed: 2025-01-15. NVIDIA Corporation. NVIDIA DGX B200 System Architecture. Technical report, 2025a. URL https://resources.nvidia.com/en-us-dgx-systems/dgx-b200-datasheet. Accessed: 2025-01-15. NVIDIA Corporation. ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute

Accessed: 23 September 2025. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022. Oviedo, F., Kazhamiaka, F., Choukse, E., Kim, A., Luers, A., Na...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/hpec58863.2023.10363447 2025
[4]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

doi: 10.1145/3381831. URL https://dl.acm.org/doi/10.1145/3381831. Sevilla, J., Besiroglu, T., Cottier, B., You, J., Roldán, E., Villalobos, P., and Erdil, E. Can ai scaling continue through 2030?, 2024. URL https://epoch.ai/blog/can-ai-scaling-continue-through-2030. Accessed: 2025-10-06. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/3381831 2030
[5]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

URL https://www.bea.gov/data/gdp/gdp-industry. Wang, X., Chen, Z., Ren, J., Li, Y., Zhang, J., Sun, J., Mi, Y., et al. MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback. In The Twelfth International Conference on Learning Representations, 2024a. Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulra...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/3711896.3737413 2025
[6]

Arts, design, sports, entertainment, and media

URL https://www.epri.com/research/products/000000003002033669. Yuan, W., Yu, J., Jiang, S., Padthe, K., Li, Y., Wang, D., Kulikov, I., Cho, K., Tian, Y., Weston, J. E., and Li, X. Naturalreasoning: Reasoning in the wild with 2.8m challenging questions, 2025. URL https://arxiv.org/abs/2502.13124. Zhang, Y. The avengers: A simple recipe for uniting sma...

work page arXiv 2025
[7]

Read the query carefully

work page
[8]

Determine which job/occupation category the query relates to most closely

work page
[9]

If the query doesn’t clearly relate to any specific occupation category, use "None"

work page
[10]

Architecture and engineering

Respond with ONLY the category name, exactly as listed above 35 36Category: Solvability rates vary dramatically by domain and dataset type, where a query’s solvability is defined as its ability to be answered correctly by any of the available local LMs (e.g. Qwen models or GPT OSS). WILDCHATqueries show consistently high solvability across most domains (g...

work page 2025
[11]

Creativity & novelty | 18| Objective / technical | 1

Conciseness 5. Creativity & novelty | 18| Objective / technical | 1. Correctness only | 19 20When using the multi-criteria rubric, note strengths and weaknesses for **each** dimension . 21When using the single-criterion rubric, focus exclusively on factual / functional accuracy Measuring Intelligence Efficiency of Local AI Category WILDCHATMMLU PROSUPERGP...

work page 2025
[12]

Assistant A is significantly better: [[A>>B]]

work page
[13]

Assistant A is slightly better: [[A>B]]

work page
[14]

Tie, Assistant A is equal: [[A=B]]

work page
[15]

Assistant B is slightly better: [[B>A]]

work page
[16]

Assistant B is significantly better: [[B>>A]] Choose exactly one token from: `[[A>>B]]`, `[[A>B]]`, `[[A=B]]`, `[[B>A]]`, `[[B>>A]]`. --- ### Output format (strict) Return **only** a JSON object that matches the provided schema: NATURALREASONING LLM-judge Prompt You are evaluating a response to a scientific/technical question against a re...

work page
[17]

Scientific accuracy of facts and concepts

work page
[18]

Mathematical correctness (if applicable)

work page
[19]

Completeness of the answer

work page
[20]

difficulty

Technical precision Question: {question} Response: {response} Reference Answer: {reference} Return ONLY 'True' if the response is correct and complete, 'False' otherwise. Metric Description flops per request FLOPs per query. macs per request MACs per query; proxy for compute. per query joul...

work page 2025
