
arXiv: 2605.20706 · v1 · pith:Z7TJCSCUnew · submitted 2026-05-20 · 💻 cs.DC · cs.AI · cs.LG

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

Pith reviewed 2026-05-21 02:48 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.LG
keywords WebGPU, LLM inference, browser, llama.cpp, quantization, performance portability, memory efficiency

The pith

LlamaWeb is a WebGPU backend for llama.cpp that cuts browser LLM memory use by 29-33 percent while raising decode throughput by 45-69 percent across devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LlamaWeb as a new backend for llama.cpp that runs large language models directly inside web browsers. It lowers memory demands through upfront static planning of memory allocation and streamlined model loading. A tunable set of GPU kernels handles differences across hardware and browsers while supporting several model weight formats at once. Tests across 16 devices from eight vendors and four weight formats show the claimed memory and speed gains over other browser frameworks. The work matters because it opens a path to private, device-local AI tools that do not require sending data to remote servers or installing native software.
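To make the idea of upfront static memory planning concrete, here is a minimal sketch. It is not the paper's planner: the `Tensor` fields, the `plan_offsets` name, and the greedy first-fit strategy are all assumptions chosen for illustration. The point it shows is that when tensor lifetimes are known before inference starts, tensors that are never live at the same time can share bytes in one preallocated buffer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tensor:
    # Hypothetical description of one intermediate tensor in a compute graph.
    name: str
    size: int        # bytes
    first_use: int   # index of the first graph step that touches the tensor
    last_use: int    # index of the last graph step that touches the tensor


def plan_offsets(tensors):
    """Greedy first-fit: place each tensor at the lowest offset that does not
    overlap any already-placed tensor whose lifetime intersects its own."""
    placed = []   # list of (offset, tensor)
    offsets = {}
    # Placing larger tensors first tends to reduce fragmentation.
    for t in sorted(tensors, key=lambda t: -t.size):
        # Byte ranges held by tensors that are live at the same time as t.
        busy = sorted(
            (off, off + p.size)
            for off, p in placed
            if not (p.last_use < t.first_use or t.last_use < p.first_use)
        )
        offset = 0
        for start, end in busy:
            if offset + t.size <= start:
                break                  # t fits in the gap before this range
            offset = max(offset, end)
        placed.append((offset, t))
        offsets[t.name] = offset
    peak = max((off + t.size for off, t in placed), default=0)
    return offsets, peak


# Three 100-byte tensors; "a" and "c" are never live at the same step.
tensors = [Tensor("a", 100, 0, 1), Tensor("b", 100, 1, 2), Tensor("c", 100, 2, 3)]
offsets, peak = plan_offsets(tensors)
print(offsets, peak)
```

In this toy graph the plan needs 200 bytes instead of the 300 that one buffer per tensor would take, because "c" reuses the bytes freed by "a".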

Core claim

LlamaWeb enables memory-efficient and performance-portable LLM inference in the browser by reducing memory overhead through static memory planning and efficient model loading, addressing cross-device variability through a tunable kernel library, and supporting multiple quantization formats through templated GPU kernels.
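One way to read "tunable kernel library" is that each kernel exposes a few parameters and the best setting is chosen per device. The sketch below assumes that reading; the parameter name `workgroup_size`, the `pick_fastest` helper, and the synthetic cost model are hypothetical and do not come from the paper. In practice `measure` would time a real kernel launch on the target device.

```python
import statistics


def pick_fastest(candidates, measure, repeats=5):
    """Return (candidate, cost) for the candidate with the lowest median cost.

    `measure` is any callable that runs one configuration and returns its
    cost (for example, seconds per kernel launch on the target device).
    """
    best, best_cost = None, float("inf")
    for cand in candidates:
        # The median damps outliers such as a one-off shader compilation stall.
        cost = statistics.median(measure(cand) for _ in range(repeats))
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost


# Synthetic stand-in for a device on which 128 threads per workgroup is best.
def fake_measure(config):
    return abs(config["workgroup_size"] - 128) + 1.0


configs = [{"workgroup_size": n} for n in (32, 64, 128, 256)]
print(pick_fastest(configs, fake_measure))
```

Passing the measurement function in as an argument keeps the selection logic independent of how timing is done, which matters in a browser where timer resolution and driver behavior vary.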

What carries the argument

Templated GPU kernels inside a tunable kernel library that together support multiple quantization formats while adapting to different devices and browsers.
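The templating idea can be illustrated on the CPU: keep the inner loop of a dot product format-agnostic and pass the dequantization step in as a parameter, much as a shader template might substitute a per-format decode routine. The sketch below is an assumption-laden illustration, not the paper's kernels. It uses a simplified q8_0-style block (32 weights sharing one scale, which matches ggml's q8_0 layout as far as I know, although ggml stores the scale as a 16-bit float and Python floats stand in for it here); all function names are hypothetical.

```python
Q8_0_BLOCK = 32  # weights per block in this simplified q8_0-style layout


def quantize_q8_0(values):
    """Quantize one block of 32 floats to (scale, list of ints in [-127, 127])."""
    assert len(values) == Q8_0_BLOCK
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 0.0
    q = [round(v / scale) if scale else 0 for v in values]
    return scale, q


def dequantize_q8_0(block):
    scale, q = block
    return [scale * v for v in q]


def dequantize_float(block):
    # Unquantized weights need no decoding.
    return list(block)


def dot_row(blocks, x, dequantize):
    """Format-agnostic inner loop: only `dequantize` knows the weight layout."""
    acc = 0.0
    i = 0
    for block in blocks:
        for w in dequantize(block):
            acc += w * x[i]
            i += 1
    return acc


weights = [float(i - 16) for i in range(Q8_0_BLOCK)]
x = [1.0] * Q8_0_BLOCK
exact = dot_row([weights], x, dequantize_float)
approx = dot_row([quantize_q8_0(weights)], x, dequantize_q8_0)
print(exact, approx)
```

Adding another format in this style means writing one more `dequantize_*` function while `dot_row` stays unchanged, which is the property the single-codebase claim depends on.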

If this is right

  • LLM inference becomes feasible on a wider range of consumer hardware without custom native code.
  • Multiple quantization formats can be supported from a single kernel codebase with minimal added overhead.
  • Browser-based applications can keep model weights and computations entirely on the client device.
  • Performance remains competitive with vendor-specific llama.cpp backends on some platforms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same static-planning approach could be applied to other inference engines to reduce browser memory footprints.
  • Local browser execution inherently keeps user prompts and outputs off remote servers, improving privacy for interactive AI tools.
  • Extending the kernel library to additional low-precision formats would further widen the set of runnable models on memory-limited devices.

Load-bearing premise

The performance and memory measurements collected on the 16 tested devices and four weight formats are representative of typical real-world browser usage patterns and hardware variability.

What would settle it

Run the same models on a device-browser pair outside the original test set, then check whether the memory reduction stays within the 29-33 percent band and the decode speedup stays within the 45-69 percent band.
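The check described here is simple arithmetic, sketched below. The helper names and the sample measurements (3000 MB against 2100 MB, 20 tokens/s against 30 tokens/s) are hypothetical placeholders, not data from the paper; only the two bands come from the reported results.

```python
def percent_reduction(baseline, candidate):
    """How much smaller `candidate` is than `baseline`, in percent."""
    return 100.0 * (baseline - candidate) / baseline


def percent_speedup(baseline, candidate):
    """How much faster `candidate` is than `baseline`, in percent."""
    return 100.0 * (candidate - baseline) / baseline


def within(value, low, high):
    return low <= value <= high


# Hypothetical measurements on a device outside the original test set.
memory_cut = percent_reduction(baseline=3000.0, candidate=2100.0)   # peak MB
decode_gain = percent_speedup(baseline=20.0, candidate=30.0)        # tokens/s

print(memory_cut, within(memory_cut, 29.0, 33.0))
print(decode_gain, within(decode_gain, 45.0, 69.0))
```

A value outside a band on one new device would not refute the paper by itself, but a consistent pattern of misses across several unseen devices would undercut the generalization.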

Figures

Figures reproduced from arXiv: 2605.20706 by Abhijit Ramesh, James Contini, Neha Abbas, Nikhil Jain, Reese Levine, Rithik Sharma, Tyler Sorensen, Zheyuan Chen.

Figure 1: Overview of llama.cpp’s core and backend design.
Figure 2: Breakdown of the LlamaWeb llama.cpp WebGPU …
Figure 3: Memory usage for the llama model with f16 weights during inference.
  … memory usage for the unified tab process for Safari, and the tab process and GPU renderer process for Chrome. Each framework was prompted to generate sufficient decode output for measurement. In all four cases, LlamaWeb uses the least memory, with a geometric mean of normalized peak memory usage 49% lower than WebLLM and 41% lower than T…
Figure 5: Throughput of the llama model across different llama.cpp backends and weight formats. The native backend is CUDA on the NVIDIA GPU, HIP on the AMD GPU, SYCL on the Intel GPU, and Metal on the Apple GPU.
  … backend underperforms the WebGPU backend during prefill by 3×, but outperforms even the native HIP backend during decode by 38% on the q4_k_m model. On the Intel GPU, the WebGPU backend, with safety checks …
Figure 7: Throughput on the llama model across four weight formats (q2_k, q4_k_m, q8_0, f16), grouped by the same device clusters as in Sec. 6.1.
  … running LlamaWeb natively with checks from Sec. 6.2, LlamaWeb outperforms the prefill numbers from other frameworks on every device except WebLLM on the Apple M4 Pro, with a geometric mean speedup of 88% over WebLLM and 205% over Transformers.js. Therefore, although subgro…
Figure 8: Coverage matrices for the portability and cross-quantization studies. Each cell reports throughput on a shared log …
Figure 9: Per-device prefill and decode throughput by model, part 1 of 2.
Figure 10: Per-device prefill and decode throughput by model, part 2 of 2.
Figure 11: Per-device prefill and decode throughput for …
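
The text accompanying Figures 3 and 7 summarizes results as geometric means of normalized values. As a reminder of how such a summary is computed, here is a short sketch; the ratios are invented for illustration and are not the paper's data. Each ratio is one framework's measurement divided by a baseline's on the same device.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed in log space."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical normalized peak-memory ratios on four device/browser pairs.
memory_ratios = [0.4, 0.5, 0.625, 0.5]
g = geometric_mean(memory_ratios)
print(f"{(1.0 - g) * 100:.0f}% lower memory")

# For throughput, a geometric-mean ratio of 1.88 corresponds to an 88% speedup.
print(f"{(1.88 - 1.0) * 100:.0f}% speedup")
```

The geometric mean is the usual choice for averaging ratios because it treats a halving on one device and a doubling on another as cancelling out, which an arithmetic mean does not.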
Original abstract

Running language models in the browser presents a unique opportunity to build efficient, private, and portable AI applications, but requires contending with constrained memory availability and heterogeneous hardware targets. To realize this opportunity, we present Llamas on the Web (LlamaWeb), a WebGPU backend for llama.cpp that enables memory-efficient and performance-portable LLM inference across a wide range of model weight formats in the browser. Our design significantly reduces memory overhead through static memory planning and efficient model loading, addresses cross-device variability through a tunable kernel library, and introduces templated GPU kernels that support performant implementations of numerous quantization formats, enabling broad model support and extensibility to new formats. We evaluate LlamaWeb on 16 devices from 8 vendors, collecting data from 10 language models and four model weight formats. We compare LlamaWeb against existing browser-based LLM frameworks and find that LlamaWeb requires 29-33% less memory across several combinations of device, browser, and operating system. We also evaluate LlamaWeb's performance against these frameworks and find that it increases decode throughput by 45-69% across four GPUs from separate vendors. In addition, we compare LlamaWeb's performance against other llama.cpp backends, where it is competitive with and even beats vendor-specific backend performance on some devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents LlamaWeb, a WebGPU backend for llama.cpp enabling memory-efficient, performance-portable, and multi-precision LLM inference in browsers. It uses static memory planning and efficient model loading to reduce overhead, a tunable kernel library to handle device variability, and templated GPU kernels supporting multiple quantization formats. Evaluations on 16 devices from 8 vendors with 10 models and 4 weight formats report 29-33% lower memory usage versus other browser frameworks and 45-69% higher decode throughput on four GPUs from separate vendors, while remaining competitive with other llama.cpp backends.

Significance. If the reported gains prove robust, this implementation could meaningfully advance browser-based LLM deployment by mitigating memory constraints and hardware heterogeneity while preserving privacy. Notable strengths include the broad multi-vendor device coverage, explicit support for multiple weight formats with extensibility, and direct integration with the established llama.cpp ecosystem, which facilitates reproducibility and adoption.

major comments (1)
  1. §5 (Evaluation): The headline claims of 29-33% memory reduction and 45-69% throughput improvement rest on measurements from 16 devices. The section provides no device-selection criteria, no run-to-run variance or error bars, and no sensitivity analysis for WebGPU driver differences, shader compilation overhead, or sandbox memory behavior under concurrent tab load. These omissions are load-bearing for assessing whether the quoted percentages generalize beyond the tested sample.
minor comments (2)
  1. Abstract: The comparison baselines for the memory and throughput numbers are referenced only generically; naming the specific competing browser frameworks and llama.cpp backends in the abstract would improve immediate clarity.
  2. §4 (Implementation): The description of the templated kernel library would benefit from a small table listing supported quantization formats and their corresponding kernel variants for quick reference.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which highlights important considerations for strengthening the evaluation section. We address the major comment point by point below and outline the revisions we will make.

Point-by-point responses
  1. Referee: §5 (Evaluation): The headline claims of 29-33% memory reduction and 45-69% throughput improvement rest on measurements from 16 devices. The section provides no device-selection criteria, no run-to-run variance or error bars, and no sensitivity analysis for WebGPU driver differences, shader compilation overhead, or sandbox memory behavior under concurrent tab load. These omissions are load-bearing for assessing whether the quoted percentages generalize beyond the tested sample.

    Authors: We agree that additional methodological details are needed to support the generalizability of the reported gains. In the revised manuscript, we will expand §5 with explicit device-selection criteria, explaining that the 16 devices were chosen to span 8 vendors and a range of performance tiers (from integrated graphics to high-end discrete GPUs) to evaluate portability. We will also add run-to-run variance information and error bars for the memory and throughput metrics, based on repeated measurements collected during our experiments. For sensitivity to WebGPU driver differences, shader compilation overhead, and concurrent-tab memory behavior, we will include a new discussion subsection acknowledging these factors as sources of variability in browser environments and reporting any qualitative observations from our testing across browsers and OSes. These revisions will be made without altering the core results. revision: yes
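
The run-to-run variance and error bars discussed in this exchange could be reported along the following lines. This is a sketch under stated assumptions, not the authors' procedure: the `summarize` helper is hypothetical, the sample values are invented, and the interval uses a normal approximation (for only a handful of runs, a Student-t multiplier would give a wider and more appropriate interval).

```python
import math
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation, approximate 95% interval)."""
    n = len(samples)
    mean = statistics.fmean(samples)
    sd = statistics.stdev(samples)          # requires at least two samples
    half_width = 1.96 * sd / math.sqrt(n)   # normal approximation
    return mean, sd, (mean - half_width, mean + half_width)


# Hypothetical decode throughput (tokens/s) from three repeated runs.
mean, sd, (low, high) = summarize([10.0, 12.0, 14.0])
print(f"{mean:.1f} tokens/s, sd {sd:.1f}, 95% interval [{low:.2f}, {high:.2f}]")
```

Reporting the interval alongside each bar lets a reader judge whether a gap between two frameworks on one device exceeds the noise between runs.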

Circularity Check

0 steps flagged

No significant circularity; claims rest on direct implementation and empirical benchmarks

Full rationale

The paper describes an engineering implementation of a WebGPU backend for llama.cpp, including static memory planning, tunable kernels, and templated quantization support. All headline performance numbers (29-33% memory reduction, 45-69% decode throughput gains) are presented as direct outcomes of measurements collected on 16 devices across 8 vendors, 10 models, and 4 weight formats. No equations, fitted parameters, or uniqueness theorems are invoked; the central claims do not reduce to self-citations or to quantities defined in terms of the reported results. The work is therefore self-contained as an implementation-plus-measurement contribution with external comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an applied systems paper whose central claims rest on implementation choices and empirical measurements rather than mathematical axioms or new physical entities.

axioms (1)
  • Domain assumption: WebGPU is available and sufficiently stable on the target browsers and devices for the reported kernels to execute.
    The entire system depends on the WebGPU API existing and behaving as expected across the tested hardware.


