WebLLM: A High-Performance In-Browser LLM Inference Engine

Akaash R. Parthasarathy; Bohan Hou; Charlie F. Ruan; Hangrui Cao; Hongyi Jin; Meng-Shiun Yu; Ruihang Lai; Siyuan Feng; Sudeep Agarwal; Tianqi Chen

arxiv: 2412.15803 · v2 · submitted 2024-12-20 · 💻 cs.LG · cs.AI


Charlie F. Ruan, Yucheng Qin, Akaash R. Parthasarathy, Xun Zhou, Ruihang Lai, Hongyi Jin, Yixin Dong, Bohan Hou, Meng-Shiun Yu, Yiyan Zhai, Sudeep Agarwal, Hangrui Cao, Siyuan Feng, Tianqi Chen



keywords webllm · inference · models · webgpu · accessible · applications · browsers · deployment


Abstract

Advancements in large language models (LLMs) have unlocked remarkable capabilities. While deploying these models typically requires server-grade GPUs and cloud-based inference, the recent emergence of smaller open-source models and increasingly powerful consumer devices have made on-device deployment practical. The web browser as a platform for on-device deployment is universally accessible, provides a natural agentic environment, and conveniently abstracts out the different backends from diverse device vendors. To address this opportunity, we introduce WebLLM, an open-source JavaScript framework that enables high-performance LLM inference entirely within web browsers. WebLLM provides an OpenAI-style API for seamless integration into web applications, and leverages WebGPU for efficient local GPU acceleration and WebAssembly for performant CPU computation. With machine learning compilers MLC-LLM and Apache TVM, WebLLM leverages optimized WebGPU kernels, overcoming the absence of performant WebGPU kernel libraries. Evaluations show that WebLLM can retain up to 80% of native performance on the same device, with room to further close the gap. WebLLM paves the way for universally accessible, privacy-preserving, personalized, and locally powered LLM applications in web browsers. The code is available at: https://github.com/mlc-ai/web-llm.
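
The "OpenAI-style API" mentioned in the abstract can be illustrated with a minimal sketch of the request and response shapes a web application would exchange with such an engine. The abstract gives no signatures, so everything below is an assumption made for illustration: the `ChatEngine` interface is a hypothetical structural stand-in for an engine exposing `chat.completions.create`, the helper names `buildChatRequest`, `firstReply`, and `ask` are invented here, and the types are reduced to a few fields of the OpenAI chat-completions format.

```typescript
// OpenAI-style chat types, reduced to the fields this sketch uses (assumption).
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

interface ChatRequest {
  messages: ChatMessage[];
  temperature?: number;
  max_tokens?: number;
}

interface ChatResponse {
  choices: { message: { role: Role; content: string | null } }[];
}

// Hypothetical structural interface: any object exposing chat.completions.create.
// It stands in for an in-browser engine so the sketch does not need WebGPU.
interface ChatEngine {
  chat: { completions: { create(req: ChatRequest): Promise<ChatResponse> } };
}

// Build a two-message request: one system prompt followed by one user prompt.
function buildChatRequest(systemPrompt: string, userPrompt: string): ChatRequest {
  return {
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userPrompt },
    ],
    temperature: 0.7, // illustrative value, not taken from the paper
  };
}

// Return the text of the first choice, or an empty string if there is none.
function firstReply(response: ChatResponse): string {
  return response.choices[0]?.message.content ?? "";
}

// Send one user prompt through the engine and return the reply text.
async function ask(engine: ChatEngine, userPrompt: string): Promise<string> {
  const request = buildChatRequest("You are a helpful assistant.", userPrompt);
  return firstReply(await engine.chat.completions.create(request));
}
```

In a real page, the engine passed to `ask` would presumably come from the WebLLM package in the linked repository rather than from a stub; the structural interface is used here only so that the calling code stays independent of how the engine is constructed.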



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU
cs.DC 2026-05 conditional novelty 7.0

LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browse...

VIGIL: An Extensible System for Real-Time Detection and Mitigation of Cognitive Bias Triggers
cs.CL 2026-03 conditional novelty 7.0

VIGIL is the first browser extension for real-time detection and mitigation of cognitive bias triggers, with scroll-synced highlighting, LLM reformulation, privacy tiers, and extensible validated plugins.