WebLLM: A High-Performance In-Browser LLM Inference Engine

2024 · cs.LG · arXiv 2412.15803

2 Pith papers cite this work.

abstract

Advancements in large language models (LLMs) have unlocked remarkable capabilities. While deploying these models typically requires server-grade GPUs and cloud-based inference, the recent emergence of smaller open-source models and increasingly powerful consumer devices has made on-device deployment practical. The web browser as a platform for on-device deployment is universally accessible, provides a natural agentic environment, and conveniently abstracts away the different backends from diverse device vendors. To address this opportunity, we introduce WebLLM, an open-source JavaScript framework that enables high-performance LLM inference entirely within web browsers. WebLLM provides an OpenAI-style API for seamless integration into web applications, and leverages WebGPU for efficient local GPU acceleration and WebAssembly for performant CPU computation. With the machine learning compilers MLC-LLM and Apache TVM, WebLLM leverages optimized WebGPU kernels, overcoming the absence of performant WebGPU kernel libraries. Evaluations show that WebLLM can retain up to 80% of native performance on the same device, with room to further close the gap. WebLLM paves the way for universally accessible, privacy-preserving, personalized, and locally powered LLM applications in web browsers. The code is available at: https://github.com/mlc-ai/web-llm.

representative citing papers

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

cs.DC · 2026-05-20 · conditional · novelty 7.0

LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browser frameworks on tested hardware.

VIGIL: An Extensible System for Real-Time Detection and Mitigation of Cognitive Bias Triggers

cs.CL · 2026-03-12 · conditional · novelty 7.0

VIGIL is the first browser extension for real-time detection and mitigation of cognitive bias triggers, with scroll-synced highlighting, LLM reformulation, privacy tiers, and extensible validated plugins.
