arxiv: 2412.15803 · v2 · submitted 2024-12-20 · 💻 cs.LG · cs.AI

WebLLM: A High-Performance In-Browser LLM Inference Engine

keywords: webllm, inference, models, webgpu, accessible, applications, browsers, deployment

Advancements in large language models (LLMs) have unlocked remarkable capabilities. While deploying these models typically requires server-grade GPUs and cloud-based inference, the recent emergence of smaller open-source models and increasingly powerful consumer devices has made on-device deployment practical. The web browser as a platform for on-device deployment is universally accessible, provides a natural agentic environment, and conveniently abstracts away the different backends from diverse device vendors. To address this opportunity, we introduce WebLLM, an open-source JavaScript framework that enables high-performance LLM inference entirely within web browsers. WebLLM provides an OpenAI-style API for seamless integration into web applications, and leverages WebGPU for efficient local GPU acceleration and WebAssembly for performant CPU computation. With the machine learning compilers MLC-LLM and Apache TVM, WebLLM leverages optimized WebGPU kernels, overcoming the absence of performant WebGPU kernel libraries. Evaluations show that WebLLM can retain up to 80% of native performance on the same device, with room to further close the gap. WebLLM paves the way for universally accessible, privacy-preserving, personalized, and locally powered LLM applications in web browsers. The code is available at: https://github.com/mlc-ai/web-llm.
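
To make "OpenAI-style API" concrete, the sketch below shows what calling such an interface from a web application might look like. It is an illustration under stated assumptions, not WebLLM's actual code: the `OpenAIStyleEngine` interface, the `buildChatRequest` and `ask` helpers, and the default `temperature` and `max_tokens` values are all hypothetical. Only the request shape (a `messages` array of role/content pairs, answered with a list of `choices`) follows the OpenAI chat-completions convention.

```typescript
// Message and request shapes follow the OpenAI chat-completions convention.
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

interface ChatRequest {
  messages: ChatMessage[];
  temperature?: number;
  max_tokens?: number;
}

interface ChatResponse {
  choices: { message: ChatMessage }[];
}

// Hypothetical structural type: any engine exposing chat.completions.create.
// This is an assumption for illustration, not a type taken from WebLLM.
interface OpenAIStyleEngine {
  chat: { completions: { create(req: ChatRequest): Promise<ChatResponse> } };
}

// Build a request from a user prompt and an optional system prompt.
// The sampling defaults here are arbitrary illustrative values.
function buildChatRequest(userPrompt: string, systemPrompt?: string): ChatRequest {
  const messages: ChatMessage[] = [];
  if (systemPrompt !== undefined) {
    messages.push({ role: "system", content: systemPrompt });
  }
  messages.push({ role: "user", content: userPrompt });
  return { messages, temperature: 0.7, max_tokens: 256 };
}

// Send the request to the engine and return the text of the first choice.
async function ask(
  engine: OpenAIStyleEngine,
  userPrompt: string,
  systemPrompt?: string,
): Promise<string> {
  const response = await engine.chat.completions.create(
    buildChatRequest(userPrompt, systemPrompt),
  );
  const first = response.choices[0];
  if (!first) {
    throw new Error("engine returned no choices");
  }
  return first.message.content;
}
```

Because `ask` depends only on the structural interface, the same application code could in principle target a local in-browser engine or a remote OpenAI-compatible endpoint. The repository linked above documents the library's actual entry points and model identifiers.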

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

    cs.DC 2026-05 conditional novelty 7.0

    LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browse...

  2. VIGIL: An Extensible System for Real-Time Detection and Mitigation of Cognitive Bias Triggers

    cs.CL 2026-03 conditional novelty 7.0

    VIGIL is the first browser extension for real-time detection and mitigation of cognitive bias triggers, with scroll-synced highlighting, LLM reformulation, privacy tiers, and extensible validated plugins.