
WebGPU + WebLLM: Running a 4B Parameter LLM in Chrome at 90 Tokens/Second

A deep dive into how MLC compilation transforms HuggingFace models into WebGPU shaders, enabling 30 curated LLMs to run entirely in the browser. We cover the full model catalog, Qwen3-4B's 97% on MATH-500, VRAM management, and real performance numbers across GPU tiers.

LocalMode

Two years ago, the idea of running a multi-billion parameter language model inside a browser tab would have been dismissed as impractical. The models were too large, the browser APIs too limited, and the performance gap with native inference too wide.

That gap has closed.

With WebGPU now shipping by default across Chrome, Edge, Firefox, and Safari, and the MLC-AI team's WebLLM engine reaching maturity at v0.2.82, browser-based LLM inference has crossed the threshold from demo to production. A 4-bit quantized Qwen3-4B model loads in under 30 seconds, occupies roughly 2.2GB of GPU memory, and generates text at 60-90+ tokens per second on modern hardware -- all inside a Chrome tab, with zero network requests after the initial download.

This post explains exactly how it works, what you can run, and what the real-world limitations are.


How WebLLM Turns a HuggingFace Model into WebGPU Shaders

The path from a HuggingFace model checkpoint to real-time browser inference involves three stages, all powered by the MLC-AI machine learning compilation framework built on Apache TVM.

Stage 1: Weight Conversion and Quantization

The process begins with a standard HuggingFace model (SafeTensors or PyTorch format). MLC converts the weights into an optimized layout and applies 4-bit quantization (q4f16_1), reducing a 4B parameter model from roughly 8GB (FP16) down to approximately 2.2GB. This quantization scheme stores weights in 4-bit integers while keeping activations in FP16, preserving most of the model's quality while cutting memory by 75%.
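The arithmetic behind that reduction is straightforward. A back-of-envelope estimate (this sketch assumes one FP16 scale per group of 32 weights; MLC's exact q4f16_1 layout differs in the details):

```typescript
// Rough VRAM estimate for 4-bit group-quantized weights: 4 bits per weight
// plus one FP16 scale per group (group size of 32 assumed for illustration).
function estimateQ4Bytes(params: number, groupSize = 32): number {
  const weightBytes = params * 0.5;              // 4 bits = half a byte per weight
  const scaleBytes = (params / groupSize) * 2;   // 2-byte FP16 scale per group
  return weightBytes + scaleBytes;
}

const fp16GB = (4e9 * 2) / 1e9;                  // 4B params at FP16: ~8 GB
const q4GB = estimateQ4Bytes(4e9) / 1e9;         // quantized: ~2.25 GB
console.log(fp16GB.toFixed(1), q4GB.toFixed(2));
```

The estimate lands close to the ~2.2GB figure quoted above once runtime buffers are set aside.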

Stage 2: Graph Compilation to WebGPU Shaders

This is where MLC diverges from typical inference frameworks. Instead of interpreting model operations at runtime, MLC compiles the entire computation graph ahead of time into optimized WebGPU Shading Language (WGSL) compute shaders. Each transformer layer -- attention, feed-forward, layer norm -- becomes a GPU kernel tuned for the target architecture.

The shader-f16 feature is particularly important: it enables native FP16 operations in WebGPU shaders rather than emulating them in FP32. Since GPU inference is typically memory-bandwidth limited, halving the data width can nearly double throughput.
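To see why, consider the roofline for single-stream decoding: each generated token must stream every weight through the GPU once, so throughput is capped at bandwidth divided by weight bytes. This is a simplification that ignores KV-cache reads and compute, and the 400 GB/s bandwidth figure below is illustrative:

```typescript
// Upper bound on decode speed when memory-bandwidth limited:
// tok/s <= memory bandwidth / bytes of weights read per token.
function rooflineTokS(bandwidthGBs: number, weightsGB: number): number {
  return bandwidthGBs / weightsGB;
}

console.log(Math.round(rooflineTokS(400, 8)));   // 4B model at FP16:   ~50 tok/s
console.log(Math.round(rooflineTokS(400, 2.2))); // same model at q4f16: ~182 tok/s
```

Halving (or quartering) the bytes moved per token raises the ceiling proportionally, which is exactly what quantization and native FP16 buy you.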

The compiled output is a WebAssembly module (the runtime orchestrator) plus a set of pre-compiled GPU shader programs. No JIT compilation happens in the browser -- everything is ready to execute on load.

Stage 3: Browser Runtime

When you call CreateMLCEngine() in the browser, WebLLM downloads the compiled WASM module and quantized weights, stores them in the Cache API for offline reuse, and initializes the WebGPU pipeline. Subsequent loads skip the download entirely, going straight from cache to GPU.

import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

// MLC-compiled model runs entirely on the GPU via WebGPU
const model = webllm.languageModel('Qwen3-4B-q4f16_1-MLC');

const result = await streamText({
  model,
  prompt: 'Explain the difference between TCP and UDP.',
});

for await (const chunk of result.stream) {
  // Node-style sink for brevity; in the browser, append chunk.text to the DOM instead
  process.stdout.write(chunk.text);
}

The key insight is that MLC compilation eliminates the interpreter overhead that plagues other browser-based approaches. Every matrix multiplication, every attention head, every activation function is a pre-optimized GPU dispatch -- not a generic WASM loop.


The Complete Model Catalog: 30 Curated Models

LocalMode's @localmode/webllm package ships with 30 curated, pre-compiled models spanning seven model families. Every model uses 4-bit quantization (q4f16_1) for efficient VRAM usage. Here is the full catalog.

Tiny Models (under 500MB VRAM)

| Model ID | Parameters | VRAM | Context | Best For |
|---|---|---|---|---|
| SmolLM2-135M | 135M | ~78MB | 2K | Instant loading, prototyping |
| SmolLM2-360M | 360M | ~210MB | 2K | Ultra-fast responses |
| Qwen 2.5 0.5B | 0.5B | ~278MB | 4K | Small but capable |
| Qwen 3 0.6B | 0.6B | ~350MB | 4K | Latest-gen tiny model |
| TinyLlama 1.1B | 1.1B | ~400MB | 2K | Fast general chat |

These models load in seconds and work well on integrated GPUs and lower-end hardware. They are ideal for autocomplete, simple Q&A, and testing during development.

Small Models (500MB - 1GB VRAM)

| Model ID | Parameters | VRAM | Context | Best For |
|---|---|---|---|---|
| Llama 3.2 1B | 1B | ~712MB | 4K | General tasks, fast inference |
| Qwen 2.5 1.5B | 1.5B | ~868MB | 4K | Multilingual, Chinese support |
| Qwen 2.5 Coder 1.5B | 1.5B | ~868MB | 4K | Code generation |

The Llama 3.2 1B model is the recommended starting point for testing and development. It downloads quickly and provides surprisingly good quality for its size, scoring 49.3 on MMLU (5-shot) according to Meta's benchmarks. For comparison, the 3B variant scores 63.4.

Medium Models (1GB - 2GB VRAM)

| Model ID | Parameters | VRAM | Context | Best For |
|---|---|---|---|---|
| Qwen 3 1.7B | 1.7B | ~1.1GB | 4K | Multilingual, thinking mode |
| SmolLM2 1.7B | 1.7B | ~1GB | 2K | Best small model (shader-f16) |
| Gemma 2 2B | 2B | ~1.44GB | 2K | Google quality (shader-f16) |
| Qwen 2.5 3B | 3B | ~1.7GB | 4K | High-quality multilingual |
| Qwen 2.5 Coder 3B | 3B | ~1.7GB | 4K | Mid-range code model |
| Llama 3.2 3B | 3B | ~1.76GB | 4K | Excellent general purpose |
| Hermes 3 Llama 3.2 3B | 3B | ~1.76GB | 4K | Enhanced chat fine-tune |
| Ministral 3 3B | 3B | ~1.8GB | 4K | Latest Mistral architecture |
| Ministral 3 3B Reasoning | 3B | ~1.8GB | 4K | Reasoning-tuned |

This tier offers the best balance of quality and resource usage for production browser applications. Llama 3.2 3B delivers strong instruction-following (77.4 IFEval) and general knowledge, while the Ministral 3B models bring Mistral's latest architecture to the browser.

Large Models (over 2GB VRAM)

| Model ID | Parameters | VRAM | Context | Best For |
|---|---|---|---|---|
| Phi 3.5 Mini | 3.8B | ~2.1GB | 4K | Reasoning, coding |
| Phi 3 Mini 4K | 3.8B | ~2.2GB | 4K | Microsoft Phi reasoning |
| Phi 3.5 Vision | 3.8B | ~2.4GB | 1K | Multimodal (text + images) |
| Qwen 3 4B | 4B | ~2.2GB | 4K | Best medium-range quality |
| Mistral 7B v0.3 | 7B | ~4GB | 4K | Strong general-purpose |
| Qwen 2.5 7B | 7B | ~4GB | 4K | Excellent multilingual |
| Qwen 2.5 Coder 7B | 7B | ~4GB | 4K | Best-in-class browser code model |
| DeepSeek R1 Distill Qwen 7B | 7B | ~4.18GB | 4K | Advanced reasoning |
| DeepSeek R1 Distill Llama 8B | 8B | ~4.41GB | 4K | Strongest reasoning |
| Hermes 3 Llama 3.1 8B | 8B | ~4.9GB | 4K | DPO-optimized chat |
| Llama 3.1 8B | 8B | ~4.5GB | 4K | Meta's flagship 8B |
| Qwen 3 8B | 8B | ~4.5GB | 4K | Highest-quality multilingual |
| Gemma 2 9B | 9B | ~5GB | 1K | Google's best quality |

These models require dedicated GPU memory (4GB+ VRAM) and are best suited for desktop browsers with discrete GPUs. The 7B-9B class models deliver quality that was server-only territory just a year ago.


Qwen3-4B: The Sweet Spot for Browser LLMs

Among the 30 models in the catalog, Qwen3-4B stands out as the best balance of capability and browser viability. At just 2.2GB quantized, it fits comfortably in the GPU memory of most modern laptops and desktops.

Benchmark Highlights

The numbers from the Qwen3 Technical Report and Open Laboratory evaluation tell the story:

| Benchmark | Qwen3-4B | What It Measures |
|---|---|---|
| MMLU-Redux | 83.7% | Broad knowledge across 57 subjects |
| MATH-500 | 97.0% | Competition-level mathematics |
| C-Eval | 77.5% | Chinese language understanding |
| MLogiQA | 65.9% | Logical reasoning |
| RULER | 85.2 | Long-context retrieval (non-thinking) |

A 97% score on MATH-500 from a 4-billion parameter model running inside a browser tab is remarkable. For context, the Qwen team notes that "even a tiny model like Qwen3-4B can rival the performance of Qwen2.5-72B-Instruct" -- a model 18 times its size.

Thinking Mode

Qwen3-4B supports dual inference modes. In non-thinking mode, the model generates responses directly -- fast and efficient. In thinking mode, the model produces internal chain-of-thought reasoning before its final answer, significantly improving performance on math, logic, and multi-step problems. The MATH-500 score of 97.0% leverages this thinking capability.

The trade-off is latency: thinking mode generates more tokens (the reasoning chain plus the answer), so wall-clock time increases. For interactive chat, non-thinking mode is typically preferred. For tasks where accuracy matters more than speed -- code generation, data extraction, math tutoring -- thinking mode is worth the extra tokens.
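In practice, thinking-mode output arrives with the reasoning wrapped in `<think>...</think>` tags ahead of the final answer (the tag convention Qwen3 uses; verify against the model's chat template). A small helper to hide the chain-of-thought from end users:

```typescript
// Remove <think>…</think> reasoning blocks, keeping only the final answer.
function stripThinking(text: string): string {
  return text.replace(/<think>[\s\S]*?<\/think>/g, '').trim();
}

const raw = '<think>Need to work step by step…</think>The answer is 42.';
console.log(stripThinking(raw)); // "The answer is 42."
```

Some UIs instead render the reasoning in a collapsible panel, which the same split makes easy.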

Qwen3.5-4B via ONNX

The newer Qwen3.5-4B model (scoring 88.8% on MMLU-Redux per its model card) is available through the @localmode/transformers ONNX provider. The WebLLM catalog uses MLC-compiled Qwen3-4B. Both run in the browser; the choice depends on whether you prefer WebGPU-native (WebLLM) or the ONNX runtime with WebGPU acceleration (Transformers.js v4).


Performance: Real Numbers by GPU Tier

Browser LLM performance depends on three factors: model size, GPU capability, and available VRAM. Based on WebLLM benchmarks and community reports, here are representative token generation speeds.

Performance varies

All numbers below are approximate and vary significantly by hardware, browser version, driver, thermal throttling, and concurrent GPU workload. Treat these as directional ranges, not guarantees.

Tokens per Second by Hardware Tier

| Hardware | 1B Models | 3B Models | 4B Models | 7-8B Models |
|---|---|---|---|---|
| Apple M3 Max (40-core GPU) | 120-180 tok/s | 60-90 tok/s | 50-80 tok/s | 35-50 tok/s |
| Apple M1/M2 (8-core GPU) | 60-100 tok/s | 30-50 tok/s | 25-40 tok/s | 15-25 tok/s |
| RTX 4070+ (12GB VRAM) | 130-200 tok/s | 70-100 tok/s | 60-90 tok/s | 40-60 tok/s |
| RTX 3060 (8GB VRAM) | 80-120 tok/s | 40-60 tok/s | 35-50 tok/s | 20-35 tok/s |
| Integrated GPU (Intel/AMD) | 30-60 tok/s | 15-30 tok/s | 10-20 tok/s | Not recommended |

WebLLM on an M3 Max achieves roughly 70-80% of native llama.cpp performance for equivalent models, according to published benchmarks. Phi 3.5 Mini has been measured at 71 tok/s and Llama 3.1 8B at 41 tok/s on M3 Max hardware -- and smaller models scale significantly faster.

The "90 tokens/second" in this post's title reflects the upper range achievable with a 3B-4B model on high-end consumer hardware with discrete GPU. It is not a universal number -- but it is a real, reproducible result on hardware that millions of developers already own.


VRAM Management: What Happens When Memory Runs Out

Understanding GPU memory is critical for browser LLM deployment. Unlike server-side inference where you control the hardware, browser users bring wildly varying GPU capabilities.

Where VRAM Goes

For a 4-bit quantized model, VRAM consumption breaks down roughly as:

| Component | Qwen3-4B (2.2GB total) | Llama 3.1 8B (4.5GB total) |
|---|---|---|
| Model weights (4-bit) | ~1.8GB | ~3.8GB |
| KV cache (4K context) | ~200MB | ~400MB |
| Activation buffers | ~200MB | ~300MB |

The KV cache grows linearly with conversation length. At 4K tokens of context, it is manageable. At 32K tokens, a Qwen 3 4B model's KV cache alone could consume over 1.5GB, which is why the WebLLM catalog limits context to 4K tokens for browser deployment.
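The linear growth is easy to quantify: per token, the cache stores one key and one value vector for every layer. The shapes below are hypothetical, chosen so the totals line up with the ~200MB-at-4K figure above; they are not Qwen3-4B's confirmed configuration, and engines may additionally compress the cache:

```typescript
// KV cache cost per token: 2 tensors (K and V) × layers × kvHeads × headDim × element size.
function kvBytesPerToken(layers: number, kvHeads: number, headDim: number, bytesPerElem = 2): number {
  return 2 * layers * kvHeads * headDim * bytesPerElem;
}

// Hypothetical shapes for illustration: 24 layers, 4 KV heads of dim 128, FP16.
const perTok = kvBytesPerToken(24, 4, 128);                           // 49,152 bytes (~48 KB)
console.log(((perTok * 4096) / 1e6).toFixed(0), 'MB at 4K tokens');   // ~201 MB
console.log(((perTok * 32768) / 1e9).toFixed(2), 'GB at 32K tokens'); // ~1.61 GB
```

Whatever the exact shapes, the 8x jump from 4K to 32K context is what makes long contexts expensive in the browser.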

When VRAM Is Insufficient

If the model's total memory requirement exceeds available GPU VRAM, WebLLM will fail to load the model and throw a MODEL_LOAD_FAILED error. Unlike native runtimes such as llama.cpp that can split layers between GPU and CPU, WebGPU requires the entire model to fit in GPU memory. There is no partial offloading.

This is why model selection matters. The catalog spans 78MB to 5GB precisely so developers can target the right model for the user's hardware:

import { isWebGPUSupported, getStorageQuota } from '@localmode/core';
import { webllm } from '@localmode/webllm';

// Check capabilities before loading
if (!await isWebGPUSupported()) {
  // Fall back to wllama (WASM) or transformers (ONNX)
  console.warn('WebGPU not available, using WASM fallback');
}

// Choose a model based on available resources
const quota = await getStorageQuota();
const modelId = quota.available > 4 * 1024 * 1024 * 1024
  ? 'Qwen3-4B-q4f16_1-MLC'                 // 2.2GB - high-end
  : 'Llama-3.2-1B-Instruct-q4f16_1-MLC';   // 712MB - safe default

const model = webllm.languageModel(modelId);

Structured Output: generateObject() with WebLLM

Beyond free-form text generation, WebLLM models can produce validated JSON objects using generateObject(). This is powered by constrained decoding with automatic retry -- the model generates JSON, LocalMode validates it against a Zod schema, and retries with the validation error appended to the prompt if it fails.

import { generateObject, jsonSchema } from '@localmode/core';
import { webllm } from '@localmode/webllm';
import { z } from 'zod';

const model = webllm.languageModel('Qwen3-4B-q4f16_1-MLC');

const { object } = await generateObject({
  model,
  schema: jsonSchema(z.object({
    name: z.string(),
    email: z.string().email(),
    company: z.string().optional(),
    sentiment: z.enum(['positive', 'neutral', 'negative']),
  })),
  prompt: 'Extract contact info and sentiment: "Love working with Sarah at sarah@acme.co, Acme Corp is great!"',
});

console.log(object);
// { name: "Sarah", email: "sarah@acme.co", company: "Acme Corp", sentiment: "positive" }

This runs entirely in the browser. No API call, no server-side validation, no data transmitted. The Qwen3-4B and Phi 3.5 Mini models handle structured output particularly well due to their strong instruction-following capabilities.
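The retry mechanism behind generateObject() can be sketched generically. This is an illustrative reimplementation, not the actual @localmode/core code; `generate` stands in for a model call and `validate` for schema checking:

```typescript
type Generate = (prompt: string) => Promise<string>;

// Generate JSON, validate it, and retry with the error fed back into the prompt.
async function generateJsonWithRetry<T>(
  generate: Generate,
  prompt: string,
  validate: (value: unknown) => T,   // throws on invalid data
  maxAttempts = 3,
): Promise<T> {
  let currentPrompt = prompt;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const raw = await generate(currentPrompt);
    try {
      return validate(JSON.parse(raw));
    } catch (err) {
      // Append the failure so the model can self-correct on the next pass.
      currentPrompt = `${prompt}\n\nPrevious output was invalid (${String(err)}). Respond with valid JSON only.`;
    }
  }
  throw new Error('Could not produce valid JSON');
}
```

The same loop works with any validator; a Zod schema's parse() slots directly into the `validate` parameter.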


The LLM Chat Showcase App

The LLM Chat showcase app demonstrates everything discussed in this post. It provides a multi-model chat interface where you can:

  • Select from all 30 WebLLM models (plus wllama GGUF and Transformers ONNX backends)
  • Stream responses in real time with cancellation support
  • Send images to the Phi 3.5 Vision model for multimodal understanding
  • Enable semantic caching to instantly replay similar prompts
  • Toggle between agent mode (ReAct reasoning with tool use) and direct chat
  • Export conversation history as JSON

The app is built with @localmode/react's useChat hook, which wraps streamText() with React state management, message persistence, and AbortSignal cancellation:

import { useChat } from '@localmode/react';
import { webllm } from '@localmode/webllm';

function ChatApp() {
  const model = webllm.languageModel('Qwen3-4B-q4f16_1-MLC');
  const { messages, isStreaming, send, cancel } = useChat({
    model,
    systemPrompt: 'You are a helpful assistant.',
  });

  return (
    <div>
      {messages.map((m) => (
        <p key={m.id}>{m.role}: {m.content}</p>
      ))}
      <button onClick={() => send('Hello!')}>Send</button>
      {isStreaming && <button onClick={cancel}>Stop</button>}
    </div>
  );
}

Limitations and Requirements

WebGPU Browser Support

WebGPU is required for WebLLM. As of early 2026, browser support has reached critical mass:

| Browser | WebGPU Status |
|---|---|
| Chrome 113+ | Stable since May 2023 |
| Edge 113+ | Stable since May 2023 |
| Safari 26+ (macOS Tahoe 26, iOS 26) | Stable since September 2025 |
| Firefox 141+ (Windows) | Stable since July 2025 |
| Firefox (macOS ARM64) | Stable since Firefox 145 |
| Firefox (Linux) | Nightly only |

For users without WebGPU, LocalMode offers two alternative providers that work everywhere: @localmode/wllama (llama.cpp via WebAssembly) and @localmode/transformers (ONNX with WASM fallback). You can detect WebGPU support and choose accordingly:

import { isWebGPUSupported } from '@localmode/core';
import { webllm } from '@localmode/webllm';
import { wllama } from '@localmode/wllama';

const hasGPU = await isWebGPUSupported();

const model = hasGPU
  ? webllm.languageModel('Qwen3-4B-q4f16_1-MLC')
  : wllama.languageModel('bartowski/Qwen3-4B-GGUF:Qwen3-4B-Q4_K_M.gguf');
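Under the hood, detection like this boils down to the standard WebGPU entry points. A minimal version without library helpers, including an optional check for the shader-f16 feature discussed earlier:

```typescript
// Browser-only: navigator.gpu is the standard WebGPU entry point, and
// requestAdapter() resolves to null when no suitable GPU backend exists.
async function detectWebGPU(): Promise<{ supported: boolean; f16: boolean }> {
  const gpu = (globalThis.navigator as any)?.gpu;
  if (!gpu) return { supported: false, f16: false };
  const adapter = await gpu.requestAdapter();
  if (!adapter) return { supported: false, f16: false };
  return { supported: true, f16: adapter.features.has('shader-f16') };
}
```

When f16 is unavailable, WebLLM models compiled with shader-f16 will not load, so checking it up front avoids a failed multi-gigabyte download.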

Initial Load Time

The first time a user loads a model, the full quantized weights must be downloaded. For Qwen3-4B, that is approximately 2.2GB. On a 50 Mbps connection, expect roughly 6 minutes. On gigabit, under 30 seconds. After the first download, the model is cached in the browser's Cache API and loads from disk in seconds.
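Those estimates come from simple link arithmetic, ignoring TCP slow-start, CDN throttling, and parallel shard downloads:

```typescript
// First-load estimate: gigabytes → gigabits, divided by link speed in Mbps.
function downloadSeconds(sizeGB: number, linkMbps: number): number {
  return (sizeGB * 8 * 1000) / linkMbps;
}

console.log(Math.round(downloadSeconds(2.2, 50) / 60), 'min at 50 Mbps'); // ~6 min
console.log(Math.round(downloadSeconds(2.2, 1000)), 's on gigabit');      // ~18 s
```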

Use preloadModel() during app initialization or behind a loading screen to avoid surprising users:

import { preloadModel, isModelCached } from '@localmode/webllm';

if (!(await isModelCached('Qwen3-4B-q4f16_1-MLC'))) {
  await preloadModel('Qwen3-4B-q4f16_1-MLC', {
    onProgress: (p) => updateLoadingBar(p.progress),
  });
}

Context Length

The WebLLM catalog caps context at 1K-4K tokens to keep VRAM usage predictable. This is shorter than the native context lengths of these models (Qwen3-4B natively supports 32K+). For applications needing longer context, consider RAG (retrieval-augmented generation) to inject relevant snippets rather than entire documents.
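The RAG pattern at its simplest: score stored chunks against the query and inject only the best fit into the limited context window. A toy keyword-overlap scorer for illustration (real systems would use an embedding model, which can also run locally):

```typescript
// Pick the chunk sharing the most terms with the query, then inject it into
// the prompt instead of the whole document to stay within the 4K budget.
function topChunk(query: string, chunks: string[]): string {
  const terms = new Set(query.toLowerCase().split(/\W+/).filter(Boolean));
  let best = chunks[0];
  let bestScore = -1;
  for (const chunk of chunks) {
    const score = chunk.toLowerCase().split(/\W+/).filter((w) => terms.has(w)).length;
    if (score > bestScore) { best = chunk; bestScore = score; }
  }
  return best;
}

const chunks = [
  'Refunds are processed within 5 business days.',
  'Shipping is free on orders over $50.',
];
console.log(topChunk('How long do refunds take?', chunks));
```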

Single Model at a Time

WebGPU allocates GPU memory per model. Running two 4B models simultaneously would require 4.4GB+ of VRAM. In practice, load one model at a time and call model.unload() before switching. The @localmode/webllm implementation handles this automatically when you create a new engine.


What This Means for Application Developers

Browser LLMs are not a replacement for GPT-4o or Claude on complex tasks. They are a new deployment target with a fundamentally different trade-off: zero marginal cost, complete privacy, offline capability, and no infrastructure -- in exchange for smaller model sizes and GPU requirements on the client.

The use cases where this trade-off makes sense are growing rapidly: local document Q&A, structured data extraction, code assistance, content summarization, form autofill, smart autocomplete, and agentic workflows that chain multiple local models together.

With 30 models spanning 78MB to 5GB, WebGPU support across all major browsers, and generation speeds that reach 60-90+ tokens per second on modern hardware, the browser is no longer a toy environment for AI. It is a viable production platform.


Methodology

All benchmark scores cited in this post are sourced from official model cards, technical reports, and published evaluations:



Try it yourself

Visit localmode.ai to try 30+ AI demo apps running entirely in your browser. No sign-up, no API keys, no data leaves your device.

Read the Getting Started guide to add local AI to your application in under 5 minutes.