What are the four browser LLM providers in LocalMode?

LocalMode ships WebLLM (MLC-compiled, WebGPU, fastest inference), Transformers.js v4 (ONNX, broadest model selection including vision), wllama (llama.cpp WASM, runs in every browser without WebGPU, 160K+ GGUF models), and LiteRT-LM (Google's on-device engine, early preview with Gemma 4 models).

Do I need to change my code when switching between LLM providers?

No. All four providers implement the same LanguageModel interface from @localmode/core. Your generateText() and streamText() calls remain identical -- only the model import changes. Middleware, AbortSignal, and React hooks all work unchanged across providers.

Which browser LLM provider should I choose?

Choose WebLLM for maximum speed with WebGPU (60-100 tokens/sec). Choose Transformers.js v4 for the broadest model catalog and vision support. Choose wllama for universal browser compatibility without WebGPU via WASM. Choose LiteRT-LM specifically for Google's Gemma 4 models.

Can I set up automatic fallback between LLM providers?

Yes. Because all providers share the same interface, you can wrap model creation in a try/catch chain that falls back from one provider to another. For example, try WebLLM first for WebGPU speed, then fall back to wllama for WASM compatibility if WebGPU is unavailable.

Browser LLM Providers, One API: WebLLM, Transformers.js v4, wllama, and LiteRT-LM

Running a large language model inside the browser used to mean picking a single runtime and coupling your entire application to it. If you chose WebLLM, you were locked into MLC-compiled models and WebGPU-only hardware. If you went with llama.cpp WASM, you accepted slower inference on CPU. And switching later meant rewriting every call site.

LocalMode eliminates that trade-off. The @localmode/core package defines a single LanguageModel interface. Four separate provider packages -- @localmode/webllm, @localmode/transformers, @localmode/wllama, and @localmode/litert -- each implement that interface using a different engine. Your application code calls generateText() or streamText() exactly the same way no matter which provider is active. Swap a single import, and the engine changes. Your UI, your prompts, your streaming logic -- none of it moves.

This post breaks down how each provider works, what it is best at, and how to choose between them. We also show how to set up automatic fallback so your app works on every browser without a single if statement in your components. Three of the four providers -- WebLLM, Transformers.js v4, and wllama -- are production-stable; the fourth, LiteRT-LM, is a new early-preview addition covered in its own section below.

The Shared Interface

Every provider implements this interface from @localmode/core:

interface LanguageModel {
  readonly modelId: string;
  readonly provider: string;
  readonly contextLength: number;
  readonly supportsVision?: boolean;

  doGenerate(options: DoGenerateOptions): Promise<DoGenerateResult>;
  doStream?(options: DoStreamOptions): AsyncIterable<StreamChunk>;
}

The core functions generateText() and streamText() accept any LanguageModel. They handle retries, abort signals, usage tracking, and response metadata. The provider handles model loading, tokenization, and inference. Clean separation.
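
To make that split concrete, here is a minimal sketch of a toy provider and a toy core function. The option and result shapes (prompt, maxTokens, text, usage), the EchoModel class, and generateTextSketch() are simplified assumptions for illustration only; the real types in @localmode/core are richer than this.

```typescript
// Simplified stand-ins for the core types; assumptions, not the real definitions.
interface DoGenerateOptions {
  prompt: string;
  maxTokens?: number;
}
interface DoGenerateResult {
  text: string;
  usage: { inputTokens: number; outputTokens: number };
}
interface LanguageModel {
  readonly modelId: string;
  readonly provider: string;
  readonly contextLength: number;
  doGenerate(options: DoGenerateOptions): Promise<DoGenerateResult>;
}

// Hypothetical provider: echoes the prompt back, truncated to maxTokens words.
class EchoModel implements LanguageModel {
  readonly modelId: string;
  readonly provider = 'echo';
  readonly contextLength = 2048;

  constructor(modelId: string) {
    this.modelId = modelId;
  }

  async doGenerate({ prompt, maxTokens = 16 }: DoGenerateOptions): Promise<DoGenerateResult> {
    const words = prompt.split(/\s+/).filter(Boolean);
    const out = words.slice(0, maxTokens);
    return {
      text: out.join(' '),
      usage: { inputTokens: words.length, outputTokens: out.length },
    };
  }
}

// Toy core function: it knows nothing about the engine, only the interface.
async function generateTextSketch(args: { model: LanguageModel } & DoGenerateOptions) {
  const { model, ...options } = args;
  const started = Date.now();
  const result = await model.doGenerate(options);
  return {
    ...result,
    response: { modelId: model.modelId, durationMs: Date.now() - started },
  };
}
```

Any object that satisfies the interface can be passed as the model, which is why swapping providers does not touch the call site.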

Here is the same streaming call with all four providers:

import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';
import { transformers } from '@localmode/transformers';
import { wllama } from '@localmode/wllama';
import { litert } from '@localmode/litert';

// Option A: WebLLM (WebGPU, MLC-compiled)
const resultA = await streamText({
  model: webllm.languageModel('Llama-3.2-3B-Instruct-q4f16_1-MLC'),
  prompt: 'Explain quantum tunneling in plain English',
  maxTokens: 300,
});

// Option B: Transformers.js v4 (ONNX, WebGPU + WASM fallback)
const resultB = await streamText({
  model: transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX'),
  prompt: 'Explain quantum tunneling in plain English',
  maxTokens: 300,
});

// Option C: wllama (GGUF, llama.cpp WASM)
const resultC = await streamText({
  model: wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M'),
  prompt: 'Explain quantum tunneling in plain English',
  maxTokens: 300,
});

// Option D: LiteRT-LM (.litertlm, Google's on-device engine -- early preview)
const resultD = await streamText({
  model: litert.languageModel('qwen3-0.6B'),
  prompt: 'Explain quantum tunneling in plain English',
  maxTokens: 300,
});

// All four return the same StreamTextResult shape:
for await (const chunk of resultA.stream) {
  console.log(chunk.text); // identical API (process.stdout is not available in browsers)
}

The return type is always StreamTextResult with .stream, .text, .usage, and .response. The chunk shape is always StreamChunk with .text, .done, and optional .finishReason and .usage on the final chunk. No provider-specific handling required.
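
As a rough sketch of what provider-agnostic consumption looks like, the helper below accumulates text and reads the metadata off the final chunk. The fakeStream() generator and collect() helper are hypothetical, and the usage field names are an assumption; only the chunk fields named above come from this post.

```typescript
// Chunk shape as described above; the usage field names are a simplified assumption.
interface StreamChunk {
  text: string;
  done: boolean;
  finishReason?: string;
  usage?: { inputTokens: number; outputTokens: number };
}

// Hypothetical stand-in for a provider stream: a few text chunks, then a final one.
async function* fakeStream(parts: string[]): AsyncGenerator<StreamChunk> {
  for (const text of parts) {
    yield { text, done: false };
  }
  yield {
    text: '',
    done: true,
    finishReason: 'stop',
    usage: { inputTokens: 3, outputTokens: parts.length },
  };
}

// Provider-agnostic consumer: accumulates text and picks up metadata from the final chunk.
async function collect(
  stream: AsyncIterable<StreamChunk>,
  onText: (text: string) => void = () => {},
) {
  let text = '';
  let finishReason: string | undefined;
  let usage: StreamChunk['usage'];
  for await (const chunk of stream) {
    if (chunk.text) {
      text += chunk.text;
      onText(chunk.text); // e.g. append to a DOM node
    }
    if (chunk.done) {
      finishReason = chunk.finishReason;
      usage = chunk.usage;
    }
  }
  return { text, finishReason, usage };
}
```

The same collect() call would work on a stream from any of the four providers, since none of them changes the chunk shape.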

Provider 1: WebLLM (Fastest Inference)

Package: @localmode/webllm | Engine: MLC-AI WebLLM | Runtime: WebGPU

WebLLM compiles models through the MLC (Machine Learning Compilation) pipeline into WebGPU-optimized shaders. The result is the fastest in-browser LLM inference available today -- published benchmarks show Phi 3.5 Mini at 71 tokens/second and Llama 3.1 8B at 41 tokens/second on an M3 Max.

How it works: Models are pre-compiled into MLC format with 4-bit quantization (q4f16). At runtime, WebLLM loads the compiled model, creates WebGPU compute pipelines, and runs inference entirely on the GPU. The JavaScript API is OpenAI-compatible under the hood, which LocalMode wraps into the standard LanguageModel interface.

Curated catalog: 32 models ranging from SmolLM2 135M (78MB) to Qwen 3.5 9B (5.06GB). Includes 1 vision model (Phi 3.5 Vision). Model families include Qwen 2.5/3/3.5, Llama 3.1/3.2, Phi 3/3.5, Gemma 2, DeepSeek R1, Mistral, Ministral, and SmolLM2.

Best for: Applications where inference speed matters most -- interactive chat, real-time code completion, streaming agents. If your users have Chrome 113+, Edge 113+, or Safari 26+ with a discrete or integrated GPU, WebLLM delivers the best experience.

Limitation: Requires WebGPU. Firefox support depends on version and platform: stable releases ship WebGPU in 141+ on Windows and 147+ on macOS Apple Silicon. Older devices without GPU support cannot run WebLLM at all. The model catalog is fixed to MLC-compiled variants -- you cannot bring an arbitrary model file.

Provider 2: Transformers.js v4 (Broadest Model Selection)

Package: @localmode/transformers | Engine: HuggingFace Transformers.js v4 | Runtime: WebGPU with WASM fallback

Transformers.js v4 runs ONNX-format models using ONNX Runtime Web. It prefers WebGPU when available but automatically falls back to WASM, meaning it works on every modern browser. All @localmode/transformers implementations - embeddings, classification, speech-to-text, text generation, and more - use a unified @huggingface/transformers@^4.2.0 dependency.

How it works: Models are loaded from HuggingFace Hub in ONNX format with q4 quantization. For standard text models (SmolLM2, Phi, Qwen3, Granite), Transformers.js v4 uses its text-generation pipeline. For multimodal models (Qwen3.5, Gemma 4), it uses dedicated model classes with a split architecture (embed_tokens, vision_encoder, decoder). The loading strategy is auto-detected from the model ID.

Curated catalog: 16 recommended ONNX models from 120MB (Granite 4.0 350M) to 3GB (Gemma 4 E4B). Includes 5 vision-capable models (Qwen3.5-0.8B, Qwen3.5-2B, Qwen3.5-4B, Gemma 4 E2B, Gemma 4 E4B). But any ONNX model on HuggingFace Hub can be loaded -- you are not restricted to the curated list.

Best for: Applications that need vision input (image + text), broad model selection, or the ability to load custom ONNX models. Also the right choice when you want WebGPU speed where available but need guaranteed WASM fallback on older hardware.

Limitation: ONNX model availability on HuggingFace lags behind GGUF. Inference on WebGPU is also slower than WebLLM because of ONNX Runtime overhead (40-60 tok/s vs 60-100 tok/s).

Provider 3: wllama (Universal Browser Support)

Package: @localmode/wllama | Engine: wllama v3 (llama.cpp compiled to WASM) | Runtime: WASM (CPU) with optional WebGPU acceleration

wllama is llama.cpp compiled to WebAssembly. It runs any standard GGUF model file on the CPU without requiring WebGPU, WebNN, or any GPU API at all. With wllama v3, optional WebGPU acceleration is available via useWebGPU and nGpuLayers for faster inference on capable devices. This makes it the most universally compatible provider -- if the browser supports WASM (which is every modern browser since 2017), wllama works.

How it works: The wllama library loads GGUF files directly from HuggingFace URLs, initializes a llama.cpp WASM instance, and runs inference token by token. Multi-threaded execution is available when the page is cross-origin isolated, meaning it is served with Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp, which enables SharedArrayBuffer. Without cross-origin isolation, wllama falls back to single-threaded mode -- functional but 2-4x slower.
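
The isolation requirement can be sketched as a pair of response headers plus a small decision helper. The pickThreadCount() function and its cap of 8 threads are hypothetical choices for illustration, and how the resulting number is handed to the provider is not shown here because that option is not covered in this post.

```typescript
// The two response headers named above; serving the page with them enables SharedArrayBuffer.
const CROSS_ORIGIN_ISOLATION_HEADERS: Record<string, string> = {
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Cross-Origin-Embedder-Policy': 'require-corp',
};

// Hypothetical helper: decide how many threads to request.
// In a browser, pass `self.crossOriginIsolated` and `navigator.hardwareConcurrency`.
function pickThreadCount(
  isolated: boolean,
  hardwareConcurrency: number,
  maxThreads = 8, // assumed cap, chosen only for this sketch
): number {
  if (!isolated) return 1; // no SharedArrayBuffer, so single-threaded only
  const cores = Math.max(1, Math.floor(hardwareConcurrency) || 1);
  return Math.min(cores, maxThreads);
}
```

Checking the isolation flag at runtime is useful because a missing header degrades performance silently rather than failing outright.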

Curated catalog: 30 models (25 language + 3 embedding + 2 reranker) from SmolLM2 135M (70MB) to Gemma 4 E4B (5.41GB). The language models span Q4_K_M text models and four vision-language models (Holo2 4B/8B for UI grounding, Gemma 4 E2B/E4B for vision + tool calling). wllama v3 adds embeddings via wllama.embedding(), optional WebGPU acceleration, OAI-compatible tool calling, vision input via mmprojUrl, and native Jinja chat templates. But the real story is the open ecosystem: wllama accepts any GGUF URL, which means access to the 160,000+ GGUF models on HuggingFace Hub. Any model that bartowski, TheBloke, or the original authors have quantized to GGUF is one URL away.

Best for: Maximum compatibility, maximum model selection, and environments where WebGPU is unavailable (Firefox versions or platforms without WebGPU, older Safari, corporate locked-down browsers). Also the right choice when you need a specific fine-tuned or quantized model that only exists in GGUF format.

Limitation: CPU-only inference is inherently slower (5-15 tok/s), though WebGPU acceleration closes the gap on supported browsers. The 2GB ArrayBuffer limit in browsers means large models must be split into chunks (wllama handles this automatically for catalog models). Single-threaded mode, which is what you get without cross-origin isolation headers, is noticeably slow.

Provider 4: LiteRT-LM (Google On-Device Engine -- Early Preview)

Package: @localmode/litert | Engine: Google LiteRT-LM via @litert-lm/core | Runtime: WebGPU with CPU WASM fallback

LiteRT-LM is the on-device inference engine Google uses to ship Gemini Nano-class models to Android and ChromeOS. The @litert-lm/core package compiles it to WebAssembly, and @localmode/litert wraps it behind the standard LanguageModel interface. It loads models in the .litertlm format -- a single artifact bundling tokenizer, weights, and metadata.

How it works: The provider downloads and caches the .litertlm file, creates a LiteRT-LM Engine (WebGPU backend by default), and runs a Conversation for generation. The Gemma 4 builds are GPU-compiled, so the provider checks WebGPU availability up front and fails fast with a clear error if it is missing. For portable models (Qwen3 0.6B), if GPU streaming load is unsupported it retries once on the CPU backend automatically.
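
The load strategy described above can be sketched roughly as follows. Everything here is hypothetical: the ModelSpec shape, the gpuOnly flag, and the injected createEngine() function stand in for the real provider internals, and this version retries on any GPU load error rather than only on an unsupported streaming load.

```typescript
type Backend = 'gpu' | 'cpu';

// Hypothetical shapes: `gpuOnly` marks builds like the Gemma 4 web artifacts,
// and `createEngine` stands in for whatever actually constructs the engine.
interface ModelSpec {
  id: string;
  gpuOnly: boolean;
}
type CreateEngine<E> = (spec: ModelSpec, backend: Backend) => Promise<E>;

async function loadWithBackendFallback<E>(
  spec: ModelSpec,
  hasWebGPU: boolean,
  createEngine: CreateEngine<E>,
): Promise<{ engine: E; backend: Backend }> {
  if (spec.gpuOnly && !hasWebGPU) {
    // Fail fast, up front, with a clear message.
    throw new Error(`${spec.id} requires WebGPU, which is not available in this browser`);
  }
  if (hasWebGPU) {
    try {
      return { engine: await createEngine(spec, 'gpu'), backend: 'gpu' };
    } catch (err) {
      if (spec.gpuOnly) throw err; // no CPU build to fall back to
    }
  }
  // Portable model: a single attempt on the CPU backend.
  return { engine: await createEngine(spec, 'cpu'), backend: 'cpu' };
}
```

Keeping the WebGPU check ahead of engine creation is what turns a confusing load failure into an immediate, explainable error for GPU-only builds.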

Curated catalog: Three models -- gemma-4-E2B (Gemma 4 E2B, 2.0GB), gemma-4-E4B (Gemma 4 E4B, 3.0GB), and qwen3-0.6B (Qwen3 0.6B, 614MB) -- all verified end-to-end in real Chrome. The Gemma 4 entries use the web-optimized *-it-web.litertlm builds Google publishes as the models officially supported by the LiteRT-LM JS API; those builds are WebGPU-only. Qwen3 0.6B is a portable build that also runs on the CPU backend. Any other .litertlm file loads via a HuggingFace shorthand or full URL; gated Google models (Gemma 3n, Gemma 3 1B, FunctionGemma) load via a resolved modelUrl.

Best for: Running Gemma 4 in the browser via Google's first-party engine (on WebGPU-capable hardware), or wiring LiteRT-LM into a provider-fallback chain. The other three providers are longer-established if you need a wider catalog or vision input.

Limitation: Early preview. The @litert-lm/core API surface may still change, the JS API is text-in / text-out (no vision or audio input), Gemma 4 is WebGPU-only, stopSequences is unsupported (LiteRT-LM uses token IDs), and token usage counts are estimated from text length. See the dedicated post, Google's LiteRT-LM in the Browser, for the full details.

Master Comparison Table

Dimension	WebLLM	Transformers.js v4	wllama	LiteRT-LM (Preview)
Engine	MLC-AI WebLLM	ONNX Runtime Web (TJS v4)	llama.cpp WASM	LiteRT-LM (`@litert-lm/core`)
Model format	MLC-compiled (q4f16)	ONNX (q4/fp16)	GGUF (Q4_K_M, Q5, Q6, etc.)	`.litertlm`
Primary runtime	WebGPU (GPU)	WebGPU + WASM fallback	WASM (CPU) + optional WebGPU	WebGPU (Gemma 4); WebGPU or CPU (Qwen3 0.6B)
Curated models	32	16	30 (25 language + 3 embedding + 2 reranker) + 160K+ GGUF	3 (Gemma 4 E2B/E4B, Qwen3 0.6B)
Vision models	1 (Phi 3.5 Vision)	5 (Qwen3.5 family, Gemma 4 E2B/E4B)	4 (Holo2 4B/8B, Gemma 4 E2B/E4B)	None (JS API is text-only)
Custom models	No (MLC-compiled only)	Yes (any ONNX on HF Hub)	Yes (any GGUF on HF Hub)	Yes (any `.litertlm` URL)
Size range	78MB -- 5.06GB	120MB -- 3GB	70MB -- 5.41GB	614MB -- 3.0GB
Inference speed	60-100 tok/s	40-60 tok/s (WebGPU)	5-15 tok/s (CPU), faster with WebGPU	Not yet benchmarked
Min browser	Chrome/Edge 113+, Safari 26+	Chrome 80+, any modern browser	Any browser with WASM	Chrome/Edge 113+ with WebGPU (Gemma 4 is WebGPU-only)
WebGPU required	Yes	No (preferred, not required)	No (optional via `useWebGPU`)	Yes for Gemma 4; no for Qwen3 0.6B
Multi-threading	N/A (GPU)	N/A (GPU or single-thread WASM)	Yes (needs COOP/COEP cross-origin isolation headers)	N/A
Structured output	`generateObject()` / `streamObject()`	`generateObject()` / `streamObject()`	`generateObject()` / `streamObject()`	`generateObject()` / `streamObject()`
AbortSignal	Yes	Yes	Yes	Yes
Streaming	Yes	Yes	Yes	Yes
Maturity	Stable	Stable	Stable	Early preview
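
To put the throughput row in user-facing terms, the sketch below converts the table's ranges into rough generation times for a 300-token reply, the maxTokens value used in the earlier examples. The THROUGHPUT constant and estimateSeconds() helper are hypothetical, and the estimate deliberately ignores model download, load time, and prompt processing.

```typescript
// Throughput ranges (tokens/second) copied from the table above.
const THROUGHPUT = {
  webllm: { min: 60, max: 100 },
  transformersWebGPU: { min: 40, max: 60 },
  wllamaCPU: { min: 5, max: 15 },
} as const;

type Engine = keyof typeof THROUGHPUT;

// Seconds to generate `tokens` output tokens at the fast and slow end of the range.
function estimateSeconds(engine: Engine, tokens: number): { best: number; worst: number } {
  const { min, max } = THROUGHPUT[engine];
  return { best: tokens / max, worst: tokens / min };
}

for (const engine of Object.keys(THROUGHPUT) as Engine[]) {
  const { best, worst } = estimateSeconds(engine, 300);
  console.log(`${engine}: ${best.toFixed(1)}s to ${worst.toFixed(1)}s for 300 tokens`);
}
```

On these figures a 300-token reply takes about 3 to 5 seconds on WebLLM, 5 to 7.5 seconds on Transformers.js with WebGPU, and 20 to 60 seconds on CPU-only wllama.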

Decision Flowchart

Use this text-based flowchart to pick the right provider for your use case:

START
  |
  v
Does the user's browser support WebGPU?
  |
  +-- YES --> Do you need vision (image input)?
  |             |
  |             +-- YES --> Use @localmode/transformers
  |             |           (Qwen3.5 / Gemma 4 ONNX models)
  |             |
  |             +-- NO --> Is inference speed the top priority?
  |                          |
  |                          +-- YES --> Use @localmode/webllm
  |                          |           (fastest: 60-100 tok/s)
  |                          |
  |                          +-- NO --> Do you need a specific
  |                                     fine-tuned GGUF model?
  |                                       |
  |                                       +-- YES --> Use @localmode/wllama
  |                                       +-- NO  --> Use @localmode/webllm
  |
  +-- NO --> Does the app need LLM at all on this device?
               |
               +-- YES --> Use @localmode/wllama
               |           (works everywhere with WASM)
               |
               +-- NO --> Skip LLM, use lighter models
                          (embeddings, classification, etc.)

The flowchart covers the three production-stable providers. @localmode/litert is intentionally left off the default decision path while it is in early preview -- opt into it explicitly when you want to experiment with Google's engine, or add it to a fallback chain so your app picks it up once its catalog grows.
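
The same decision path can be written as a small pure function, which is handy if you want to unit-test the routing. The Needs shape and chooseProvider() name are hypothetical; the branches mirror the flowchart one for one.

```typescript
type Provider =
  | '@localmode/transformers'
  | '@localmode/webllm'
  | '@localmode/wllama'
  | null; // null means: skip the LLM and use lighter models

// Hypothetical input shape: one boolean per question in the flowchart.
interface Needs {
  webgpu: boolean;
  needsLLM: boolean;
  needsVision: boolean;
  speedFirst: boolean;
  needsSpecificGGUF: boolean;
}

function chooseProvider(n: Needs): Provider {
  if (!n.webgpu) {
    // No WebGPU: wllama if an LLM is needed at all, otherwise skip the LLM.
    return n.needsLLM ? '@localmode/wllama' : null;
  }
  if (n.needsVision) return '@localmode/transformers';
  if (n.speedFirst) return '@localmode/webllm';
  return n.needsSpecificGGUF ? '@localmode/wllama' : '@localmode/webllm';
}
```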

Automatic Fallback: Zero Decision Logic in Your App

You can detect capabilities and route to the best available provider at runtime:

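import { isWebGPUSupported } from '@localmode/core';
import { webllm } from '@localmode/webllm';
import { wllama } from '@localmode/wllama';

// Probe once, then pick the fastest provider this device can run.
const hasGPU = await isWebGPUSupported();

const model = hasGPU
  ? webllm.languageModel('Qwen3-4B-q4f16_1-MLC')
  : wllama.languageModel('Qwen2.5-3B-Instruct-Q4_K_M');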
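
A capability check covers the common case, but a provider can still fail for other reasons, so a try/catch chain is a useful second layer. The sketch below is one possible shape, not a LocalMode API: Candidate and firstAvailable() are hypothetical names, and whether a given provider throws at model creation or only on first use is an assumption you would need to verify for your setup.

```typescript
// Hypothetical shape: each candidate lazily creates (and loads) a model, and may throw.
interface Candidate<M> {
  name: string;
  create: () => Promise<M>;
}

async function firstAvailable<M>(
  candidates: Candidate<M>[],
): Promise<{ name: string; model: M }> {
  const failures: string[] = [];
  for (const { name, create } of candidates) {
    try {
      return { name, model: await create() };
    } catch (err) {
      // Record why this provider was skipped, then try the next one.
      failures.push(`${name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`No provider could be initialized (${failures.join('; ')})`);
}

// Possible usage with the providers from this post (not executed here):
// const { model } = await firstAvailable([
//   { name: 'webllm', create: async () => webllm.languageModel('Qwen3-4B-q4f16_1-MLC') },
//   { name: 'wllama', create: async () => wllama.languageModel('Qwen2.5-3B-Instruct-Q4_K_M') },
// ]);
```

Because every candidate resolves to the same interface, the rest of the app only ever sees the model that won.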

When to Use Each Provider

Use WebLLM when you are building a chat application, coding assistant, or any interactive feature where response latency directly impacts user experience. Your users have modern hardware with WebGPU, and you are satisfied with the 32 curated model options. This is the default choice for most consumer-facing applications in 2026.

Use Transformers.js v4 when you need multimodal input (image + text), want to load custom ONNX models, or need the safety net of automatic WASM fallback without maintaining two separate code paths. The Qwen3.5 vision models are currently the best sub-4B multimodal option for the browser.

Use wllama when you need to reach every user regardless of hardware, want access to a specific GGUF model from the 160K+ available on HuggingFace, or are targeting environments where WebGPU is blocked. Corporate intranets, kiosk browsers, and older devices are where wllama shines. It also carries four curated vision-language models: Holo2 4B/8B for UI grounding and Gemma 4 E2B/E4B for vision + tool calling.

Use LiteRT-LM when you want Gemma 4 in the browser via Google's first-party on-device engine, or want the provider wired into a fallback chain. Treat it as early preview -- the other three providers are longer-established if you need a wider catalog or vision input.

Use the fallback chain when you do not control the deployment environment and want the best possible experience on every device. This is the recommended approach for open-ended web applications where you cannot predict your users' hardware.

Regardless of which provider you choose, you get the same core capabilities through @localmode/core:

generateText() and streamText() with identical options and return types
generateObject() and streamObject() for structured JSON output with Zod schema validation
LanguageModelMiddleware for logging, caching, guardrails, and semantic cache
AbortSignal support for cancellation on every call
Usage tracking with input tokens, output tokens, and duration on every response
React hooks via @localmode/react: useGenerateText(), useChat()
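
The cancellation item above builds on the standard AbortController API. The sketch below shows the pattern with a stand-in token generator instead of a real model; tokens() and readUntil() are hypothetical helpers, and the option name under which LocalMode accepts the signal is not shown because this post does not specify it.

```typescript
// Hypothetical token source that honors an AbortSignal between tokens.
async function* tokens(words: string[], signal: AbortSignal): AsyncGenerator<string> {
  for (const word of words) {
    if (signal.aborted) throw new Error('aborted');
    yield word;
  }
}

// Abort once `limit` tokens have arrived, e.g. when the user clicks "Stop".
async function readUntil(
  words: string[],
  limit: number,
): Promise<{ received: string[]; cancelled: boolean }> {
  const controller = new AbortController();
  const received: string[] = [];
  try {
    for await (const word of tokens(words, controller.signal)) {
      received.push(word);
      if (received.length === limit) controller.abort();
    }
    return { received, cancelled: false };
  } catch (err) {
    // Only treat the error as a cancellation if we actually aborted.
    if (!controller.signal.aborted) throw err;
    return { received, cancelled: true };
  }
}
```

Text received before the abort is kept, which matches the usual chat UX of showing a partial answer after the user stops generation.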

The provider is an implementation detail. The interface is the contract. Build against the interface, and your application stays portable across all four engines -- today and as new providers emerge.

Methodology

All model counts, model IDs, size figures, and API shapes were verified directly against the LocalMode codebase (packages/webllm/src/models.ts, packages/wllama/src/models.ts, packages/transformers/src/models.ts, packages/litert/src/models.ts, and packages/core/src/generation/types.ts). WebLLM performance figures (Phi 3.5 Mini at 71.1 tok/s, Llama 3.1 8B at 41.1 tok/s on M3 Max) were verified against Table 1 of the arxiv paper (2412.15803). The FOSDEM 2025 wllama talk URL was verified against the FOSDEM archive. No figures were sourced from unverified secondary summaries.

Sources

MLC-AI WebLLM -- WebGPU LLM inference engine
WebLLM: A High-Performance In-Browser LLM Inference Engine -- primary source for Phi 3.5 Mini (71.1 tok/s) and Llama 3.1 8B (41.1 tok/s) M3 Max benchmarks (Table 1)
wllama -- llama.cpp WebAssembly binding for browser inference
FOSDEM 2025: wllama -- bringing llama.cpp to the web
Transformers.js -- HuggingFace ML framework for the browser
Transformers.js v3: WebGPU Support -- WebGPU integration announcement
LocalMode Transformers Text Generation Docs -- curated ONNX model catalog
LocalMode WebLLM Docs -- curated MLC model catalog
LocalMode wllama Docs -- curated GGUF model catalog and GGUF metadata parser
Google LiteRT-LM -- Google's on-device LLM inference engine
LocalMode LiteRT Docs -- .litertlm model catalog and early-preview status

Try it yourself

Visit localmode.ai to try 30+ AI demo apps running entirely in your browser. No sign-up, no API keys, no data leaves your device.

Read the Getting Started guide to add local AI to your application in under 5 minutes.

Frequently Asked Questions