Text Generation
Run ONNX-format language models directly in the browser using Transformers.js with WebGPU acceleration and automatic WASM fallback.
For the full API reference (generateText(), streamText(), options, result types, and middleware), see the Core Generation guide.
See it in action
Try LLM Chat for a working demo.
4-Tier Fallback
LocalMode offers four LLM providers, each with different trade-offs:
| Provider | Format | Best For | Notes |
|---|---|---|---|
| @localmode/litert | .litertlm (Google) | Google's officially supported Gemma 4 pipeline | 3 verified models; text-only; early preview |
| @localmode/webllm | MLC-compiled | Fastest WebGPU inference | 32 curated models (incl. Phi 3.5 Vision); WebGPU required |
| @localmode/transformers | ONNX | Broad model selection, WebGPU + WASM | 16 curated ONNX LLMs (5 vision); auto-fallback to WASM |
| @localmode/wllama | GGUF (llama.cpp) | Universal browser support | 30 curated (25 language + 3 embedding + 2 reranker) + 160K+ GGUF; true streaming, structured output, reasoning, optional WebGPU, tool calling, vision |
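Falling through the four tiers in the table amounts to trying a list of factories in order and keeping the first one that succeeds. The following is a minimal, generic sketch of that idea. `firstAvailable`, `Tier`, and the stub factories are hypothetical names used for illustration and are not part of any LocalMode package; the stubs stand in for real `languageModel()` calls.

```typescript
// Hypothetical helper, not part of any LocalMode package.
type Factory<T> = () => T | Promise<T>;

interface Tier<T> {
  name: string;
  create: Factory<T>;
}

// Try each tier in order; return the first one whose factory does not throw.
async function firstAvailable<T>(
  tiers: Tier<T>[],
): Promise<{ name: string; value: T }> {
  const errors: string[] = [];
  for (const tier of tiers) {
    try {
      // Awaiting also catches factories that fail asynchronously.
      const value = await tier.create();
      return { name: tier.name, value };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${tier.name}: ${message}`);
    }
  }
  throw new Error(`No tier available. ${errors.join('; ')}`);
}

// Stub factories for illustration only: the first two fail, the third succeeds.
const demoTiers: Tier<string>[] = [
  { name: 'litert', create: () => { throw new Error('unsupported device'); } },
  { name: 'webllm', create: async () => { throw new Error('WebGPU missing'); } },
  { name: 'transformers', create: () => 'onnx-model' },
  { name: 'wllama', create: () => 'gguf-model' },
];
```

Collecting the per-tier error messages makes it easier to report why every tier failed, which a nested try/catch chain tends to hide.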
Use a try/catch chain for automatic failover:
import { litert } from '@localmode/litert';
import { webllm } from '@localmode/webllm';
import { transformers } from '@localmode/transformers';
import { wllama } from '@localmode/wllama';
let model;
try {
  // Google's optimized engine for supported devices
  model = litert.languageModel('qwen3-0.6B');
} catch {
  try {
    model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
  } catch {
    try {
      model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');
    } catch {
      model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
    }
  }
}
Recommended ONNX Models
Tiny (<500MB download)
| Model | Size | Context | Use Case |
|---|---|---|---|
| onnx-community/granite-4.0-350m-ONNX-web | ~120MB | 4K | IBM Granite 4.0 350M, 12 languages, fastest download |
Small (500MB–1GB)
| Model | Size | Context | Use Case |
|---|---|---|---|
| onnx-community/Qwen3.5-0.8B-ONNX | ~500MB | 32K | Vision: best sub-1B multimodal, WebGPU recommended |
| onnx-community/Qwen3-0.6B-ONNX | ~570MB | 4K | Smallest Qwen3 text model, fast |
| onnx-community/granite-4.0-1b-ONNX-web | ~350MB | 4K | IBM Granite 4.0 1B, 12 languages |
| onnx-community/Llama-3.2-1B-Instruct-ONNX | ~380MB | 8K | Meta Llama 3.2 1B, q4f16, good general quality |
| onnx-community/TinyLlama-1.1B-Chat-v1.0-ONNX | ~350MB | 2K | Small, fast, q4f16, no login required |
| onnx-community/Qwen2.5-Coder-1.5B-Instruct | ~450MB | 4K | Code-specialized Qwen2.5, q4f16, great for programming |
| onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX | ~500MB | 4K | DeepSeek-R1 distilled, q4f16, strong reasoning |
Medium (1–2.5GB)
| Model | Size | Context | Use Case |
|---|---|---|---|
| onnx-community/Qwen3.5-2B-ONNX | ~1.5GB | 32K | Vision: high-quality 2B multimodal, needs 4GB+ RAM |
| onnx-community/Llama-3.2-3B-Instruct-ONNX | ~900MB | 8K | Meta Llama 3.2 3B, q4f16, strong quality |
| onnx-community/Phi-4-mini-instruct-web-q4f16 | ~2.3GB | 4K | Microsoft Phi-4 Mini, strong reasoning/coding, WebGPU recommended |
| microsoft/Phi-3-mini-4k-instruct-onnx-web | ~1.2GB | 4K | Microsoft Phi-3, q4, strong reasoning |
| onnx-community/Qwen3-4B-ONNX | ~1.2GB | 4K | Qwen3 4B text, q4f16, needs 4GB+ RAM |
Large (2.5GB+)
| Model | Size | Context | Use Case |
|---|---|---|---|
| onnx-community/Qwen3.5-4B-ONNX | ~2.5GB | 32K | Vision: best Qwen3.5 for browser, multimodal, needs 8GB+ RAM |
| onnx-community/gemma-4-E2B-it-ONNX | ~1.5GB | 128K | Vision: Google Gemma 4 E2B multimodal, 128K context, needs 6GB+ VRAM |
| onnx-community/gemma-4-E4B-it-ONNX | ~3GB | 128K | Vision: Google Gemma 4 E4B multimodal, 128K context, needs 8GB+ VRAM |
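The tables above can also be treated as data when an app needs to choose a model at runtime. The sketch below copies a few rows (sizes converted to approximate megabytes) into a hypothetical `catalog` array and picks the largest entry that fits a download budget. `CatalogEntry` and `pickModel` are illustrative names, and using download size as a stand-in for quality is an assumption, not a rule from LocalMode.

```typescript
// Hypothetical catalog shape; the rows are copied from the tables in this guide.
interface CatalogEntry {
  id: string;
  sizeMB: number; // approximate download size
  vision: boolean;
}

const catalog: CatalogEntry[] = [
  { id: 'onnx-community/granite-4.0-350m-ONNX-web', sizeMB: 120, vision: false },
  { id: 'onnx-community/Llama-3.2-1B-Instruct-ONNX', sizeMB: 380, vision: false },
  { id: 'onnx-community/Qwen3.5-0.8B-ONNX', sizeMB: 500, vision: true },
  { id: 'onnx-community/Qwen3.5-2B-ONNX', sizeMB: 1500, vision: true },
  { id: 'onnx-community/Phi-4-mini-instruct-web-q4f16', sizeMB: 2300, vision: false },
];

// Pick the largest model within the budget, optionally requiring vision support.
// Assumption: a larger download is used here as a rough proxy for quality.
function pickModel(
  entries: CatalogEntry[],
  budgetMB: number,
  needsVision: boolean,
): CatalogEntry | undefined {
  return entries
    .filter((e) => e.sizeMB <= budgetMB && (!needsVision || e.vision))
    .sort((a, b) => b.sizeMB - a.sizeMB)[0];
}
```

The function returns `undefined` when nothing fits, so callers can prompt the user or fall back to another provider.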
Gemma 4 ONNX performance
Gemma 4 via Transformers.js reaches ~2 tok/s on WebGPU, significantly slower than the LiteRT path (~14-16 tok/s). For text-only workloads, use @localmode/litert. The ONNX path remains valuable for vision/multimodal support and for WASM fallback when WebGPU is unavailable. See Gemma 4: LiteRT vs ONNX for a detailed comparison.
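To make the throughput gap concrete, here is a small sketch that turns the figures above into a wait-time estimate and encodes the stated guidance as a decision function. `chooseGemmaPath` and `estimateSeconds` are hypothetical helpers, and treating LiteRT as unavailable without WebGPU is an assumption inferred from the note above rather than a documented requirement.

```typescript
type GemmaPath = 'litert' | 'onnx';

// Encodes the guidance above: ONNX for vision input or when WebGPU is missing,
// LiteRT for text-only speed. Assumption: LiteRT is not usable without WebGPU.
function chooseGemmaPath(opts: { needsVision: boolean; hasWebGPU: boolean }): GemmaPath {
  if (opts.needsVision || !opts.hasWebGPU) return 'onnx';
  return 'litert';
}

// Rough generation time in seconds for a given output length and throughput.
function estimateSeconds(tokens: number, tokensPerSecond: number): number {
  return tokens / tokensPerSecond;
}

// A 200-token reply at ~2 tok/s (ONNX) versus ~15 tok/s (midpoint of the LiteRT range).
const onnxSeconds = estimateSeconds(200, 2);
const litertSeconds = estimateSeconds(200, 15);
```

With these numbers a 200-token reply takes about 100 seconds on the ONNX path and about 13 seconds on LiteRT, which is why the text-only recommendation favors LiteRT.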
Quick Start
import { transformers } from '@localmode/transformers';
import { generateText } from '@localmode/core';
const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');
const { text } = await generateText({
  model,
  prompt: 'Explain quantum computing in simple terms',
  maxTokens: 200,
});
Streaming
import { transformers } from '@localmode/transformers';
import { streamText } from '@localmode/core';
const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');
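const result = await streamText({
  model,
  prompt: 'Write a short story about a robot learning to paint',
  maxTokens: 500,
});
// process.stdout is not available in browsers, so accumulate the chunks instead
let output = '';
for await (const chunk of result.stream) {
  output += chunk.text; // update your UI with `output` here
}
Device Selection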
The provider automatically selects WebGPU when available, falling back to WASM:
// Automatic: WebGPU if available, WASM fallback
const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');
// Force WASM (for testing or compatibility)
const wasmModel = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX', {
  device: 'wasm',
});
// Force WebGPU (will error if unavailable)
const gpuModel = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX', {
  device: 'webgpu',
});
Custom Settings
import { createTransformers } from '@localmode/transformers';
const myTransformers = createTransformers({
  device: 'webgpu',
  onProgress: (p) => console.log(`Loading: ${p.progress}%`),
});
const model = myTransformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX', {
  contextLength: 32768,
  maxTokens: 1024,
  temperature: 0.7,
  topP: 0.95,
  systemPrompt: 'You are a helpful assistant.',
});
Vision (Image Input)
Qwen3.5 and Gemma 4 models support image input via their built-in vision encoder. Check model.supportsVision for feature detection:
import { transformers } from '@localmode/transformers';
import { streamText } from '@localmode/core';
const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');
// base64Data: the base64-encoded image, prepared elsewhere
if (model.supportsVision) {
  const result = await streamText({
    model,
    prompt: '',
    messages: [{
      role: 'user',
      content: [
        { type: 'text', text: 'Describe this image.' },
        { type: 'image', data: base64Data, mimeType: 'image/png' },
      ],
    }],
  });
}
For the full multimodal API reference, including ContentPart types and content utilities, see the Core Generation guide.
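The vision example leaves open where `base64Data` comes from. One possible way to produce it from raw image bytes is sketched below. `toBase64`, `imagePart`, and the `ImagePart` interface are hypothetical helpers whose shape mirrors the image entry in the example, and the sketch assumes the `data` field expects plain base64 without a `data:` URL prefix.

```typescript
// Shape mirrors the image entry in the vision example; not an official type.
interface ImagePart {
  type: 'image';
  data: string; // assumed: plain base64, no "data:" URL prefix
  mimeType: string;
}

// Encode raw bytes as base64 using the standard btoa() global.
function toBase64(bytes: Uint8Array): string {
  let binary = '';
  bytes.forEach((b) => {
    binary += String.fromCharCode(b);
  });
  return btoa(binary);
}

// Build an image content part from raw bytes and a MIME type.
function imagePart(bytes: Uint8Array, mimeType: string): ImagePart {
  return { type: 'image', data: toBase64(bytes), mimeType };
}
```

In a browser, the bytes for a user-selected file can be obtained with `new Uint8Array(await file.arrayBuffer())` before calling `imagePart(bytes, file.type)`.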
Using Any ONNX Model
The curated catalog is a starting point. You can also load other ONNX models from the Hugging Face Hub by repository ID:
// Any ONNX model from HuggingFace Hub
const model = transformers.languageModel('your-org/custom-model-ONNX');
Not all ONNX models work in the browser. The model must support the Transformers.js text-generation pipeline. Stick to the recommended models above for best results.
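Because an uncurated model may fail to load or produce nothing useful, it can help to run a short smoke test before relying on it. The sketch below is one way to do that. `smokeTest` and the `Generate` type are hypothetical, and the generation function is injected so the helper makes no assumptions about any provider's API.

```typescript
// Hypothetical helper; `Generate` wraps whatever generation call the app uses.
type Generate = (prompt: string) => Promise<string>;

// Returns true if the model answers a trivial prompt with non-empty text
// before the timeout; returns false on errors, timeouts, or empty output.
async function smokeTest(generate: Generate, timeoutMs = 30000): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('timed out')), timeoutMs);
  });
  try {
    const text = await Promise.race([generate('Say hello.'), timeout]);
    return text.trim().length > 0;
  } catch {
    return false;
  } finally {
    clearTimeout(timer);
  }
}
```

With the Quick Start API, the injected function could be `(prompt) => generateText({ model, prompt, maxTokens: 8 }).then((r) => r.text)`; if the check fails, fall back to one of the recommended models.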
Showcase Apps
The LLM Chat showcase app demonstrates ONNX models alongside the WebLLM, wllama, and LiteRT backends. Select a Qwen3.5 or Gemma 4 model to try vision input with image uploads.