Text Generation

Run ONNX-format language models directly in the browser using Transformers.js with WebGPU acceleration and automatic WASM fallback.

For the full API reference (generateText(), streamText(), options, result types, and middleware), see the Core Generation guide.

See it in action

Try LLM Chat for a working demo.

4-Tier Fallback

LocalMode offers four LLM providers, each with different trade-offs:

| Provider | Format | Best For | Notes |
| --- | --- | --- | --- |
| @localmode/litert | .litertlm (Google) | Google's officially supported Gemma 4 pipeline | 3 verified models; text-only; early preview |
| @localmode/webllm | MLC-compiled | Fastest WebGPU inference | 32 curated models (incl. Phi 3.5 Vision); WebGPU required |
| @localmode/transformers | ONNX | Broad model selection, WebGPU + WASM | 16 curated ONNX LLMs (5 vision); auto-fallback to WASM |
| @localmode/wllama | GGUF (llama.cpp) | Universal browser support | 30 curated (25 language + 3 embedding + 2 reranker) + 160K+ GGUF; true streaming, structured output, reasoning, optional WebGPU, tool calling, vision |

Use a try/catch chain for automatic failover:

import { litert } from '@localmode/litert';
import { webllm } from '@localmode/webllm';
import { transformers } from '@localmode/transformers';
import { wllama } from '@localmode/wllama';

let model;
try {
  // Google's optimized engine for supported devices
  model = litert.languageModel('qwen3-0.6B');
} catch {
  try {
    model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
  } catch {
    try {
      model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');
    } catch {
      model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
    }
  }
}
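
The nested chain above can also be written as a flat loop over candidates. The following is a minimal sketch, not part of the LocalMode API: `firstAvailable` and the `Candidate` type are hypothetical names, and the approach assumes, as the chain above implies, that `languageModel()` throws synchronously when a backend is unavailable. If failures only surface when the model loads or generates, the fallback would need to wrap those calls instead.

```typescript
// Hypothetical helper: returns the first candidate whose factory does not throw.
type Candidate<T> = { name: string; create: () => T };

function firstAvailable<T>(candidates: Candidate<T>[]): { name: string; value: T } {
  const errors: string[] = [];
  for (const { name, create } of candidates) {
    try {
      return { name, value: create() };
    } catch (err) {
      // Remember why this candidate failed, then try the next one
      errors.push(`${name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`No provider available:\n${errors.join('\n')}`);
}

// Usage with the four providers (imports as in the snippet above):
// const { name, value: model } = firstAvailable([
//   { name: 'litert', create: () => litert.languageModel('qwen3-0.6B') },
//   { name: 'webllm', create: () => webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC') },
//   { name: 'transformers', create: () => transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX') },
//   { name: 'wllama', create: () => wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M') },
// ]);
```

Unlike the nested form, the loop reports which provider was chosen and collects every failure reason when none works.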

Tiny (<500MB download)

| Model | Size | Context | Use Case |
| --- | --- | --- | --- |
| onnx-community/granite-4.0-350m-ONNX-web | ~120MB | 4K | IBM Granite 4.0 350M, 12 languages, fastest download |

Small (500MB–1GB)

| Model | Size | Context | Use Case |
| --- | --- | --- | --- |
| onnx-community/Qwen3.5-0.8B-ONNX | ~500MB | 32K | Vision: best sub-1B multimodal, WebGPU recommended |
| onnx-community/Qwen3-0.6B-ONNX | ~570MB | 4K | Smallest Qwen3 text model, fast |
| onnx-community/granite-4.0-1b-ONNX-web | ~350MB | 4K | IBM Granite 4.0 1B, 12 languages |
| onnx-community/Llama-3.2-1B-Instruct-ONNX | ~380MB | 8K | Meta Llama 3.2 1B, q4f16, good general quality |
| onnx-community/TinyLlama-1.1B-Chat-v1.0-ONNX | ~350MB | 2K | Small, fast, q4f16, no login required |
| onnx-community/Qwen2.5-Coder-1.5B-Instruct | ~450MB | 4K | Code-specialized Qwen2.5, q4f16, great for programming |
| onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX | ~500MB | 4K | DeepSeek-R1 distilled, q4f16, strong reasoning |

Medium (1–2.5GB)

| Model | Size | Context | Use Case |
| --- | --- | --- | --- |
| onnx-community/Qwen3.5-2B-ONNX | ~1.5GB | 32K | Vision: high-quality 2B multimodal, needs 4GB+ RAM |
| onnx-community/Llama-3.2-3B-Instruct-ONNX | ~900MB | 8K | Meta Llama 3.2 3B, q4f16, strong quality |
| onnx-community/Phi-4-mini-instruct-web-q4f16 | ~2.3GB | 4K | Microsoft Phi-4 Mini, strong reasoning/coding, WebGPU recommended |
| microsoft/Phi-3-mini-4k-instruct-onnx-web | ~1.2GB | 4K | Microsoft Phi-3, q4, strong reasoning |
| onnx-community/Qwen3-4B-ONNX | ~1.2GB | 4K | Qwen3 4B text, q4f16, needs 4GB+ RAM |

Large (2.5GB+)

| Model | Size | Context | Use Case |
| --- | --- | --- | --- |
| onnx-community/Qwen3.5-4B-ONNX | ~2.5GB | 32K | Vision: best Qwen3.5 for browser, multimodal, needs 8GB+ RAM |
| onnx-community/gemma-4-E2B-it-ONNX | ~1.5GB | 128K | Vision: Google Gemma 4 E2B multimodal, 128K context, needs 6GB+ VRAM |
| onnx-community/gemma-4-E4B-it-ONNX | ~3GB | 128K | Vision: Google Gemma 4 E4B multimodal, 128K context, needs 8GB+ VRAM |

Gemma 4 ONNX performance

Gemma 4 via Transformers.js reaches roughly 2 tok/s on WebGPU, significantly slower than the LiteRT path (~14–16 tok/s). For text-only workloads where speed matters, use @localmode/litert. The ONNX path remains valuable for vision/multimodal support and for WASM fallback when WebGPU is unavailable. See Gemma 4: LiteRT vs ONNX for a detailed comparison.
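
To make the throughput gap concrete, here is a rough back-of-the-envelope sketch. `estimateSeconds` is a hypothetical helper, and the estimate covers token generation only: it ignores model download, load time, and prompt processing, and real throughput varies by device.

```typescript
// Hypothetical helper: generation time = tokens / throughput
function estimateSeconds(tokens: number, tokensPerSecond: number): number {
  if (tokensPerSecond <= 0) throw new RangeError('tokensPerSecond must be positive');
  return tokens / tokensPerSecond;
}

const responseTokens = 200; // a 200-token response

const onnxSeconds = estimateSeconds(responseTokens, 2);        // 100 s at ~2 tok/s
const litertSlowSeconds = estimateSeconds(responseTokens, 14); // about 14.3 s at ~14 tok/s
const litertFastSeconds = estimateSeconds(responseTokens, 16); // 12.5 s at ~16 tok/s

console.log({ onnxSeconds, litertSlowSeconds, litertFastSeconds });
```

At these rates a 200-token answer takes well over a minute on the ONNX path versus under 15 seconds on LiteRT, which is why the ONNX path is best reserved for cases that need vision input or WASM fallback.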

Quick Start

import { transformers } from '@localmode/transformers';
import { generateText } from '@localmode/core';

const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');

const { text } = await generateText({
  model,
  prompt: 'Explain quantum computing in simple terms',
  maxTokens: 200,
});

Streaming

import { transformers } from '@localmode/transformers';
import { streamText } from '@localmode/core';

const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');

const result = await streamText({
  model,
  prompt: 'Write a short story about a robot learning to paint',
  maxTokens: 500,
});

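// process.stdout exists only in Node.js; in the browser, accumulate the text instead
let text = '';
for await (const chunk of result.stream) {
  text += chunk.text; // re-render your UI with `text` here
}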
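
When both incremental rendering and the final text are needed, the loop can be wrapped in a small helper. This is a sketch under one assumption taken from the snippet above: each chunk exposes a `text` string. `collectText`, `TextChunk`, and `fakeStream` are illustrative names, not part of @localmode/core.

```typescript
type TextChunk = { text: string };

// Consumes an async stream of chunks, reporting the partial text after each one
async function collectText(
  stream: AsyncIterable<TextChunk>,
  onChunk?: (partial: string) => void,
): Promise<string> {
  let full = '';
  for await (const chunk of stream) {
    full += chunk.text;
    onChunk?.(full);
  }
  return full;
}

// Stand-in stream for demonstration; in an app, pass result.stream instead
async function* fakeStream(): AsyncGenerator<TextChunk> {
  yield { text: 'Hello' };
  yield { text: ', ' };
  yield { text: 'world' };
}

collectText(fakeStream(), (partial) => console.log(partial)).then((finalText) => {
  console.log('done:', finalText);
});
```

In a page, the `onChunk` callback is a natural place to update an element's `textContent`.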

Device Selection

The provider automatically selects WebGPU when available, falling back to WASM:

import { transformers } from '@localmode/transformers';

// Automatic: WebGPU if available, WASM fallback
const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');

// Force WASM (for testing or compatibility)
const wasmModel = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX', {
  device: 'wasm',
});

// Force WebGPU (will error if unavailable)
const gpuModel = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX', {
  device: 'webgpu',
});
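
The provider handles device selection itself, but an app may want to know ahead of time which device to expect, for example to show a "running on CPU" notice. The sketch below uses the standard WebGPU entry point, `navigator.gpu.requestAdapter()`, which resolves to `null` when no suitable adapter exists. `pickDevice` and the `NavigatorLike` type are hypothetical names, and the check does not necessarily mirror the provider's internal logic.

```typescript
// Minimal structural types so the sketch compiles without WebGPU type definitions
type GpuLike = { requestAdapter: () => Promise<unknown> };
type NavigatorLike = { gpu?: GpuLike };

// Returns 'webgpu' only if an adapter can actually be obtained
async function pickDevice(nav: NavigatorLike | undefined): Promise<'webgpu' | 'wasm'> {
  if (!nav?.gpu) return 'wasm'; // API not exposed by this browser
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? 'webgpu' : 'wasm'; // null means no usable adapter
  } catch {
    return 'wasm'; // treat any failure as "WebGPU unavailable"
  }
}

// In a browser:
// const device = await pickDevice((globalThis as { navigator?: NavigatorLike }).navigator);
// const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX', { device });
```

Passing the navigator object in as a parameter keeps the function easy to test with stand-in objects.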

Custom Settings

import { createTransformers } from '@localmode/transformers';

const myTransformers = createTransformers({
  device: 'webgpu',
  onProgress: (p) => console.log(`Loading: ${p.progress}%`),
});

const model = myTransformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX', {
  contextLength: 32768,
  maxTokens: 1024,
  temperature: 0.7,
  topP: 0.95,
  systemPrompt: 'You are a helpful assistant.',
});
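
The `onProgress` callback above prints the raw value. A small formatter can make the output steadier in a UI. This is a sketch that assumes `progress` is a number on a 0 to 100 scale, as the template string above suggests, and that it may be fractional or missing on some events; `formatProgress` and `LoadProgress` are hypothetical names.

```typescript
// Assumed shape of the progress event; only `progress` is used here
type LoadProgress = { progress?: number };

function formatProgress(p: LoadProgress): string {
  // Fall back to a generic label when no numeric progress is available
  if (typeof p.progress !== 'number' || Number.isNaN(p.progress)) return 'Loading…';
  // Clamp to [0, 100] and round to a whole percent
  const pct = Math.round(Math.min(100, Math.max(0, p.progress)));
  return `Loading: ${pct}%`;
}

console.log(formatProgress({ progress: 42.37 })); // Loading: 42%
console.log(formatProgress({}));                  // Loading…
```

It can be passed as `onProgress: (p) => console.log(formatProgress(p))`.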

Vision (Image Input)

Qwen3.5 and Gemma 4 models support image input via their built-in vision encoder. Check model.supportsVision for feature detection:

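import { transformers } from '@localmode/transformers';
import { streamText } from '@localmode/core';

// base64Data (used below) is the base64-encoded image, supplied by your app
const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');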

if (model.supportsVision) {
  const result = await streamText({
    model,
    prompt: '',
    messages: [{
      role: 'user',
      content: [
        { type: 'text', text: 'Describe this image.' },
        { type: 'image', data: base64Data, mimeType: 'image/png' },
      ],
    }],
  });
}

For the full multimodal API reference, including ContentPart types and content utilities, see the Core Generation guide.
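
The vision example passes the image as `base64Data`. One way to produce that value from an uploaded file is sketched below. It assumes the `data` field expects plain base64 without a `data:` URL prefix, which this page does not confirm; `bytesToBase64` and `blobToBase64` are hypothetical helpers built on the standard `btoa` and `Blob.arrayBuffer()` APIs.

```typescript
// Encodes raw bytes as base64 using the built-in btoa
function bytesToBase64(bytes: Uint8Array): string {
  let binary = '';
  bytes.forEach((b) => {
    binary += String.fromCharCode(b); // one Latin-1 character per byte
  });
  return btoa(binary);
}

// Reads a Blob or File (e.g. from <input type="file">) and encodes its contents
async function blobToBase64(blob: Blob): Promise<string> {
  return bytesToBase64(new Uint8Array(await blob.arrayBuffer()));
}

console.log(bytesToBase64(new Uint8Array([72, 105]))); // SGk=
```

With a file input, `const base64Data = await blobToBase64(file)` yields the string, and `file.type` can supply the `mimeType`.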

Using Any ONNX Model

The curated catalog is a starting point. You can also load other ONNX models from the Hugging Face Hub by passing their repository ID:

// Any ONNX model from HuggingFace Hub
const model = transformers.languageModel('your-org/custom-model-ONNX');

Not all ONNX models work in the browser. The model must support the Transformers.js text-generation pipeline. Stick to the recommended models above for best results.

Showcase Apps

The LLM Chat showcase app demonstrates ONNX models alongside the WebLLM, wllama, and LiteRT backends. Select a Qwen3.5 or Gemma 4 model to try vision input with image uploads.

| App | Description | Links |
| --- | --- | --- |
| LLM Chat | Chat with local ONNX language models, including vision input on Qwen3.5 | Demo · Source |
