
Text Generation (Experimental)

Run ONNX-format language models directly in the browser using Transformers.js v4 with WebGPU acceleration and automatic WASM fallback.

Experimental: This feature uses Transformers.js v4, which is currently a preview release (@next tag). The API may change in future releases.

For full API reference (generateText(), streamText(), options, result types, and middleware), see the Core Generation guide.

See it in action

Try LLM Chat for a working demo.

3-Tier Fallback

LocalMode offers three LLM providers, each with different trade-offs:

| Provider | Format | Best For | Speed |
| --- | --- | --- | --- |
| @localmode/webllm | MLC-compiled | Fastest inference, limited models | 60-100 tok/s |
| @localmode/transformers | ONNX (TJS v4) | Broad model selection, WebGPU | 40-60 tok/s |
| @localmode/wllama | GGUF (llama.cpp) | Universal browser support | 5-15 tok/s |

Use createProviderWithFallback() for automatic failover:

import { createProviderWithFallback } from '@localmode/core';
import { webllm } from '@localmode/webllm';
import { transformers } from '@localmode/transformers';
import { wllama } from '@localmode/wllama';

const model = await createProviderWithFallback({
  providers: [
    () => webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
    () => transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX'),
    () => wllama.languageModel('bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_K_M'),
  ],
  onFallback: (error, idx) => console.warn(`Provider ${idx} failed:`, error),
});

Tiny (<500MB download)

| Model | Size | Context | Use Case |
| --- | --- | --- | --- |
| onnx-community/granite-4.0-350m-ONNX-web | ~120MB | 4K | IBM Granite 4.0 350M, 12 languages, fastest download |

Small (500MB–1GB)

| Model | Size | Context | Use Case |
| --- | --- | --- | --- |
| onnx-community/Qwen3.5-0.8B-ONNX | ~500MB | 32K | Vision — best sub-1B multimodal, WebGPU recommended |
| onnx-community/Qwen3-0.6B-ONNX | ~570MB | 4K | Smallest Qwen3 text model, fast |
| onnx-community/granite-4.0-1b-ONNX-web | ~350MB | 4K | IBM Granite 4.0 1B, 12 languages |
| onnx-community/Llama-3.2-1B-Instruct-ONNX | ~380MB | 8K | Meta Llama 3.2 1B, q4f16, good general quality |
| onnx-community/TinyLlama-1.1B-Chat-v1.0-ONNX | ~350MB | 2K | Small, fast, q4f16, no login required |
| onnx-community/Qwen2.5-Coder-1.5B-Instruct | ~450MB | 4K | Code-specialized Qwen2.5, q4f16, great for programming |
| onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX | ~500MB | 4K | DeepSeek-R1 distilled, q4f16, strong reasoning |

Medium (1–2.5GB)

| Model | Size | Context | Use Case |
| --- | --- | --- | --- |
| onnx-community/Qwen3.5-2B-ONNX | ~1.5GB | 32K | Vision — high quality 2B multimodal, needs 4GB+ RAM |
| onnx-community/Llama-3.2-3B-Instruct-ONNX | ~900MB | 8K | Meta Llama 3.2 3B, q4f16, strong quality |
| onnx-community/Phi-4-mini-instruct-web-q4f16 | ~2.3GB | 4K | Microsoft Phi-4 Mini, strong reasoning/coding, WebGPU recommended |
| microsoft/Phi-3-mini-4k-instruct-onnx-web | ~1.2GB | 4K | Microsoft Phi-3, q4, strong reasoning |
| onnx-community/Qwen3-4B-ONNX | ~1.2GB | 4K | Qwen3 4B text, q4f16, needs 4GB+ RAM |

Large (2.5GB+)

| Model | Size | Context | Use Case |
| --- | --- | --- | --- |
| onnx-community/Qwen3.5-4B-ONNX | ~2.5GB | 32K | Vision — best Qwen3.5 for browser, multimodal, needs 8GB+ RAM |
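The size tiers above can drive a sensible client-side default. A minimal sketch, with the thresholds taken from the RAM notes in the tables ("needs 4GB+ RAM", "needs 8GB+ RAM") — note that `navigator.deviceMemory` is Chromium-only and capped at 8, so treat it as a hint, not a guarantee the model fits:

```typescript
// Pick a curated model id from the tiers above based on reported device
// memory in GB. A rough heuristic: undefined or low memory gets the
// smallest text model, mid-range gets the 2B, 8GB+ gets the 4B.
function pickDefaultModel(deviceMemoryGb: number | undefined): string {
  if (!deviceMemoryGb || deviceMemoryGb < 4) {
    return 'onnx-community/Qwen3-0.6B-ONNX';
  }
  if (deviceMemoryGb < 8) {
    return 'onnx-community/Qwen3.5-2B-ONNX'; // "needs 4GB+ RAM"
  }
  return 'onnx-community/Qwen3.5-4B-ONNX';   // "needs 8GB+ RAM"
}

// In the browser:
// const id = pickDefaultModel((navigator as any).deviceMemory);
```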

Quick Start

import { transformers } from '@localmode/transformers';
import { generateText } from '@localmode/core';

const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');

const { text } = await generateText({
  model,
  prompt: 'Explain quantum computing in simple terms',
  maxTokens: 200,
});

Streaming

import { transformers } from '@localmode/transformers';
import { streamText } from '@localmode/core';

const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');

const result = await streamText({
  model,
  prompt: 'Write a short story about a robot learning to paint',
  maxTokens: 500,
});

for await (const chunk of result.stream) {
  // In Node; in the browser, append chunk.text to the DOM instead
  process.stdout.write(chunk.text);
}

Device Selection

The provider automatically selects WebGPU when available, falling back to WASM:

// Automatic: WebGPU if available, WASM fallback
const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');

// Force WASM (for testing or compatibility)
const wasmModel = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX', {
  device: 'wasm',
});

// Force WebGPU (will error if unavailable)
const gpuModel = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX', {
  device: 'webgpu',
});
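If you want to decide before instantiating the model (for example, to warn users on WASM-only browsers before a large download), you can probe WebGPU yourself. A sketch using the standard `navigator.gpu` entry point; the adapter check catches browsers that expose the API but have no usable GPU:

```typescript
// Returns a device string suitable for the { device } option above.
async function detectDevice(): Promise<'webgpu' | 'wasm'> {
  const gpu = (globalThis as any).navigator?.gpu;
  if (!gpu) return 'wasm'; // WebGPU API not exposed at all
  try {
    const adapter = await gpu.requestAdapter(); // null if no usable GPU
    return adapter ? 'webgpu' : 'wasm';
  } catch {
    return 'wasm'; // adapter request failed
  }
}
```

Usage: `transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX', { device: await detectDevice() })`.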

Custom Settings

import { createTransformers } from '@localmode/transformers';

const myTransformers = createTransformers({
  device: 'webgpu',
  onProgress: (p) => console.log(`Loading: ${p.progress}%`),
});

const model = myTransformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX', {
  contextLength: 32768,
  maxTokens: 1024,
  temperature: 0.7,
  topP: 0.95,
  systemPrompt: 'You are a helpful assistant.',
});
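The `onProgress` callback can fire many times during a multi-hundred-megabyte download, so it is worth deduplicating updates before touching the UI. A sketch, assuming the `{ progress }` percentage shape shown in the example above:

```typescript
// Wrap a render function so it only runs when the rounded percentage
// actually changes, instead of on every raw progress event.
function throttleProgress(render: (pct: number) => void) {
  let last = -1;
  return (p: { progress: number }) => {
    const pct = Math.round(p.progress);
    if (pct !== last) {
      last = pct;
      render(pct);
    }
  };
}

// const tf = createTransformers({
//   onProgress: throttleProgress((pct) => { bar.value = pct; }),
// });
```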

Vision (Image Input)

Qwen3.5 models support image input via their built-in vision encoder. Check model.supportsVision for feature detection:

const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');

if (model.supportsVision) {
  const result = await streamText({
    model,
    prompt: '',
    messages: [{
      role: 'user',
      content: [
        { type: 'text', text: 'Describe this image.' },
        // base64Data: the image bytes as a base64 string (no data: prefix)
        { type: 'image', data: base64Data, mimeType: 'image/png' },
      ],
    }],
  });

  for await (const chunk of result.stream) {
    process.stdout.write(chunk.text);
  }
}

For full multimodal API reference including ContentPart types and content utilities, see the Core Generation guide.
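Image data often arrives as a data URL (from `FileReader.readAsDataURL` or `canvas.toDataURL`), while the message format above wants raw base64 plus a `mimeType`. A small converter, assuming the content-part shape from the example — this helper is not part of the LocalMode API:

```typescript
// Split "data:image/png;base64,AAAA..." into the { type, data, mimeType }
// image part used in the messages array above.
function dataUrlToImagePart(dataUrl: string) {
  const match = /^data:([^;,]+);base64,(.+)$/.exec(dataUrl);
  if (!match) throw new Error('Expected a base64 data URL');
  return { type: 'image' as const, mimeType: match[1], data: match[2] };
}
```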

Using Any ONNX Model

The curated catalog is a starting point. You can use any HuggingFace ONNX model:

// Any ONNX model from HuggingFace Hub
const model = transformers.languageModel('your-org/custom-model-ONNX');

Not all ONNX models work in the browser. The model must support the TJS v4 text-generation pipeline. Stick to the recommended models above for best results.
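`createProviderWithFallback()` (shown earlier) handles failover across providers. If you want the same behavior across models within a single provider — try the custom model, fall back to a curated one if it fails to load — a small generic helper works. A sketch, not part of the API:

```typescript
// Try each async loader in order; return the first that succeeds,
// or rethrow the last error if they all fail.
async function firstWorking<T>(loaders: Array<() => Promise<T>>): Promise<T> {
  let lastError: unknown;
  for (const load of loaders) {
    try {
      return await load();
    } catch (err) {
      lastError = err; // remember and try the next loader
    }
  }
  throw lastError;
}

// const model = await firstWorking([
//   () => transformers.languageModel('your-org/custom-model-ONNX'),
//   () => transformers.languageModel('onnx-community/Qwen3-0.6B-ONNX'),
// ]);
```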

Showcase Apps

The LLM Chat showcase app demonstrates ONNX models alongside WebLLM and wllama backends. Select a Qwen3.5 model to try vision input with image uploads.

| App | Description | Links |
| --- | --- | --- |
| LLM Chat | Chat with local ONNX language models, including vision input on Qwen3.5 | Demo · Source |
