Text Generation

Run ONNX-format language models directly in the browser using Transformers.js with WebGPU acceleration and automatic WASM fallback.

For full API reference (generateText(), streamText(), options, result types, and middleware), see the Core Generation guide.

See it in action

Try LLM Chat for a working demo.

4-Tier Fallback

LocalMode offers four LLM providers, each with different trade-offs:

Provider	Format	Best For	Notes
`@localmode/litert`	`.litertlm` (Google)	Google's officially-supported Gemma 4 pipeline	3 verified models; text-only; early preview
`@localmode/webllm`	MLC-compiled	Fastest WebGPU inference	32 curated models (incl. Phi 3.5 Vision); WebGPU required
`@localmode/transformers`	ONNX	Broad model selection, WebGPU + WASM	16 curated ONNX LLMs (5 vision); auto-fallback to WASM
`@localmode/wllama`	GGUF (llama.cpp)	Universal browser support	30 curated (25 language + 3 embedding + 2 reranker) + 160K+ GGUF; true streaming, structured output, reasoning, optional WebGPU, tool calling, vision

Use a try/catch chain for automatic failover:

import { litert } from '@localmode/litert';
import { webllm } from '@localmode/webllm';
import { transformers } from '@localmode/transformers';
import { wllama } from '@localmode/wllama';

let model;
try {
  // Google's optimized engine for supported devices
  model = litert.languageModel('qwen3-0.6B');
} catch {
  try {
    model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
  } catch {
    try {
      model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');
    } catch {
      model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
    }
  }
}
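The nested chain above can also be written as a loop over candidates. The following is a minimal sketch, not part of the LocalMode API: `firstAvailable` is a hypothetical helper, and it assumes, as the chain above does, that an unsupported provider throws synchronously when the model is created. If a provider reports failure only when the model is first loaded or used, that step would need to be wrapped as well.

```typescript
// Hypothetical helper (not part of LocalMode): returns the result of the
// first factory that does not throw, trying them in the given order.
function firstAvailable<T>(factories: Array<() => T>): T {
  const errors: unknown[] = [];
  for (const create of factories) {
    try {
      return create();
    } catch (err) {
      errors.push(err); // remember why this candidate was skipped
    }
  }
  throw new Error(`All ${errors.length} candidates failed`);
}

// Usage with the imports from the snippet above:
// const model = firstAvailable([
//   () => litert.languageModel('qwen3-0.6B'),
//   () => webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
//   () => transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX'),
//   () => wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M'),
// ]);
```

Keeping the candidates in an array makes the order explicit and lets you add or remove a tier without re-nesting.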

Recommended ONNX Models

Tiny (<500MB download)

Model	Size	Context	Use Case
`onnx-community/granite-4.0-350m-ONNX-web`	~120MB	4K	IBM Granite 4.0 350M, 12 languages, fastest download

Small (500MB–1GB)

Model	Size	Context	Use Case
`onnx-community/Qwen3.5-0.8B-ONNX`	~500MB	32K	Vision — best sub-1B multimodal, WebGPU recommended
`onnx-community/Qwen3-0.6B-ONNX`	~570MB	4K	Smallest Qwen3 text model, fast
`onnx-community/granite-4.0-1b-ONNX-web`	~350MB	4K	IBM Granite 4.0 1B, 12 languages
`onnx-community/Llama-3.2-1B-Instruct-ONNX`	~380MB	8K	Meta Llama 3.2 1B, q4f16, good general quality
`onnx-community/TinyLlama-1.1B-Chat-v1.0-ONNX`	~350MB	2K	Small, fast, q4f16, no login required
`onnx-community/Qwen2.5-Coder-1.5B-Instruct`	~450MB	4K	Code-specialized Qwen2.5, q4f16, great for programming
`onnx-community/DeepSeek-R1-Distill-Qwen-1.5B-ONNX`	~500MB	4K	DeepSeek-R1 distilled, q4f16, strong reasoning

Medium (1–2.5GB)

Model	Size	Context	Use Case
`onnx-community/Qwen3.5-2B-ONNX`	~1.5GB	32K	Vision — high quality 2B multimodal, needs 4GB+ RAM
`onnx-community/Llama-3.2-3B-Instruct-ONNX`	~900MB	8K	Meta Llama 3.2 3B, q4f16, strong quality
`onnx-community/Phi-4-mini-instruct-web-q4f16`	~2.3GB	4K	Microsoft Phi-4 Mini, strong reasoning/coding, WebGPU recommended
`microsoft/Phi-3-mini-4k-instruct-onnx-web`	~1.2GB	4K	Microsoft Phi-3, q4, strong reasoning
`onnx-community/Qwen3-4B-ONNX`	~1.2GB	4K	Qwen3 4B text, q4f16, needs 4GB+ RAM

Large (2.5GB+)

Model	Size	Context	Use Case
`onnx-community/Qwen3.5-4B-ONNX`	~2.5GB	32K	Vision — best Qwen3.5 for browser, multimodal, needs 8GB+ RAM
`onnx-community/gemma-4-E2B-it-ONNX`	~1.5GB	128K	Vision — Google Gemma 4 E2B multimodal, 128K context, needs 6GB+ VRAM
`onnx-community/gemma-4-E4B-it-ONNX`	~3GB	128K	Vision — Google Gemma 4 E4B multimodal, 128K context, needs 8GB+ VRAM
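If you choose a model at runtime, a small lookup over the catalog is one option. This is only a sketch: `CatalogEntry` and `pickModel` are hypothetical names, the sizes are the approximate download sizes from the tables, only a few entries are copied, and it assumes that the largest model fitting the budget is the one you want.

```typescript
// A few entries copied from the tables; sizeMB is the approximate download size.
interface CatalogEntry {
  id: string;
  sizeMB: number;
  vision: boolean;
}

const catalog: CatalogEntry[] = [
  { id: 'onnx-community/granite-4.0-350m-ONNX-web', sizeMB: 120, vision: false },
  { id: 'onnx-community/Qwen3.5-0.8B-ONNX', sizeMB: 500, vision: true },
  { id: 'onnx-community/Llama-3.2-3B-Instruct-ONNX', sizeMB: 900, vision: false },
  { id: 'onnx-community/Qwen3.5-4B-ONNX', sizeMB: 2500, vision: true },
];

// Hypothetical helper: the largest entry within the download budget,
// optionally restricted to vision-capable models.
function pickModel(budgetMB: number, needVision = false): CatalogEntry | undefined {
  return catalog
    .filter((m) => m.sizeMB <= budgetMB && (!needVision || m.vision))
    .sort((a, b) => b.sizeMB - a.sizeMB)[0];
}
```

The returned `id` can be passed to `transformers.languageModel()`; an `undefined` result means nothing fits the budget.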

Gemma 4 via Transformers.js reaches about 2 tok/s on WebGPU, roughly 7-8x slower than the LiteRT path (~14-16 tok/s). For text-only performance, use @localmode/litert. The ONNX path remains valuable for vision/multimodal support and as a WASM fallback when WebGPU is unavailable. See Gemma 4: LiteRT vs ONNX for a detailed comparison.
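To see what these rates mean in practice, here is a quick back-of-the-envelope calculation for a 200-token reply. It is a sketch that ignores model load time and prompt processing, and `generationSeconds` is a hypothetical helper.

```typescript
// Seconds needed to generate `tokens` at a steady `tokensPerSecond`.
function generationSeconds(tokens: number, tokensPerSecond: number): number {
  return tokens / tokensPerSecond;
}

const onnxSeconds = generationSeconds(200, 2);        // 100 s on the ONNX path
const litertSlowSeconds = generationSeconds(200, 14); // about 14.3 s
const litertFastSeconds = generationSeconds(200, 16); // 12.5 s
```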

Quick Start

import { transformers } from '@localmode/transformers';
import { generateText } from '@localmode/core';

const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');

const { text } = await generateText({
  model,
  prompt: 'Explain quantum computing in simple terms',
  maxTokens: 200,
});

Streaming

import { transformers } from '@localmode/transformers';
import { streamText } from '@localmode/core';

const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');

const result = await streamText({
  model,
  prompt: 'Write a short story about a robot learning to paint',
  maxTokens: 500,
});

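// process.stdout does not exist in the browser, so collect the chunks instead
let output = '';
for await (const chunk of result.stream) {
  output += chunk.text; // e.g. mirror `output` into the page as it grows
}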

Device Selection

The provider automatically selects WebGPU when available, falling back to WASM:

// Automatic: WebGPU if available, WASM fallback
const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');

// Force WASM (for testing or compatibility)
const wasmModel = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX', {
  device: 'wasm',
});

// Force WebGPU (will error if unavailable)
const gpuModel = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX', {
  device: 'webgpu',
});
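If you want to decide up front instead of relying on the automatic choice, for example to show a "running on CPU" notice, you can check for WebGPU yourself. This is a minimal sketch: `detectDevice` is a hypothetical helper, and it only tests whether `navigator.gpu` exists. A browser can expose `navigator.gpu` and still fail to provide an adapter, so treat the result as a hint.

```typescript
type Device = 'webgpu' | 'wasm';

// Hypothetical helper: `navigator.gpu` is only defined in browsers that expose WebGPU.
function detectDevice(nav: { gpu?: unknown } | undefined): Device {
  return nav !== undefined && nav.gpu !== undefined ? 'webgpu' : 'wasm';
}

// In the browser: const device = detectDevice(globalThis.navigator);
```

The result can be passed as the `device` option shown above.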

Custom Settings

import { createTransformers } from '@localmode/transformers';

const myTransformers = createTransformers({
  device: 'webgpu',
  onProgress: (p) => console.log(`Loading: ${p.progress}%`),
});

const model = myTransformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX', {
  contextLength: 32768,
  maxTokens: 1024,
  temperature: 0.7,
  topP: 0.95,
  systemPrompt: 'You are a helpful assistant.',
});
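When setting `contextLength` and `maxTokens` together, it helps to know how much room is left for the prompt. The sketch below assumes the context window is shared between input and generated tokens, as is typical for decoder-only models. This page does not confirm how the provider enforces it, and `promptBudget` is a hypothetical helper.

```typescript
// Tokens left for the prompt (system prompt plus user input) once room for
// the completion is reserved. Assumes one window shared by input and output.
function promptBudget(contextLength: number, maxTokens: number): number {
  return Math.max(0, contextLength - maxTokens);
}

const budget = promptBudget(32768, 1024); // 31744 tokens for the prompt
```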

Vision (Image Input)

Qwen3.5 and Gemma 4 models support image input via their built-in vision encoder. Check model.supportsVision for feature detection:

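import { transformers } from '@localmode/transformers';
import { streamText } from '@localmode/core';

const model = transformers.languageModel('onnx-community/Qwen3.5-0.8B-ONNX');

if (model.supportsVision) {
  const result = await streamText({
    model,
    prompt: '',
    messages: [{
      role: 'user',
      content: [
        { type: 'text', text: 'Describe this image.' },
        // base64Data: the image bytes as a base64 string, prepared by the caller
        { type: 'image', data: base64Data, mimeType: 'image/png' },
      ],
    }],
  });
}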
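The snippet above expects the image as a base64 string. One way to produce it from raw bytes is sketched below. `bytesToBase64` is a hypothetical helper, and the sketch assumes `data` takes bare base64 without a `data:` URL prefix, which the separate `mimeType` field suggests but this page does not state.

```typescript
// Encodes raw bytes as a base64 string (no "data:" URL prefix).
function bytesToBase64(bytes: Uint8Array): string {
  let binary = '';
  bytes.forEach((b) => {
    binary += String.fromCharCode(b); // one character per byte
  });
  return btoa(binary);
}

// In the browser, with `file` coming from an <input type="file">:
// const base64Data = bytesToBase64(new Uint8Array(await file.arrayBuffer()));
```

Building the string byte by byte is fine for typical photos; for very large images, `FileReader.readAsDataURL()` is an alternative, though you would then need to strip the `data:` prefix yourself.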

For full multimodal API reference including ContentPart types and content utilities, see the Core Generation guide.

Using Any ONNX Model

The curated catalog is a starting point. You can use any HuggingFace ONNX model:

// Any ONNX model from HuggingFace Hub
const model = transformers.languageModel('your-org/custom-model-ONNX');

Not all ONNX models work in the browser. The model must support the Transformers.js text-generation pipeline. Stick to the recommended models above for best results.

Showcase Apps

The LLM Chat showcase app demonstrates ONNX models alongside the WebLLM, wllama, and LiteRT backends. Select a Qwen3.5 or Gemma 4 model to try vision input with image uploads.

App	Description	Links
LLM Chat	Chat with local ONNX language models, including vision input on Qwen3.5	Demo · Source
