# WebLLM

## Overview

`@localmode/webllm` is the WebLLM provider for local LLM inference in the browser. It runs large language models locally using WebGPU, with 4-bit quantized models for efficient inference.
## Features
- 🚀 WebGPU Acceleration — Native GPU performance in the browser
- 🔒 Private — Models run entirely on-device
- 📦 Cached — Models stored in browser cache after first download
- ⚡ Streaming — Real-time token generation
## Installation

```bash
pnpm install @localmode/webllm @localmode/core
```

```bash
npm install @localmode/webllm @localmode/core
```

```bash
yarn add @localmode/webllm @localmode/core
```

## Quick Start
```ts
import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');

const stream = await streamText({
  model,
  prompt: 'Explain quantum computing in simple terms.',
});

for await (const chunk of stream) {
  process.stdout.write(chunk.text);
}
```

## Available Models
### Model Selection Guide

- Testing: `Llama-3.2-1B-Instruct` - fastest to download and run
- Production: `Llama-3.2-3B-Instruct` or `Phi-3.5-mini` - best quality (see the selection sketch after this list)
- Code/Reasoning: `Phi-3.5-mini` - specialized for these tasks
- Multilingual: `Qwen2.5-1.5B-Instruct` - 100+ languages
- Low Memory: `SmolLM2-360M-Instruct` - ~250MB
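As a minimal sketch of applying this guide at runtime, you might prefer a larger model and fall back to the 1B model when it is unavailable. This assumes the listed IDs are present in the `WEBLLM_MODELS` registry described below:

```ts
import { webllm, WEBLLM_MODELS, type WebLLMModelId } from '@localmode/webllm';

// Preference order: production-quality 3B first, fast 1B model as the fallback.
const preferred = [
  'Llama-3.2-3B-Instruct-q4f16_1-MLC',
  'Llama-3.2-1B-Instruct-q4f16_1-MLC',
];

// Pick the first preferred ID that actually exists in the registry.
const modelId = preferred.find((id) => id in WEBLLM_MODELS) as WebLLMModelId | undefined;

const model = modelId ? webllm.languageModel(modelId) : undefined;
```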
## Text Generation

### Streaming
```ts
import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const stream = await streamText({
  model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
  prompt: 'Write a haiku about programming.',
});

let fullText = '';
for await (const chunk of stream) {
  fullText += chunk.text;
  // Update UI with each chunk
}

// Or get the full text at once
const text = await stream.text;
```

### Non-Streaming
```ts
import { generateText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const { text, usage } = await generateText({
  model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
  prompt: 'What is the capital of France?',
});

console.log(text);
console.log('Tokens used:', usage.totalTokens);
```

## Configuration
### Model Options

```ts
import { webllm } from '@localmode/webllm';

const model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC', {
  systemPrompt: 'You are a helpful coding assistant.',
  temperature: 0.7,
  maxTokens: 1024,
  topP: 0.9,
});
```

| Prop | Type |
|---|---|
| `systemPrompt` | `string` |
| `temperature` | `number` |
| `maxTokens` | `number` |
| `topP` | `number` |
### Custom Provider
```ts
import { createWebLLM } from '@localmode/webllm';

const myWebLLM = createWebLLM({
  onProgress: (progress) => {
    console.log(`Loading: ${(progress.progress * 100).toFixed(1)}%`);
    console.log(`Status: ${progress.text}`);
  },
});

const model = myWebLLM.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
```

## Model Preloading
Preload models during app initialization:
```ts
import { preloadModel, isModelCached } from '@localmode/webllm';

// Check if the model is already cached
if (!(await isModelCached('Llama-3.2-1B-Instruct-q4f16_1-MLC'))) {
  // Show loading UI
  await preloadModel('Llama-3.2-1B-Instruct-q4f16_1-MLC', {
    onProgress: (progress) => {
      updateLoadingBar(progress.progress * 100);
    },
  });
}

// Model is ready for instant inference
```

## Model Management
### Available Models Registry
Access model metadata programmatically:
```ts
import { WEBLLM_MODELS, type WebLLMModelId } from '@localmode/webllm';

// Get all available model IDs
const modelIds = Object.keys(WEBLLM_MODELS) as WebLLMModelId[];

// Access model info
const llama = WEBLLM_MODELS['Llama-3.2-1B-Instruct-q4f16_1-MLC'];
console.log(llama.name); // 'Llama 3.2 1B Instruct'
console.log(llama.contextLength); // 4096
console.log(llama.size); // '~700MB'
console.log(llama.sizeBytes); // 734003200
console.log(llama.description); // 'Fast, lightweight model...'
```

### Model Categorization
Categorize models by size for UI display:
```ts
import { getModelCategory, WEBLLM_MODELS, type WebLLMModelId } from '@localmode/webllm';

// Get category based on model size
const modelId: WebLLMModelId = 'Llama-3.2-1B-Instruct-q4f16_1-MLC';
const sizeBytes = WEBLLM_MODELS[modelId].sizeBytes;
const category = getModelCategory(sizeBytes);
console.log(category); // 'small' | 'medium' | 'large'

// Use for UI grouping
function getModelsByCategory() {
  const categories: Record<'small' | 'medium' | 'large', Array<Record<string, unknown>>> = {
    small: [],
    medium: [],
    large: [],
  };
  for (const [id, info] of Object.entries(WEBLLM_MODELS)) {
    const cat = getModelCategory(info.sizeBytes);
    categories[cat].push({ id, ...info });
  }
  return categories;
}
```

### Delete Cached Models
Remove models from browser cache to free up storage:
```ts
import { deleteModelCache, isModelCached } from '@localmode/webllm';

// Delete a specific model's cache
await deleteModelCache('Llama-3.2-1B-Instruct-q4f16_1-MLC');

// Verify deletion
const stillCached = await isModelCached('Llama-3.2-1B-Instruct-q4f16_1-MLC');
console.log(stillCached); // false
```

### Storage Management
LLM models can be large (700MB to 4GB). Use `deleteModelCache()` to let users free up storage when they no longer need a model.
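For example, a settings screen could list which models are currently cached (with their approximate sizes) and offer a "clear all" action. This is a sketch built only from the registry and cache helpers shown above; the UI wiring is up to you:

```ts
import {
  WEBLLM_MODELS,
  type WebLLMModelId,
  isModelCached,
  deleteModelCache,
} from '@localmode/webllm';

// List every cached model with its display name and approximate download size.
async function listCachedModels() {
  const ids = Object.keys(WEBLLM_MODELS) as WebLLMModelId[];
  const cached: { id: WebLLMModelId; name: string; size: string }[] = [];
  for (const id of ids) {
    if (await isModelCached(id)) {
      const { name, size } = WEBLLM_MODELS[id];
      cached.push({ id, name, size });
    }
  }
  return cached;
}

// Delete every cached model, e.g. from a "Clear all models" button.
async function clearAllModelCaches() {
  for (const { id } of await listCachedModels()) {
    await deleteModelCache(id);
  }
}
```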
### Type-Safe Model IDs

Use the `WebLLMModelId` type for type-safe model selection:
```ts
import { webllm, type WebLLMModelId } from '@localmode/webllm';

// Type-safe function that only accepts valid model IDs
function selectModel(modelId: WebLLMModelId) {
  return webllm.languageModel(modelId);
}

// ✅ Valid
selectModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');

// ❌ TypeScript error: invalid model ID
selectModel('invalid-model-name');
```

## Chat Application
```ts
import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

interface Message {
  role: 'user' | 'assistant';
  content: string;
}

async function chat(messages: Message[], userMessage: string) {
  const model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC', {
    systemPrompt: 'You are a helpful assistant.',
  });

  // Build a conversation prompt from the history plus the new user turn
  const prompt = messages
    .map((m) => `${m.role}: ${m.content}`)
    .concat([`user: ${userMessage}`, 'assistant:'])
    .join('\n');

  const stream = await streamText({
    model,
    prompt,
    stopSequences: ['user:', '\n\n'],
  });

  let response = '';
  for await (const chunk of stream) {
    response += chunk.text;
    // Update UI
  }

  return response;
}
```

## RAG Integration
Combine with retrieval for document-grounded chat:
```ts
import { semanticSearch, rerank, streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

// Assumes `VectorDB`, `embeddingModel`, and `rerankerModel` are set up elsewhere in your app.
async function ragChat(query: string, db: VectorDB) {
  // 1. Retrieve context
  const results = await semanticSearch({
    db,
    model: embeddingModel,
    query,
    k: 10,
  });

  // 2. Rerank for relevance
  const reranked = await rerank({
    model: rerankerModel,
    query,
    documents: results.map((r) => r.metadata.text as string),
    topK: 3,
  });

  const context = reranked.map((r) => r.document).join('\n\n---\n\n');

  // 3. Generate with context
  const llm = webllm.languageModel('Llama-3.2-3B-Instruct-q4f16_1-MLC');
  const stream = await streamText({
    model: llm,
    prompt: `You are a helpful assistant. Answer based only on the provided context.
If the answer is not in the context, say "I don't have that information."

Context:
${context}

Question: ${query}

Answer:`,
  });

  return stream;
}
```

## Requirements
### WebGPU Required

WebLLM requires WebGPU support. Check availability:
```ts
import { isWebGPUSupported } from '@localmode/core';

if (!isWebGPUSupported()) {
  console.warn('WebGPU not available. LLM features disabled.');
}
```

### Browser Support
| Browser | Support |
|---|---|
| Chrome 113+ | ✅ |
| Edge 113+ | ✅ |
| Firefox | ❌ (Nightly only) |
| Safari 18+ | ✅ |
| iOS Safari | ✅ (iOS 26+) |
### Hardware Requirements
- GPU: Any modern GPU with WebGPU support
- VRAM: Depends on model (1-3GB for 1-3B models)
- RAM: 4GB minimum, 8GB+ recommended (a rough capability check is sketched after this list)
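Because device capabilities vary widely, a coarse pre-flight check can help decide whether to offer LLM features at all and which model size to default to. This is only a sketch: it combines `isWebGPUSupported()` from `@localmode/core` with `navigator.deviceMemory`, a Chromium-only hint reported in GB, and the 8GB threshold is an assumption based on the recommendation above:

```ts
import { isWebGPUSupported } from '@localmode/core';

type ModelTier = 'none' | 'small' | 'large';

// Decide which class of model (if any) to offer on this device.
function recommendedModelTier(): ModelTier {
  if (!isWebGPUSupported()) return 'none';

  // navigator.deviceMemory is Chromium-only; assume 4GB when it is unavailable.
  const deviceMemory =
    (navigator as Navigator & { deviceMemory?: number }).deviceMemory ?? 4;

  return deviceMemory >= 8 ? 'large' : 'small';
}
```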
## Best Practices

### WebLLM Tips
- Preload models - Load during app init for instant inference (see the combined sketch after this list)
- Start small - Use 1B models for testing, larger for production
- Stream responses - Better UX than waiting for complete response
- Handle errors - GPU errors, OOM, etc. can occur
- Check capabilities - Verify WebGPU before showing LLM features
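Putting the first and last tips together, an app-initialization helper might look like the following. It is a sketch that only uses APIs shown earlier on this page (`isWebGPUSupported`, `isModelCached`, `preloadModel`, and `webllm.languageModel`); the progress handling and the chosen model ID are up to your app:

```ts
import { isWebGPUSupported } from '@localmode/core';
import { webllm, preloadModel, isModelCached } from '@localmode/webllm';

const MODEL_ID = 'Llama-3.2-1B-Instruct-q4f16_1-MLC';

// Run once during app startup: verify WebGPU, warm the cache, return a ready model (or null).
export async function initLLM() {
  if (!isWebGPUSupported()) {
    return null; // Hide LLM features on unsupported browsers
  }

  if (!(await isModelCached(MODEL_ID))) {
    await preloadModel(MODEL_ID, {
      onProgress: (progress) => {
        console.log(`Loading model: ${(progress.progress * 100).toFixed(0)}%`);
      },
    });
  }

  return webllm.languageModel(MODEL_ID);
}
```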
## Error Handling
```ts
import { streamText, GenerationError } from '@localmode/core';

try {
  const stream = await streamText({
    model,
    prompt: 'Hello',
  });

  for await (const chunk of stream) {
    // ...
  }
} catch (error) {
  if (error instanceof GenerationError) {
    if (error.code === 'WEBGPU_NOT_SUPPORTED') {
      console.error('WebGPU not available');
    } else if (error.code === 'MODEL_LOAD_FAILED') {
      console.error('Failed to load model');
    } else if (error.code === 'OUT_OF_MEMORY') {
      console.error('Not enough GPU memory');
    }
  }
}
```