WebLLM

Overview

WebLLM provider for local LLM inference in the browser.

@localmode/webllm

Run large language models locally in the browser using WebGPU. Models are 4-bit quantized (the q4f16_1 suffix in the model IDs) for efficient on-device inference.

Features

  • 🚀 WebGPU Acceleration — Native GPU performance in the browser
  • 🔒 Private — Models run entirely on-device
  • 📦 Cached — Models stored in browser cache after first download
  • ⚡ Streaming — Real-time token generation

Installation

# pnpm
pnpm add @localmode/webllm @localmode/core

# npm
npm install @localmode/webllm @localmode/core

# yarn
yarn add @localmode/webllm @localmode/core

Quick Start

import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');

const stream = await streamText({
  model,
  prompt: 'Explain quantum computing in simple terms.',
});

for await (const chunk of stream) {
  console.log(chunk.text); // append each chunk to your UI as it arrives
}

Available Models

Model IDs follow the MLC naming convention (for example, Llama-3.2-1B-Instruct-q4f16_1-MLC). The full list, including sizes and context lengths, is available at runtime through the WEBLLM_MODELS registry described under Model Management below.

Model Selection Guide

  • Testing: Llama-3.2-1B-Instruct - fastest to download and run
  • Production: Llama-3.2-3B-Instruct or Phi-3.5-mini - best quality
  • Code/Reasoning: Phi-3.5-mini - specialized for these tasks
  • Multilingual: Qwen2.5-1.5B-Instruct - 100+ languages
  • Low Memory: SmolLM2-360M-Instruct - ~250MB
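
Following the guide above, you can pick a sensible default at runtime. A minimal sketch, assuming navigator.deviceMemory (a non-standard, Chromium-only hint) and using only the two Llama model IDs shown on this page:

import { webllm, type WebLLMModelId } from '@localmode/webllm';

function pickDefaultModel(): WebLLMModelId {
  // deviceMemory is a coarse, Chromium-only hint (in GB); fall back to a conservative default elsewhere.
  const memoryGB = (navigator as Navigator & { deviceMemory?: number }).deviceMemory ?? 4;

  return memoryGB >= 8
    ? 'Llama-3.2-3B-Instruct-q4f16_1-MLC' // "Production" tier from the guide
    : 'Llama-3.2-1B-Instruct-q4f16_1-MLC'; // "Testing" / low-spec tier
}

const model = webllm.languageModel(pickDefaultModel());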

Text Generation

Streaming

import { streamText } from '@localmode/core';

const stream = await streamText({
  model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
  prompt: 'Write a haiku about programming.',
});

let fullText = '';
for await (const chunk of stream) {
  fullText += chunk.text;
  // Update UI with each chunk
}

// Or get full text at once
const text = await stream.text;
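
In a browser UI you will usually append each chunk to the page as it streams in. A minimal DOM sketch (the #output element is an assumption about your markup):

import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const output = document.querySelector('#output')!; // assumed container element

const stream = await streamText({
  model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
  prompt: 'Write a haiku about programming.',
});

for await (const chunk of stream) {
  output.textContent += chunk.text; // tokens appear as they are generated
}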

Non-Streaming

import { generateText } from '@localmode/core';

const { text, usage } = await generateText({
  model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
  prompt: 'What is the capital of France?',
});

console.log(text);
console.log('Tokens used:', usage.totalTokens);

Configuration

Model Options

const model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC', {
  systemPrompt: 'You are a helpful coding assistant.',
  temperature: 0.7,
  maxTokens: 1024,
  topP: 0.9,
});

Prop         | Type
systemPrompt | string
temperature  | number
maxTokens    | number
topP         | number
Custom Provider

import { createWebLLM } from '@localmode/webllm';

const myWebLLM = createWebLLM({
  onProgress: (progress) => {
    console.log(`Loading: ${(progress.progress * 100).toFixed(1)}%`);
    console.log(`Status: ${progress.text}`);
  },
});

const model = myWebLLM.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');

Model Preloading

Preload models during app initialization:

import { preloadModel, isModelCached } from '@localmode/webllm';

// Check if already cached
if (!(await isModelCached('Llama-3.2-1B-Instruct-q4f16_1-MLC'))) {
  // Show loading UI
  await preloadModel('Llama-3.2-1B-Instruct-q4f16_1-MLC', {
    onProgress: (progress) => {
      updateLoadingBar(progress.progress * 100); // updateLoadingBar: your app's progress UI helper
    },
  });
}

// Model is ready for instant inference

Model Management

Available Models Registry

Access model metadata programmatically:

import { WEBLLM_MODELS, type WebLLMModelId } from '@localmode/webllm';

// Get all available models
const modelIds = Object.keys(WEBLLM_MODELS) as WebLLMModelId[];

// Access model info
const llama = WEBLLM_MODELS['Llama-3.2-1B-Instruct-q4f16_1-MLC'];
console.log(llama.name);          // 'Llama 3.2 1B Instruct'
console.log(llama.contextLength); // 4096
console.log(llama.size);          // '~700MB'
console.log(llama.sizeBytes);     // 734003200
console.log(llama.description);   // 'Fast, lightweight model...'
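
One common use of the registry is driving a model picker in the UI. A small sketch, assuming a <select id="model-picker"> element exists in your markup:

import { WEBLLM_MODELS } from '@localmode/webllm';

const picker = document.querySelector<HTMLSelectElement>('#model-picker');

for (const [id, info] of Object.entries(WEBLLM_MODELS)) {
  const option = document.createElement('option');
  option.value = id;
  option.textContent = `${info.name} (${info.size})`; // e.g. 'Llama 3.2 1B Instruct (~700MB)'
  picker?.append(option);
}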

Model Categorization

Categorize models by size for UI display:

import { getModelCategory, WEBLLM_MODELS, type WebLLMModelId } from '@localmode/webllm';

// Get category based on model size
const modelId: WebLLMModelId = 'Llama-3.2-1B-Instruct-q4f16_1-MLC';
const sizeBytes = WEBLLM_MODELS[modelId].sizeBytes;
const category = getModelCategory(sizeBytes);

console.log(category); // 'small' | 'medium' | 'large'

// Use for UI grouping
type ModelInfo = (typeof WEBLLM_MODELS)[WebLLMModelId];

function getModelsByCategory() {
  // Explicitly typed so TypeScript doesn't infer never[] for the empty arrays
  const categories: Record<'small' | 'medium' | 'large', Array<ModelInfo & { id: string }>> = {
    small: [],
    medium: [],
    large: [],
  };

  for (const [id, info] of Object.entries(WEBLLM_MODELS)) {
    const cat = getModelCategory(info.sizeBytes);
    categories[cat].push({ id, ...info });
  }

  return categories;
}

Delete Cached Models

Remove models from browser cache to free up storage:

import { deleteModelCache, isModelCached } from '@localmode/webllm';

// Delete a specific model's cache
await deleteModelCache('Llama-3.2-1B-Instruct-q4f16_1-MLC');

// Verify deletion
const stillCached = await isModelCached('Llama-3.2-1B-Instruct-q4f16_1-MLC');
console.log(stillCached); // false

Storage Management

Models are large (700MB-4GB each). Use deleteModelCache() to let users free up storage when they no longer need a model.
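
A sketch of a simple storage-management helper built from the APIs above; navigator.storage.estimate() is a standard browser API but may be unavailable in some contexts:

import { WEBLLM_MODELS, isModelCached, deleteModelCache, type WebLLMModelId } from '@localmode/webllm';

// List the models currently occupying browser storage.
async function listCachedModels() {
  const cached: { id: WebLLMModelId; size: string }[] = [];

  for (const id of Object.keys(WEBLLM_MODELS) as WebLLMModelId[]) {
    if (await isModelCached(id)) {
      cached.push({ id, size: WEBLLM_MODELS[id].size });
    }
  }

  return cached;
}

// Delete one model, then report remaining storage usage.
async function freeUpSpace(id: WebLLMModelId) {
  await deleteModelCache(id);

  if ('storage' in navigator && navigator.storage.estimate) {
    const { usage, quota } = await navigator.storage.estimate();
    console.log(`Storage in use: ${usage} of ${quota} bytes`);
  }
}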

Type-Safe Model IDs

Use the WebLLMModelId type for type-safe model selection:

import type { WebLLMModelId } from '@localmode/webllm';

// Type-safe function that only accepts valid model IDs
function selectModel(modelId: WebLLMModelId) {
  return webllm.languageModel(modelId);
}

// ✅ Valid
selectModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');

// ❌ TypeScript error: invalid model ID
selectModel('invalid-model-name');

Chat Application

import { streamText } from '@localmode/core';

interface Message {
  role: 'user' | 'assistant';
  content: string;
}

async function chat(messages: Message[], userMessage: string) {
  const model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC', {
    systemPrompt: 'You are a helpful assistant.',
  });

  // Build conversation prompt
  const prompt = messages
    .map((m) => `${m.role}: ${m.content}`)
    .concat([`user: ${userMessage}`, 'assistant:'])
    .join('\n');

  const stream = await streamText({
    model,
    prompt,
    stopSequences: ['user:', '\n\n'],
  });

  let response = '';
  for await (const chunk of stream) {
    response += chunk.text;
    // Update UI
  }

  return response;
}
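
A usage sketch that keeps the conversation history across turns (assumes the chat() helper and Message type above):

const history: Message[] = [];

async function sendMessage(userMessage: string) {
  const reply = await chat(history, userMessage);

  // Persist the turn so the next prompt includes the full conversation.
  history.push({ role: 'user', content: userMessage });
  history.push({ role: 'assistant', content: reply });

  return reply;
}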

RAG Integration

Combine with retrieval for document-grounded chat:

import { semanticSearch, rerank, streamText } from '@localmode/core';

// `db`, `embeddingModel`, and `rerankerModel` come from the vector store,
// embedding, and reranker providers (see their respective docs).
async function ragChat(query: string, db: VectorDB) {
  // 1. Retrieve context
  const results = await semanticSearch({
    db,
    model: embeddingModel,
    query,
    k: 10,
  });

  // 2. Rerank for relevance
  const reranked = await rerank({
    model: rerankerModel,
    query,
    documents: results.map((r) => r.metadata.text as string),
    topK: 3,
  });

  const context = reranked.map((r) => r.document).join('\n\n---\n\n');

  // 3. Generate with context
  const llm = webllm.languageModel('Llama-3.2-3B-Instruct-q4f16_1-MLC');

  const stream = await streamText({
    model: llm,
    prompt: `You are a helpful assistant. Answer based only on the provided context.
If the answer is not in the context, say "I don't have that information."

Context:
${context}

Question: ${query}

Answer:`,
  });

  return stream;
}
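
Usage sketch (the example question is illustrative; db and the models are created elsewhere, as noted above):

const stream = await ragChat('What does the warranty cover?', db);

let answer = '';
for await (const chunk of stream) {
  answer += chunk.text; // stream into your UI as it arrives
}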

Requirements

WebGPU Required

WebLLM requires WebGPU support. Check availability:

import { isWebGPUSupported } from '@localmode/core';

if (!isWebGPUSupported()) {
  console.warn('WebGPU not available. LLM features disabled.');
}

Browser Support

Browser    | Support
Chrome     | ✅ 113+
Edge       | ✅ 113+
Firefox    | ❌ (Nightly only)
Safari     | ✅ 18+
iOS Safari | ✅ (iOS 26+)

Hardware Requirements

  • GPU: Any modern GPU with WebGPU support
  • VRAM: Depends on model (1-3GB for 1-3B models)
  • RAM: 4GB minimum, 8GB+ recommended
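
isWebGPUSupported() (shown above) covers the common case; if you want more detail about the GPU itself, you can probe the WebGPU adapter directly. A minimal sketch using only standard WebGPU APIs (the cast avoids requiring @webgpu/types):

// Probe the WebGPU adapter and report rough buffer limits (a proxy for how large a model can fit).
async function describeGPU() {
  const gpu = (navigator as any).gpu; // cast avoids needing @webgpu/types
  if (!gpu) return null; // WebGPU not exposed by this browser

  const adapter = await gpu.requestAdapter();
  if (!adapter) return null; // exposed, but no usable adapter

  return {
    maxBufferSize: adapter.limits.maxBufferSize,
    maxStorageBufferBindingSize: adapter.limits.maxStorageBufferBindingSize,
  };
}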

Best Practices

WebLLM Tips

  1. Preload models - Load during app init for instant inference
  2. Start small - Use 1B models for testing, larger for production
  3. Stream responses - Better UX than waiting for complete response
  4. Handle errors - GPU errors, OOM, etc. can occur
  5. Check capabilities - Verify WebGPU before showing LLM features
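
A startup sketch that ties tips 1, 4, and 5 together, using only APIs shown on this page:

import { isWebGPUSupported } from '@localmode/core';
import { isModelCached, preloadModel } from '@localmode/webllm';

const MODEL_ID = 'Llama-3.2-1B-Instruct-q4f16_1-MLC';

// Returns true when the model is ready; false means hide or disable LLM features.
export async function initLLM(): Promise<boolean> {
  // Tip 5: verify WebGPU before exposing any LLM UI.
  if (!isWebGPUSupported()) return false;

  try {
    // Tip 1: preload during app init so the first inference is instant.
    if (!(await isModelCached(MODEL_ID))) {
      await preloadModel(MODEL_ID, {
        onProgress: (p) => console.log(`Loading model: ${(p.progress * 100).toFixed(0)}%`),
      });
    }
    return true;
  } catch {
    // Tip 4: treat download/initialization failures as a disabled feature, not a crash.
    return false;
  }
}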

Error Handling

import { streamText, GenerationError } from '@localmode/core';

try {
  const stream = await streamText({
    model,
    prompt: 'Hello',
  });

  for await (const chunk of stream) {
    // ...
  }
} catch (error) {
  if (error instanceof GenerationError) {
    if (error.code === 'WEBGPU_NOT_SUPPORTED') {
      console.error('WebGPU not available');
    } else if (error.code === 'MODEL_LOAD_FAILED') {
      console.error('Failed to load model');
    } else if (error.code === 'OUT_OF_MEMORY') {
      console.error('Not enough GPU memory');
    }
  }
}
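
One recovery pattern, sketched with the model IDs and error codes from this page (the retry logic itself is an assumption, not part of the library):

import { generateText, GenerationError } from '@localmode/core';
import { webllm } from '@localmode/webllm';

async function generateWithFallback(prompt: string) {
  try {
    return await generateText({
      model: webllm.languageModel('Llama-3.2-3B-Instruct-q4f16_1-MLC'),
      prompt,
    });
  } catch (error) {
    // If the 3B model exhausts GPU memory, retry once with the lighter 1B model.
    if (error instanceof GenerationError && error.code === 'OUT_OF_MEMORY') {
      return await generateText({
        model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
        prompt,
      });
    }
    throw error;
  }
}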
