# WebLLM

## Overview

`@localmode/webllm` is the WebLLM provider for local LLM inference in the browser. It runs large language models locally using WebGPU, with 4-bit quantized models for efficient inference.
## Features
- 🚀 WebGPU Acceleration — Native GPU performance in the browser
- 🔒 Private — Models run entirely on-device
- 📦 Cached — Models stored in browser cache after first download
- ⚡ Streaming — Real-time token generation
## Installation

```bash
pnpm install @localmode/webllm @localmode/core
```

```bash
npm install @localmode/webllm @localmode/core
```

```bash
yarn add @localmode/webllm @localmode/core
```

## Quick Start
```ts
import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');

const stream = await streamText({
  model,
  prompt: 'Explain quantum computing in simple terms.',
});

for await (const chunk of stream) {
  process.stdout.write(chunk.text);
}
```

## Available Models
### Model Selection Guide

- Testing: `Llama-3.2-1B-Instruct` - fastest to download and run
- Production: `Llama-3.2-3B-Instruct` or `Phi-3.5-mini` - best quality (see the selection sketch after this list)
- Code/Reasoning: `Phi-3.5-mini` - specialized for these tasks
- Multilingual: `Qwen2.5-1.5B-Instruct` - 100+ languages
- Low Memory: `SmolLM2-360M-Instruct` - ~250MB
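As a minimal sketch of applying this guide at runtime, you might prefer a larger model and fall back to the 1B model when it is unavailable. This assumes the listed IDs are present in the `WEBLLM_MODELS` registry described below:

```ts
import { webllm, WEBLLM_MODELS, type WebLLMModelId } from '@localmode/webllm';

// Preference order: production-quality 3B first, fast 1B model as the fallback.
const preferred = [
  'Llama-3.2-3B-Instruct-q4f16_1-MLC',
  'Llama-3.2-1B-Instruct-q4f16_1-MLC',
];

// Pick the first preferred ID that actually exists in the registry.
const modelId = preferred.find((id) => id in WEBLLM_MODELS) as WebLLMModelId | undefined;

const model = modelId ? webllm.languageModel(modelId) : undefined;
```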
## Text Generation

### Streaming
```ts
import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const stream = await streamText({
  model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
  prompt: 'Write a haiku about programming.',
});

let fullText = '';
for await (const chunk of stream) {
  fullText += chunk.text;
  // Update UI with each chunk
}

// Or get the full text at once
const text = await stream.text;
```

### Non-Streaming
```ts
import { generateText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

const { text, usage } = await generateText({
  model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
  prompt: 'What is the capital of France?',
});

console.log(text);
console.log('Tokens used:', usage.totalTokens);
```

## Configuration
### Model Options

```ts
import { webllm } from '@localmode/webllm';

const model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC', {
  systemPrompt: 'You are a helpful coding assistant.',
  temperature: 0.7,
  maxTokens: 1024,
  topP: 0.9,
});
```

| Prop | Type |
|---|---|
| `systemPrompt` | `string` |
| `temperature` | `number` |
| `maxTokens` | `number` |
| `topP` | `number` |
### Custom Provider
```ts
import { createWebLLM } from '@localmode/webllm';

const myWebLLM = createWebLLM({
  onProgress: (progress) => {
    console.log(`Loading: ${(progress.progress * 100).toFixed(1)}%`);
    console.log(`Status: ${progress.text}`);
  },
});

const model = myWebLLM.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
```

## Model Preloading
Preload models during app initialization:
```ts
import { preloadModel, isModelCached } from '@localmode/webllm';

// Check if the model is already cached
if (!(await isModelCached('Llama-3.2-1B-Instruct-q4f16_1-MLC'))) {
  // Show loading UI
  await preloadModel('Llama-3.2-1B-Instruct-q4f16_1-MLC', {
    onProgress: (progress) => {
      updateLoadingBar(progress.progress * 100);
    },
  });
}

// Model is ready for instant inference
```

## Model Management
### Available Models Registry
Access model metadata programmatically:
```ts
import { WEBLLM_MODELS, type WebLLMModelId } from '@localmode/webllm';

// Get all available model IDs
const modelIds = Object.keys(WEBLLM_MODELS) as WebLLMModelId[];

// Access model info
const llama = WEBLLM_MODELS['Llama-3.2-1B-Instruct-q4f16_1-MLC'];
console.log(llama.name); // 'Llama 3.2 1B Instruct'
console.log(llama.contextLength); // 4096
console.log(llama.size); // '~700MB'
console.log(llama.sizeBytes); // 734003200
console.log(llama.description); // 'Fast, lightweight model...'
```

### Model Categorization
Categorize models by size for UI display:
```ts
import { getModelCategory, WEBLLM_MODELS, type WebLLMModelId } from '@localmode/webllm';

// Get category based on model size
const modelId: WebLLMModelId = 'Llama-3.2-1B-Instruct-q4f16_1-MLC';
const sizeBytes = WEBLLM_MODELS[modelId].sizeBytes;
const category = getModelCategory(sizeBytes);
console.log(category); // 'small' | 'medium' | 'large'

// Use for UI grouping
function getModelsByCategory() {
  const categories: Record<'small' | 'medium' | 'large', Array<Record<string, unknown>>> = {
    small: [],
    medium: [],
    large: [],
  };
  for (const [id, info] of Object.entries(WEBLLM_MODELS)) {
    const cat = getModelCategory(info.sizeBytes);
    categories[cat].push({ id, ...info });
  }
  return categories;
}
```

### Delete Cached Models
Remove models from browser cache to free up storage:
```ts
import { deleteModelCache, isModelCached } from '@localmode/webllm';

// Delete a specific model's cache
await deleteModelCache('Llama-3.2-1B-Instruct-q4f16_1-MLC');

// Verify deletion
const stillCached = await isModelCached('Llama-3.2-1B-Instruct-q4f16_1-MLC');
console.log(stillCached); // false
```

### Storage Management
LLM models can be large (700MB to 4GB). Use `deleteModelCache()` to let users free up storage when they no longer need a model.
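For example, a settings screen could list which models are currently cached (with their approximate sizes) and offer a "clear all" action. This is a sketch built only from the registry and cache helpers shown above; the UI wiring is up to you:

```ts
import {
  WEBLLM_MODELS,
  type WebLLMModelId,
  isModelCached,
  deleteModelCache,
} from '@localmode/webllm';

// List every cached model with its display name and approximate download size.
async function listCachedModels() {
  const ids = Object.keys(WEBLLM_MODELS) as WebLLMModelId[];
  const cached: { id: WebLLMModelId; name: string; size: string }[] = [];
  for (const id of ids) {
    if (await isModelCached(id)) {
      const { name, size } = WEBLLM_MODELS[id];
      cached.push({ id, name, size });
    }
  }
  return cached;
}

// Delete every cached model, e.g. from a "Clear all models" button.
async function clearAllModelCaches() {
  for (const { id } of await listCachedModels()) {
    await deleteModelCache(id);
  }
}
```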
### Type-Safe Model IDs

Use the `WebLLMModelId` type for type-safe model selection:
```ts
import { webllm, type WebLLMModelId } from '@localmode/webllm';

// Type-safe function that only accepts valid model IDs
function selectModel(modelId: WebLLMModelId) {
  return webllm.languageModel(modelId);
}

// ✅ Valid
selectModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');

// ❌ TypeScript error: invalid model ID
selectModel('invalid-model-name');
```

## Chat Application
```ts
import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

interface Message {
  role: 'user' | 'assistant';
  content: string;
}

async function chat(messages: Message[], userMessage: string) {
  const model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC', {
    systemPrompt: 'You are a helpful assistant.',
  });

  // Build a conversation prompt from the history plus the new user turn
  const prompt = messages
    .map((m) => `${m.role}: ${m.content}`)
    .concat([`user: ${userMessage}`, 'assistant:'])
    .join('\n');

  const stream = await streamText({
    model,
    prompt,
    stopSequences: ['user:', '\n\n'],
  });

  let response = '';
  for await (const chunk of stream) {
    response += chunk.text;
    // Update UI
  }

  return response;
}
```

## RAG Integration
Combine with retrieval for document-grounded chat:
```ts
import { semanticSearch, rerank, streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

// Assumes `VectorDB`, `embeddingModel`, and `rerankerModel` are set up elsewhere in your app.
async function ragChat(query: string, db: VectorDB) {
  // 1. Retrieve context
  const results = await semanticSearch({
    db,
    model: embeddingModel,
    query,
    k: 10,
  });

  // 2. Rerank for relevance
  const reranked = await rerank({
    model: rerankerModel,
    query,
    documents: results.map((r) => r.metadata.text as string),
    topK: 3,
  });

  const context = reranked.map((r) => r.document).join('\n\n---\n\n');

  // 3. Generate with context
  const llm = webllm.languageModel('Llama-3.2-3B-Instruct-q4f16_1-MLC');
  const stream = await streamText({
    model: llm,
    prompt: `You are a helpful assistant. Answer based only on the provided context.
If the answer is not in the context, say "I don't have that information."

Context:
${context}

Question: ${query}

Answer:`,
  });

  return stream;
}
```

## Requirements
### WebGPU Required

WebLLM requires WebGPU support. Check availability:
```ts
import { isWebGPUSupported } from '@localmode/core';

if (!isWebGPUSupported()) {
  console.warn('WebGPU not available. LLM features disabled.');
}
```

### Browser Support
| Browser | Support |
|---|---|
| Chrome 113+ | ✅ |
| Edge 113+ | ✅ |
| Firefox | ❌ (Nightly only) |
| Safari 18+ | ✅ |
| iOS Safari | ✅ (iOS 26+) |
### Hardware Requirements
- GPU: Any modern GPU with WebGPU support
- VRAM: Depends on model (1-3GB for 1-3B models)
- RAM: 4GB minimum, 8GB+ recommended (a rough capability check is sketched after this list)
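Because device capabilities vary widely, a coarse pre-flight check can help decide whether to offer LLM features at all and which model size to default to. This is only a sketch: it combines `isWebGPUSupported()` from `@localmode/core` with `navigator.deviceMemory`, a Chromium-only hint reported in GB, and the 8GB threshold is an assumption based on the recommendation above:

```ts
import { isWebGPUSupported } from '@localmode/core';

type ModelTier = 'none' | 'small' | 'large';

// Decide which class of model (if any) to offer on this device.
function recommendedModelTier(): ModelTier {
  if (!isWebGPUSupported()) return 'none';

  // navigator.deviceMemory is Chromium-only; assume 4GB when it is unavailable.
  const deviceMemory =
    (navigator as Navigator & { deviceMemory?: number }).deviceMemory ?? 4;

  return deviceMemory >= 8 ? 'large' : 'small';
}
```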
## Best Practices

### WebLLM Tips
- Preload models - Load during app init for instant inference (see the combined sketch after this list)
- Start small - Use 1B models for testing, larger for production
- Stream responses - Better UX than waiting for complete response
- Handle errors - GPU errors, OOM, etc. can occur
- Check capabilities - Verify WebGPU before showing LLM features
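Putting the first and last tips together, an app-initialization helper might look like the following. It is a sketch that only uses APIs shown earlier on this page (`isWebGPUSupported`, `isModelCached`, `preloadModel`, and `webllm.languageModel`); the progress handling and the chosen model ID are up to your app:

```ts
import { isWebGPUSupported } from '@localmode/core';
import { webllm, preloadModel, isModelCached } from '@localmode/webllm';

const MODEL_ID = 'Llama-3.2-1B-Instruct-q4f16_1-MLC';

// Run once during app startup: verify WebGPU, warm the cache, return a ready model (or null).
export async function initLLM() {
  if (!isWebGPUSupported()) {
    return null; // Hide LLM features on unsupported browsers
  }

  if (!(await isModelCached(MODEL_ID))) {
    await preloadModel(MODEL_ID, {
      onProgress: (progress) => {
        console.log(`Loading model: ${(progress.progress * 100).toFixed(0)}%`);
      },
    });
  }

  return webllm.languageModel(MODEL_ID);
}
```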
## Error Handling
```ts
import { streamText, GenerationError } from '@localmode/core';

try {
  const stream = await streamText({
    model,
    prompt: 'Hello',
  });

  for await (const chunk of stream) {
    // ...
  }
} catch (error) {
  if (error instanceof GenerationError) {
    if (error.code === 'WEBGPU_NOT_SUPPORTED') {
      console.error('WebGPU not available');
    } else if (error.code === 'MODEL_LOAD_FAILED') {
      console.error('Failed to load model');
    } else if (error.code === 'OUT_OF_MEMORY') {
      console.error('Not enough GPU memory');
    }
  }
}
```