wllama

Overview

@localmode/wllama

Run any GGUF model in the browser using llama.cpp compiled to WebAssembly. Access 135,000+ models from HuggingFace without WebGPU.

See it in action

Try GGUF Explorer and LLM Chat for working demos.

Features

  • Universal Browser Support -- Works in Chrome, Firefox, Safari, and Edge (WASM only, no WebGPU needed)
  • 135K+ Models -- Run any GGUF model from HuggingFace
  • GGUF Inspection -- Read model metadata before downloading via ~4KB Range requests
  • Compatibility Check -- Estimate if a model will run on the current device
  • Multi-Threading -- Auto-detects cross-origin isolation for 2-4x faster inference
  • Streaming -- Real-time token generation
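
The GGUF Inspection feature is possible because every GGUF file begins with a small fixed-layout header, so a single byte-range request is enough to read it before committing to a multi-gigabyte download. The sketch below follows the public GGUF spec; `parseGgufHeader` and `fetchGgufHeader` are illustrative names, not part of the `@localmode/wllama` API.

```typescript
interface GgufHeader {
  version: number;
  tensorCount: bigint;
  metadataKvCount: bigint;
}

function parseGgufHeader(bytes: ArrayBuffer): GgufHeader {
  const view = new DataView(bytes);
  // Magic: the ASCII characters "GGUF" read as a little-endian uint32.
  if (view.getUint32(0, true) !== 0x46554747) {
    throw new Error('Not a GGUF file');
  }
  return {
    version: view.getUint32(4, true),
    tensorCount: view.getBigUint64(8, true),
    metadataKvCount: view.getBigUint64(16, true),
  };
}

// Fetch only the first 4KB of a remote model via an HTTP Range request.
async function fetchGgufHeader(url: string): Promise<GgufHeader> {
  const res = await fetch(url, { headers: { Range: 'bytes=0-4095' } });
  return parseGgufHeader(await res.arrayBuffer());
}
```

`Range: bytes=0-4095` asks the server for only the first 4KB, which is why inspection costs almost nothing compared with downloading the full model.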

Installation

pnpm add @localmode/wllama @localmode/core
npm install @localmode/wllama @localmode/core
yarn add @localmode/wllama @localmode/core
bun add @localmode/wllama @localmode/core

Quick Start

import { streamText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

const result = await streamText({
  model,
  prompt: 'Explain quantum computing in simple terms.',
});

let fullText = '';
for await (const chunk of result.stream) {
  fullText += chunk.text;
  // Update your UI with each chunk
}

Model Selection

Use WLLAMA_MODELS for curated models or pass any HuggingFace GGUF URL:

import { WLLAMA_MODELS } from '@localmode/wllama';

// Curated catalog entry
const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');

// HuggingFace shorthand (repo:filename)
const model2 = wllama.languageModel(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

// Full URL
const model3 = wllama.languageModel(
  'https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);
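
The three accepted identifier shapes can be told apart mechanically. The sketch below shows the implied dispatch; `classifyModelId` is a hypothetical helper for illustration, not the provider's actual parser.

```typescript
type ModelIdKind = 'url' | 'hf-shorthand' | 'catalog-key';

function classifyModelId(id: string): ModelIdKind {
  // Full URLs are passed through as-is.
  if (id.startsWith('http://') || id.startsWith('https://')) return 'url';
  // 'owner/repo:filename.gguf' shorthand names a HuggingFace file.
  if (id.includes('/') && id.includes(':')) return 'hf-shorthand';
  // Anything else is treated as a curated catalog key.
  return 'catalog-key';
}
```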

Recommended Picks

  • Testing / Prototyping: SmolLM2-135M Q4_K_M -- 70MB, instant loading
  • General Purpose: Llama 3.2 1B Q4_K_M -- 750MB, good balance with 128K context
  • Multilingual: Qwen 2.5 1.5B Q4_K_M -- 986MB, strong multilingual support
  • Reasoning / Coding: Phi 3.5 Mini Q4_K_M -- 1.24GB, excellent reasoning
  • Higher Quality: Llama 3.2 3B Q4_K_M -- 1.93GB, excellent quality with 128K context
  • Best Quality: Llama 3.1 8B Q4_K_M -- 4.92GB, requires 8GB+ RAM
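
A rough memory check helps when choosing from this list. The heuristic below (weights plus roughly 50% overhead for the KV cache and WASM heap must fit in device memory) and the `estimateFits` name are assumptions for illustration, not the package's actual Compatibility Check logic.

```typescript
function estimateFits(modelSizeBytes: number, deviceMemoryGB: number): boolean {
  // Weights stay fully resident; leave headroom for KV cache and WASM heap.
  const OVERHEAD_FACTOR = 1.5;
  const budgetBytes = deviceMemoryGB * 1024 ** 3;
  return modelSizeBytes * OVERHEAD_FACTOR <= budgetBytes;
}

// In the browser, navigator.deviceMemory (Chromium-only, capped at 8)
// can supply the second argument:
// const ok = estimateFits(sizeBytes, (navigator as any).deviceMemory ?? 4);
```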

WLLAMA_MODELS Catalog (16 Models)

The curated catalog ships 16 Q4_K_M quantized models across four size tiers:

Tiny (< 500MB)

| Catalog Key | Name | Size | Context | Params | Best For |
| --- | --- | --- | --- | --- | --- |
| SmolLM2-135M-Instruct-Q4_K_M | SmolLM2 135M | 70MB | 8K | 135M | Instant loading, testing |
| SmolLM2-360M-Instruct-Q4_K_M | SmolLM2 360M | 234MB | 8K | 360M | Very small, fast responses |
| Qwen2.5-0.5B-Instruct-Q4_K_M | Qwen 2.5 0.5B | 386MB | 4K | 500M | Tiny with great quality |

Small (500MB -- 1GB)

| Catalog Key | Name | Size | Context | Params | Best For |
| --- | --- | --- | --- | --- | --- |
| TinyLlama-1.1B-Chat-Q4_K_M | TinyLlama 1.1B Chat | 670MB | 2K | 1.1B | Classic, fast and reliable |
| Llama-3.2-1B-Instruct-Q4_K_M | Llama 3.2 1B | 750MB | 128K | 1.2B | General purpose, huge context |
| Qwen2.5-1.5B-Instruct-Q4_K_M | Qwen 2.5 1.5B | 986MB | 32K | 1.5B | Multilingual |

Medium (1 -- 2GB)

| Catalog Key | Name | Size | Context | Params | Best For |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-Coder-1.5B-Instruct-Q4_K_M | Qwen 2.5 Coder 1.5B | 1.0GB | 32K | 1.5B | Code-specialized, programming |
| SmolLM2-1.7B-Instruct-Q4_K_M | SmolLM2 1.7B | 1.06GB | 8K | 1.7B | Efficient per-param quality |
| Phi-3.5-mini-instruct-Q4_K_M | Phi 3.5 Mini | 1.24GB | 4K | 3.8B | Reasoning, coding |
| Gemma-2-2B-IT-Q4_K_M | Gemma 2 2B IT | 1.3GB | 8K | 2B | Instruction following |
| Llama-3.2-3B-Instruct-Q4_K_M | Llama 3.2 3B | 1.93GB | 128K | 3.2B | High quality, huge context |
| Qwen2.5-3B-Instruct-Q4_K_M | Qwen 2.5 3B | 1.94GB | 32K | 3B | High quality multilingual |

Large (2GB+)

| Catalog Key | Name | Size | Context | Params | Best For |
| --- | --- | --- | --- | --- | --- |
| Phi-4-mini-instruct-Q4_K_M | Phi-4 Mini | 2.3GB | 4K | 3.8B | Strong reasoning and coding |
| Qwen2.5-Coder-7B-Instruct-Q4_K_M | Qwen 2.5 Coder 7B | 4.5GB | 32K | 7B | Best code generation quality |
| Mistral-7B-Instruct-v0.3-Q4_K_M | Mistral 7B v0.3 | 4.37GB | 32K | 7.2B | Strong general performance |
| Llama-3.1-8B-Instruct-Q4_K_M | Llama 3.1 8B | 4.92GB | 128K | 8B | Best quality (8GB+ RAM) |
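
The four tier boundaries above reduce to a simple bucketing function. `categoryForSize` is an illustrative stand-in; the catalog's own getModelCategory may be implemented differently.

```typescript
type ModelCategory = 'tiny' | 'small' | 'medium' | 'large';

const MB = 1024 ** 2;
const GB = 1024 ** 3;

// Thresholds mirror the tier headings: <500MB, 500MB-1GB, 1-2GB, 2GB+.
function categoryForSize(sizeBytes: number): ModelCategory {
  if (sizeBytes < 500 * MB) return 'tiny';
  if (sizeBytes < 1 * GB) return 'small';
  if (sizeBytes < 2 * GB) return 'medium';
  return 'large';
}
```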

Access the catalog programmatically:

import { WLLAMA_MODELS, getModelCategory } from '@localmode/wllama';
import type { WllamaModelId } from '@localmode/wllama';

// Iterate all 16 curated models
for (const [id, info] of Object.entries(WLLAMA_MODELS)) {
  const category = getModelCategory(info.sizeBytes);
  console.log(`[${category}] ${info.name}: ${info.size}, ${info.description}`);
}

// Type-safe catalog key
const modelId: WllamaModelId = 'Llama-3.2-1B-Instruct-Q4_K_M';
const entry = WLLAMA_MODELS[modelId];
console.log(entry.url); // HuggingFace download URL

Text Generation

Streaming

import { streamText } from '@localmode/core';

const result = await streamText({
  model: wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M'),
  prompt: 'Write a haiku about programming.',
});

let fullText = '';
for await (const chunk of result.stream) {
  fullText += chunk.text;
}

Non-Streaming

import { generateText } from '@localmode/core';

const { text, usage } = await generateText({
  model: wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M'),
  prompt: 'What is the capital of France?',
});

console.log(text);
console.log('Tokens used:', usage.totalTokens);

Configuration

Model Options

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
  systemPrompt: 'You are a helpful coding assistant.',
  temperature: 0.7,
  maxTokens: 1024,
  topP: 0.9,
  contextLength: 4096,
});

| Prop | Type |
| --- | --- |
| `systemPrompt` | `string` |
| `temperature` | `number` |
| `maxTokens` | `number` |
| `topP` | `number` |
| `contextLength` | `number` |
Custom Provider

import { createWllama } from '@localmode/wllama';

const myWllama = createWllama({
  numThreads: 4,
  onProgress: (progress) => {
    console.log(`Loading: ${progress.progress?.toFixed(1)}%`);
    console.log(`Status: ${progress.text}`);
  },
});

const model = myWllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');

Provider-Specific Sampling

wllama supports additional sampling parameters via providerOptions:

const { text } = await generateText({
  model,
  prompt: 'Hello!',
  providerOptions: {
    wllama: {
      top_k: 40,
      repeat_penalty: 1.1,
      mirostat: 2,
      mirostat_tau: 5.0,
      mirostat_eta: 0.1,
    },
  },
});

Model Preloading

Preload models during app initialization:

import { preloadModel, isModelCached } from '@localmode/wllama';

const modelId = 'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf';

if (!(await isModelCached(modelId))) {
  await preloadModel(modelId, {
    onProgress: (progress) => {
      updateLoadingBar(progress.progress ?? 0);
    },
  });
}

Model Management

Delete Cached Models

import { deleteModelCache } from '@localmode/wllama';

await deleteModelCache('bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf');

Cross-Origin Isolation & Multi-Threading

wllama uses SharedArrayBuffer for multi-threaded WASM execution, which requires the cross-origin isolation headers:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

Without these headers, wllama automatically falls back to single-threaded mode (~2-4x slower).

import { isCrossOriginIsolated } from '@localmode/wllama';

if (isCrossOriginIsolated()) {
  console.log('Multi-threading enabled');
} else {
  console.log('Single-thread fallback (add COOP/COEP headers for 2-4x speed)');
}

Next.js Headers

Add to your next.config.js:

// next.config.js
module.exports = {
  async headers() {
    return [{
      source: '/(.*)',
      headers: [
        { key: 'Cross-Origin-Opener-Policy', value: 'same-origin' },
        { key: 'Cross-Origin-Embedder-Policy', value: 'require-corp' },
      ],
    }];
  },
};

WebLLM Fallback Pattern

Use wllama as a fallback when WebGPU is not available:

import { createProviderWithFallback } from '@localmode/core';
import { webllm } from '@localmode/webllm';
import { wllama } from '@localmode/wllama';

const model = await createProviderWithFallback({
  providers: [
    () => webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC'),
    () => wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M'),
  ],
  onFallback: (error, idx) => console.warn(`Provider ${idx} failed:`, error),
});

Browser Support

| Browser | Support |
| --- | --- |
| Chrome 57+ | Yes |
| Edge 16+ | Yes |
| Firefox 52+ | Yes |
| Safari 11+ | Yes |
| iOS Safari | Yes |

wllama works in ALL modern browsers since it only requires WebAssembly support, unlike WebLLM which requires WebGPU.
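
Because WebAssembly is the only hard requirement, support can be feature-detected without probing WebGPU at all. A minimal check (the helper name is illustrative, not a package export):

```typescript
function supportsWasm(): boolean {
  // WebAssembly has been a global in all the browsers listed above
  // since the versions shown; this guards older or unusual runtimes.
  return (
    typeof WebAssembly === 'object' &&
    typeof WebAssembly.instantiate === 'function'
  );
}
```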

wllama vs WebLLM

| Feature | @localmode/wllama | @localmode/webllm |
| --- | --- | --- |
| Runtime | llama.cpp WASM | MLC WebGPU |
| Browser Support | All modern browsers | WebGPU-capable only |
| Models | 135K+ GGUF models | ~20 curated MLC models |
| Performance | ~40-50% of WebGPU speed | Native GPU speed |
| GPU Required | No | Yes |
| Model Format | GGUF (standard) | MLC (pre-compiled) |

Error Handling

import { generateText, ModelLoadError, GenerationError } from '@localmode/core';

try {
  const { text } = await generateText({ model, prompt: 'Hello' });
} catch (error) {
  if (error instanceof ModelLoadError) {
    console.error('Failed to load model:', error.hint);
  } else if (error instanceof GenerationError) {
    console.error('Generation failed:', error.hint);
  }
}

Next Steps

Showcase Apps

| App | Description | Links |
| --- | --- | --- |
| GGUF Explorer | Browse, load, and chat with GGUF models via wllama | Demo · Source |
| LLM Chat | Chat with wllama GGUF models alongside WebLLM and ONNX | Demo · Source |
