wllama provider for browser LLM inference via llama.cpp WASM. Run any GGUF model without WebGPU.

@localmode/wllama

Run any GGUF model in the browser using llama.cpp compiled to WebAssembly. Access 160,000+ models from HuggingFace without WebGPU.

See it in action

Try GGUF Explorer and LLM Chat for working demos.

Features

Universal Browser Support -- Works in Chrome, Firefox, Safari, and Edge (WASM only, no WebGPU needed)
160K+ Models -- Run any GGUF model from HuggingFace
Embedding Models -- wllama.embedding() for GGUF embedding models (nomic-embed, mxbai-embed, bge-small)
WebGPU Acceleration -- Optional GPU offload via useWebGPU and nGpuLayers for faster inference
Tool Calling -- OAI-compatible tool calling via providerOptions.wllama.tools
Vision / Multimodal -- Image input for vision-language models (Holo2 4B/8B, Gemma 4 E2B/E4B) via mmprojUrl
Jinja Chat Templates -- Native template parsing for accurate prompt formatting (default on)
GGUF Inspection -- Read model metadata before downloading via ~4KB Range requests
Compatibility Check -- Estimate if a model will run on the current device
Multi-Threading -- Auto-detects cross-origin isolation for 2-4x faster inference
True Streaming -- Token-by-token streaming via createChatCompletion({ stream: true })
Structured Output -- JSON mode and JSON schema via providerOptions.wllama.response_format
Reranking -- Cross-encoder reranking via wllama.reranker() (Jina, BGE)
Reasoning Mode -- DeepSeek-R1 thinking models with configurable reasoning budgets
Grammar Sampling -- Constrained generation via GBNF grammars
Performance Tuning -- KV cache quantization, flash attention, speculative decoding
LoRA Adapters -- Load fine-tuned LoRA adapters alongside base models
Safari Compatibility -- Optional @wllama/wllama-compat for Safari/iOS support

Installation

pnpm install @localmode/wllama @localmode/core

npm install @localmode/wllama @localmode/core

yarn add @localmode/wllama @localmode/core

bun add @localmode/wllama @localmode/core

Quick Start

import { streamText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

const result = await streamText({
  model,
  prompt: 'Explain quantum computing in simple terms.',
});

let fullText = '';
for await (const chunk of result.stream) {
  fullText += chunk.text;
  // Update your UI with each chunk
}

Model Selection

Use WLLAMA_MODELS for curated models or pass any HuggingFace GGUF URL:

import { WLLAMA_MODELS } from '@localmode/wllama';

// Curated catalog entry
const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');

// HuggingFace shorthand (repo:filename)
const model2 = wllama.languageModel(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

// Full URL
const model3 = wllama.languageModel(
  'https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);
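
The shorthand and full-URL forms above differ only by a fixed prefix. As a rough illustration of that mapping, here is a minimal sketch of a resolver. `resolveGgufUrl` is a hypothetical helper, not part of @localmode/wllama, and it assumes the file lives on the repo's `main` branch. Catalog keys are not handled.

```typescript
// Hypothetical helper for illustration; the provider does its own resolution.
const HF_BASE = 'https://huggingface.co';

/** Expand a `repo:filename` shorthand into a full HuggingFace download URL. */
function resolveGgufUrl(id: string): string {
  // Full URLs pass through unchanged.
  if (/^https?:\/\//.test(id)) return id;

  const sep = id.indexOf(':');
  if (sep === -1) {
    throw new Error(`Not a URL or repo:filename shorthand: ${id}`);
  }

  const repo = id.slice(0, sep);
  const file = id.slice(sep + 1);
  if (!repo.includes('/') || !file.endsWith('.gguf')) {
    throw new Error(`Expected "owner/repo:file.gguf", got: ${id}`);
  }

  // Assumption: the file is served from the `main` branch.
  return `${HF_BASE}/${repo}/resolve/main/${file}`;
}
```

Passing the shorthand from the example above yields the same URL as the full-URL form.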

Recommended Picks

Testing / Prototyping: SmolLM2-135M Q4_K_M -- 70MB, instant loading
General Purpose: Llama 3.2 1B Q4_K_M -- 750MB, good balance with 128K context
Multilingual: Qwen3 1.7B Q4_K_M -- 1.2GB, hybrid thinking with strong multilingual support
Reasoning / Coding: Qwen3 4B Q4_K_M -- 2.7GB, excellent reasoning and code generation
Thinking / Chain-of-Thought: DeepSeek R1 1.5B Q4_K_M -- 1.1GB, reasoning model with <think> tags
Higher Quality: Llama 3.2 3B Q4_K_M -- 1.93GB, excellent quality with 128K context
Best Quality: Llama 3.1 8B Q4_K_M -- 4.92GB, requires 8GB+ RAM
UI Grounding / Vision: Holo2 4B Q4_K_M -- 2.8GB, vision-language model for browser-agent / GUI navigation
Reranking: Jina Reranker v2 Q4_K_M -- 163MB, multilingual cross-encoder reranking

`WLLAMA_MODELS` Catalog (30 Models)

The curated catalog ships 30 quantized models across multiple categories: language models in four size tiers, vision-language models, Qwen3 models, DeepSeek R1 reasoning models, embedding models, and reranker models:

Tiny (< 500MB)

Catalog Key	Name	Size	Context	Params	Best For
`SmolLM2-135M-Instruct-Q4_K_M`	SmolLM2 135M	70MB	8K	135M	Instant loading, testing
`SmolLM2-360M-Instruct-Q4_K_M`	SmolLM2 360M	234MB	8K	360M	Very small, fast responses
`Qwen2.5-0.5B-Instruct-Q4_K_M`	Qwen 2.5 0.5B	386MB	4K	500M	Tiny with great quality

Small (500MB -- 1GB)

Catalog Key	Name	Size	Context	Params	Best For
`Qwen3-0.6B-Q4_K_M`	Qwen3 0.6B	530MB	40K	600M	Fast multilingual reasoning, hybrid thinking
`TinyLlama-1.1B-Chat-Q4_K_M`	TinyLlama 1.1B Chat	670MB	2K	1.1B	Classic, fast and reliable
`Llama-3.2-1B-Instruct-Q4_K_M`	Llama 3.2 1B	750MB	128K	1.2B	General purpose, huge context
`Qwen2.5-1.5B-Instruct-Q4_K_M`	Qwen 2.5 1.5B	986MB	32K	1.5B	Multilingual

Medium (1 -- 2GB)

Catalog Key	Name	Size	Context	Params	Best For
`Qwen2.5-Coder-1.5B-Instruct-Q4_K_M`	Qwen 2.5 Coder 1.5B	1.0GB	32K	1.5B	Code-specialized, programming
`SmolLM2-1.7B-Instruct-Q4_K_M`	SmolLM2 1.7B	1.06GB	8K	1.7B	Efficient per-param quality
`DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M`	DeepSeek R1 1.5B	1.1GB	128K	1.5B	Reasoning/thinking, chain-of-thought
`Qwen3-1.7B-Q4_K_M`	Qwen3 1.7B	1.2GB	40K	1.7B	Multilingual reasoning, hybrid thinking
`Phi-3.5-mini-instruct-Q4_K_M`	Phi 3.5 Mini	1.24GB	4K	3.8B	Reasoning, coding
`Gemma-2-2B-IT-Q4_K_M`	Gemma 2 2B IT	1.3GB	8K	2B	Instruction following
`Llama-3.2-3B-Instruct-Q4_K_M`	Llama 3.2 3B	1.93GB	128K	3.2B	High quality, huge context
`Qwen2.5-3B-Instruct-Q4_K_M`	Qwen 2.5 3B	1.94GB	32K	3B	High quality multilingual

Large (2GB+)

Catalog Key	Name	Size	Context	Params	Best For
`Phi-4-mini-instruct-Q4_K_M`	Phi-4 Mini	2.3GB	4K	3.8B	Strong reasoning and coding
`Qwen3-4B-Q4_K_M`	Qwen3 4B	2.7GB	40K	4B	Excellent multilingual reasoning and code
`Qwen2.5-Coder-7B-Instruct-Q4_K_M`	Qwen 2.5 Coder 7B	4.5GB	32K	7B	Best code generation quality
`Mistral-7B-Instruct-v0.3-Q4_K_M`	Mistral 7B v0.3	4.37GB	32K	7.2B	Strong general performance
`DeepSeek-R1-Distill-Qwen-7B-Q4_K_M`	DeepSeek R1 7B	4.7GB	128K	7B	Strong reasoning/thinking, 8GB+ RAM
`Llama-3.1-8B-Instruct-Q4_K_M`	Llama 3.1 8B	4.92GB	128K	8B	Best quality (8GB+ RAM)

Vision-Language (UI grounding)

Catalog Key	Name	Size	Context	Params	Best For
`Holo2-4B-Q4_K_M`	Holo2 4B	2.8GB	256K	4B	UI grounding (vision), browser-agent / GUI navigation
`Holo2-8B-Q4_K_M`	Holo2 8B	5.1GB	256K	8B	Premium UI grounding (vision), 8GB+ RAM required
`Gemma-4-E2B-IT-Q4_K_M`	Gemma 4 E2B IT	3.46GB	128K	5.1B (2.3B eff.)	Google Gemma 4, vision + tool calling
`Gemma-4-E4B-IT-Q4_K_M`	Gemma 4 E4B IT	5.41GB	128K	8B (~4B eff.)	Google Gemma 4, vision + tool calling, 8GB+ RAM

Embedding Models

Catalog Key	Name	Size	Dimensions	Best For
`nomic-embed-text-v1.5-Q4_K_M`	Nomic Embed Text v1.5	78MB	768	High-quality semantic search
`mxbai-embed-large-v1-Q4_K_M`	MxBai Embed Large v1	197MB	1024	Top-quality English embeddings
`bge-small-en-v1.5-Q8_0`	BGE Small EN v1.5	35MB	384	Lightweight on-device embeddings

Reranker Models

Catalog Key	Name	Size	Context	Best For
`jina-reranker-v2-base-multilingual-Q4_K_M`	Jina Reranker v2	163MB	1K	Multilingual cross-encoder reranking
`bge-reranker-v2-m3-Q4_K_M`	BGE Reranker v2 M3	218MB	8K	Multilingual reranking with long context

Access the catalog programmatically:

import { WLLAMA_MODELS, getModelCategory } from '@localmode/wllama';
import type { WllamaModelId } from '@localmode/wllama';

// Iterate all 30 curated models
for (const [id, info] of Object.entries(WLLAMA_MODELS)) {
  const category = getModelCategory(info.sizeBytes);
  console.log(`[${category}] ${info.name}: ${info.size}, ${info.description}`);
}

// Type-safe catalog key
const modelId: WllamaModelId = 'Llama-3.2-1B-Instruct-Q4_K_M';
const entry = WLLAMA_MODELS[modelId];
console.log(entry.url); // HuggingFace download URL
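
The size tiers used in the tables above (under 500MB, 500MB to 1GB, 1 to 2GB, 2GB and up) can be expressed as a small classifier. This is a sketch only: `sizeTier` is a hypothetical stand-in and may not match how `getModelCategory` is implemented, and it assumes decimal units (1MB = 1,000,000 bytes).

```typescript
type SizeTier = 'tiny' | 'small' | 'medium' | 'large';

// Assumption: decimal units, as download sizes are usually reported.
const MB = 1_000_000;
const GB = 1_000 * MB;

/** Classify a model by download size using the tier boundaries from the tables. */
function sizeTier(sizeBytes: number): SizeTier {
  if (sizeBytes < 500 * MB) return 'tiny';   // < 500MB
  if (sizeBytes < 1 * GB) return 'small';    // 500MB to 1GB
  if (sizeBytes < 2 * GB) return 'medium';   // 1 to 2GB
  return 'large';                            // 2GB+
}
```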

Text Generation

True Streaming

doStream() streams token-by-token via createChatCompletion({ stream: true }), delivering each token as it is generated rather than buffering the full response:

import { streamText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const result = await streamText({
  model: wllama.languageModel('SmolLM2-135M-Instruct-Q4_K_M'),
  prompt: 'Write a poem',
});

let poem = '';
for await (const chunk of result.stream) {
  poem += chunk.text; // render the partial text in your UI as tokens arrive
}

Non-Streaming

import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const { text, usage } = await generateText({
  model: wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M'),
  prompt: 'What is the capital of France?',
});

console.log(text);
console.log('Tokens used:', usage.totalTokens);

Embedding Models

Generate text embeddings from GGUF embedding models using wllama.embedding():

import { embed, embedMany } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.embedding('nomic-embed-text-v1.5-Q4_K_M');

// Single embedding
const { embedding } = await embed({ model, value: 'Hello world' });
console.log(embedding.length); // 768

// Batch embeddings
const { embeddings } = await embedMany({
  model,
  values: ['First document', 'Second document'],
});

Embedding dimensions are auto-detected from GGUF metadata. You can also use any GGUF embedding model from HuggingFace:

const model = wllama.embedding(
  'nomic-ai/nomic-embed-text-v1.5-GGUF:nomic-embed-text-v1.5.Q4_K_M.gguf'
);
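
Embedding vectors are usually compared with cosine similarity. The helper below is a self-contained sketch of that comparison; `cosineSimilarity` is an illustrative name, and @localmode/core may already ship an equivalent utility.

```typescript
/** Cosine similarity of two equal-length vectors, in the range [-1, 1]. */
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }

  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }

  // A zero vector has no direction; treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```

With the batch example above, `cosineSimilarity(embeddings[0], embeddings[1])` would score how close the two documents are.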

Tool Calling

Models that support tool calling can use OAI-compatible tools via providerOptions.wllama:

import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');

const result = await generateText({
  model,
  prompt: 'What is the weather in Tokyo?',
  providerOptions: {
    wllama: {
      tools: [{
        type: 'function',
        function: {
          name: 'get_weather',
          description: 'Get current weather for a city',
          parameters: {
            type: 'object',
            properties: { city: { type: 'string' } },
            required: ['city'],
          },
        },
      }],
      tool_choice: 'auto',
    },
  },
});

// result.toolCalls contains the tool invocations when the model calls a tool

Models with supportsToolCalling: true in the catalog have been verified for tool calling. Check WLLAMA_MODELS[modelId].supportsToolCalling to confirm support.
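
Once the model returns tool calls, the application still has to run them. The sketch below shows one way to dispatch calls to local handlers. The `ToolCall` shape (`name` plus parsed `arguments`) is an assumption made for illustration; check the actual type of `result.toolCalls` before relying on it.

```typescript
// Assumed shape of a tool call; the real type may differ.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

interface ToolResult {
  name: string;
  result?: unknown;
  error?: string;
}

type ToolHandler = (args: Record<string, unknown>) => unknown;

/** Run each tool call against a handler registry, reporting unknown tools. */
function runToolCalls(
  calls: ToolCall[],
  handlers: Record<string, ToolHandler | undefined>
): ToolResult[] {
  return calls.map((call) => {
    const handler = handlers[call.name];
    if (!handler) {
      return { name: call.name, error: `Unknown tool: ${call.name}` };
    }
    return { name: call.name, result: handler(call.arguments) };
  });
}
```

A handler for the example above could be registered as `{ get_weather: (args) => lookUpWeather(String(args['city'])) }`, where `lookUpWeather` is your own function.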

Vision / Multimodal

Vision-language models accept image input alongside text. Catalog entries for vision models (Holo2 4B/8B, Gemma 4 E2B/E4B) include mmprojUrl automatically:

import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('Holo2-4B-Q4_K_M');
console.log(model.supportsVision); // true

const { text } = await generateText({
  model,
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Describe this screenshot.' },
      { type: 'image', data: base64ImageData, mimeType: 'image/png' },
    ],
  }],
});

For custom vision models, pass mmprojUrl in the model settings:

const model = wllama.languageModel('my-repo/my-vlm-GGUF:model.gguf', {
  mmprojUrl: 'https://huggingface.co/my-repo/my-vlm-GGUF/resolve/main/mmproj-f16.gguf',
});

Images are passed as base64-encoded data in multimodal content parts. The provider converts them to ArrayBuffers internally for wllama v3's vision API.
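
Since images are supplied as base64 strings, raw bytes (for example from a file input or a canvas) need encoding first. The helpers below are a minimal sketch using the standard `btoa` and `atob` globals; the function names are illustrative and not part of the provider.

```typescript
/** Encode raw bytes as a base64 string (no data: URL prefix). */
function bytesToBase64(bytes: Uint8Array): string {
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

/** Decode a base64 string back into bytes. */
function base64ToBytes(base64: string): Uint8Array {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}
```

For a `File` from an upload, `bytesToBase64(new Uint8Array(await file.arrayBuffer()))` would produce a value suitable for the `data` field shown above.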

WebGPU Acceleration

Enable WebGPU to offload transformer layers to the GPU for faster inference:

import { wllama } from '@localmode/wllama';

// Enable GPU offload (falls back to WASM if WebGPU unavailable)
const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
  useWebGPU: true,
});

// Auto-detect WebGPU availability
const model2 = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
  useWebGPU: 'auto',
});

// Fine-grained control: offload specific number of layers (-1 for all)
const model3 = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
  nGpuLayers: 20,
});

console.log(model.gpuAccelerated); // true when WebGPU is active

WebGPU acceleration also works with embedding models:

const embedModel = wllama.embedding('nomic-embed-text-v1.5-Q4_K_M', {
  useWebGPU: 'auto',
});

nGpuLayers takes precedence over useWebGPU. Use -1 to offload all layers to GPU.
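
The precedence rule can be pictured as a small resolver. This is a sketch of the stated behavior, not the provider's code: `resolveGpuLayers` is hypothetical, and mapping `useWebGPU: true` or `'auto'` to "all layers" (-1) is an assumption.

```typescript
interface GpuSettings {
  useWebGPU?: boolean | 'auto';
  nGpuLayers?: number;
}

/**
 * Return the number of layers to offload: -1 for all, 0 for none (WASM only).
 */
function resolveGpuLayers(settings: GpuSettings, webgpuAvailable: boolean): number {
  // Without WebGPU, everything runs on the WASM backend.
  if (!webgpuAvailable) return 0;

  // An explicit layer count wins over the useWebGPU flag.
  if (settings.nGpuLayers !== undefined) return settings.nGpuLayers;

  // Assumption: enabling WebGPU without a layer count offloads all layers.
  return settings.useWebGPU ? -1 : 0;
}
```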

Jinja Chat Templates

wllama v3 uses the model's built-in Jinja chat template for accurate prompt formatting. This is enabled by default. If a model's template causes errors, wllama automatically falls back to raw completion mode with a console warning.

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
  useJinja: false, // disable Jinja templates (use raw completion)
});

Configuration

Model Options

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
  systemPrompt: 'You are a helpful coding assistant.',
  temperature: 0.7,
  maxTokens: 1024,
  topP: 0.9,
  contextLength: 4096,
});

Custom Provider

import { createWllama } from '@localmode/wllama';

const myWllama = createWllama({
  numThreads: 4,
  onProgress: (progress) => {
    console.log(`Loading: ${(progress.progress ?? 0).toFixed(1)}%`);
    console.log(`Status: ${progress.text}`);
  },
});

const model = myWllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');

Provider-Specific Sampling

wllama supports additional sampling parameters via providerOptions:

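import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');

const { text } = await generateText({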
  model,
  prompt: 'Hello!',
  providerOptions: {
    wllama: {
      top_k: 40,
      repeat_penalty: 1.1,
      mirostat: 2,
      mirostat_tau: 5.0,
      mirostat_eta: 0.1,
    },
  },
});

Structured Output / JSON Mode

Force the model to output valid JSON using providerOptions.wllama.response_format. Three format types are supported: text (default), json_object, and json_schema:

import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

// JSON object mode — model outputs valid JSON
const { text } = await generateText({
  model: wllama.languageModel('Qwen3-1.7B-Q4_K_M'),
  prompt: 'List 3 colors as JSON array',
  providerOptions: { wllama: { response_format: { type: 'json_object' } } },
});

console.log(JSON.parse(text)); // ["red", "green", "blue"]

// JSON schema mode — constrain output to a specific schema
const { text: structured } = await generateText({
  model: wllama.languageModel('Qwen3-1.7B-Q4_K_M'),
  prompt: 'Describe a person',
  providerOptions: {
    wllama: {
      response_format: {
        type: 'json_schema',
        json_schema: {
          name: 'person',
          schema: {
            type: 'object',
            properties: {
              name: { type: 'string' },
              age: { type: 'number' },
            },
            required: ['name', 'age'],
          },
          strict: true,
        },
      },
    },
  },
});
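
Even with schema-constrained output, it is reasonable to validate the parsed value before using it, for example when generation is cut off by a token limit. Below is a minimal sketch of a type guard for the `person` schema above; `Person`, `isPerson`, and `parsePerson` are illustrative names.

```typescript
interface Person {
  name: string;
  age: number;
}

/** Runtime check that a parsed value matches the `person` schema. */
function isPerson(value: unknown): value is Person {
  if (typeof value !== 'object' || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record['name'] === 'string' && typeof record['age'] === 'number';
}

/** Parse model output, returning null for invalid JSON or a wrong shape. */
function parsePerson(text: string): Person | null {
  try {
    const parsed: unknown = JSON.parse(text);
    return isPerson(parsed) ? parsed : null;
  } catch {
    return null; // truncated or malformed JSON
  }
}
```

Calling `parsePerson(structured)` on the result above gives either a typed `Person` or `null`.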

Reranking

Use wllama.reranker() with cross-encoder models to rerank search results by relevance. The catalog ships two reranker models (Jina Reranker v2 and BGE Reranker v2 M3):

import { rerank } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const { results } = await rerank({
  model: wllama.reranker('jina-reranker-v2-base-multilingual-Q4_K_M'),
  query: 'machine learning',
  documents: ['AI paper', 'cooking recipe', 'deep learning tutorial'],
});

// results sorted by relevance score
console.log(results[0]); // { index: 2, score: 0.95, text: 'deep learning tutorial' }
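
A common follow-up is to keep only the most relevant documents before passing them to a language model. The sketch below assumes the result shape shown in the comment above (`index`, `score`, `text`); `topRelevant` is a hypothetical helper.

```typescript
// Shape taken from the example output above.
interface RerankResult {
  index: number;
  score: number;
  text: string;
}

/** Keep at most `k` results with score >= minScore, best first. */
function topRelevant(results: RerankResult[], k: number, minScore = 0): RerankResult[] {
  return results
    .filter((r) => r.score >= minScore)
    .sort((a, b) => b.score - a.score) // filter() returned a copy, so the input is untouched
    .slice(0, k);
}
```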

Reasoning Mode

Enable reasoning mode for DeepSeek-R1 thinking models. The model produces a chain-of-thought in <think> tags before generating the final answer:

import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M', {
  reasoning: true,
  reasoningFormat: 'deepseek',
  reasoningBudgetTokens: 1024,
});

const { text } = await generateText({
  model,
  prompt: 'What is 15% of 240?',
});

Models with supportsReasoning: true in the catalog have been verified for reasoning mode. Check WLLAMA_MODELS[modelId].supportsReasoning to confirm support.
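
If the returned text still contains the `<think>` block, you may want to separate it from the final answer before display. The sketch below does that with a regular expression. Whether the provider already strips or separates reasoning is not stated here, so treat `splitReasoning` as a hypothetical post-processing step.

```typescript
interface ReasonedOutput {
  reasoning: string;
  answer: string;
}

/** Split the first <think>...</think> block from the rest of the output. */
function splitReasoning(output: string): ReasonedOutput {
  const match = /<think>([\s\S]*?)<\/think>/.exec(output);
  if (!match) {
    return { reasoning: '', answer: output.trim() };
  }

  const start = match.index;
  const end = start + match[0].length;
  return {
    reasoning: (match[1] ?? '').trim(),
    // Everything outside the think block is treated as the answer.
    answer: (output.slice(0, start) + output.slice(end)).trim(),
  };
}
```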

Performance Configuration

Fine-tune inference performance with KV cache quantization, flash attention, and speculative decoding:

import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('Qwen3-4B-Q4_K_M', {
  cacheTypeK: 'q4_0',   // Quantize key cache (reduces VRAM usage)
  cacheTypeV: 'q4_0',   // Quantize value cache
  flashAttention: true,  // Enable flash attention for faster inference
});

Valid cache type values (from least to most aggressive compression): f32, f16, q8_0, q5_1, q5_0, q4_1, q4_0.

When to use KV cache quantization

KV cache quantization (cacheTypeK / cacheTypeV) reduces memory usage at a small quality cost. This is especially useful for large context windows or memory-constrained devices. q4_0 is the most aggressive; q8_0 is a good middle ground.
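
To get a feel for the savings, the KV cache size can be estimated as layers × context × KV heads × head size × (bytes per K element + bytes per V element). The sketch below uses the ggml block layouts for each cache type (for example, q4_0 stores 32 values in 18 bytes). The model dimensions in the usage note are illustrative assumptions, not values read from the catalog.

```typescript
type CacheType = 'f32' | 'f16' | 'q8_0' | 'q5_1' | 'q5_0' | 'q4_1' | 'q4_0';

// Average bytes per element: block size in bytes / 32 values per block.
const BYTES_PER_ELEMENT: Record<CacheType, number> = {
  f32: 4,
  f16: 2,
  q8_0: 34 / 32,
  q5_1: 24 / 32,
  q5_0: 22 / 32,
  q4_1: 20 / 32,
  q4_0: 18 / 32,
};

interface KvCacheShape {
  nLayers: number;
  nCtx: number;     // context length in tokens
  nKvHeads: number; // key/value heads (fewer than attention heads with GQA)
  headDim: number;
}

/** Approximate KV cache size in bytes for the given cache types. */
function kvCacheBytes(shape: KvCacheShape, typeK: CacheType, typeV: CacheType): number {
  const elementsPerCache = shape.nLayers * shape.nCtx * shape.nKvHeads * shape.headDim;
  return elementsPerCache * (BYTES_PER_ELEMENT[typeK] + BYTES_PER_ELEMENT[typeV]);
}
```

For a hypothetical model with 16 layers, 8 KV heads, and head size 64 at a 4096-token context, this gives 128 MiB with f16 and 36 MiB with q4_0 for both caches.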

Speculative Decoding

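Use a small draft model alongside the main model for up to 2-3x faster inference. Speculative decoding generally works best when the draft model shares the main model's tokenizer: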

const model = wllama.languageModel('Llama-3.1-8B-Instruct-Q4_K_M', {
  specDraftModel: 'https://huggingface.co/bartowski/SmolLM2-135M-Instruct-GGUF/resolve/main/SmolLM2-135M-Instruct-Q4_K_M.gguf',
  specDraftNgl: -1,    // Offload draft model to GPU
  specDraftNMin: 2,    // Minimum draft tokens
  specDraftNMax: 8,    // Maximum draft tokens
  specDraftPMin: 0.4,  // Minimum probability threshold
});

Grammar Sampling

Constrain model output to match a GBNF grammar. This is useful for generating structured data like email addresses, dates, or domain-specific formats:

import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');

const { text } = await generateText({
  model,
  prompt: 'Generate a valid email address',
  providerOptions: {
    wllama: {
      grammar: 'root ::= [a-z]+ "@" [a-z]+ "." [a-z]+',
    },
  },
});

console.log(text); // e.g., "user@example.com"
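
Dates, mentioned above, follow the same pattern. The sketch below pairs a GBNF grammar for the `YYYY-MM-DD` shape with an equivalent regular expression that can be used as a defensive check on the output. Both constants are illustrative, and neither verifies that the date is a real calendar date.

```typescript
// GBNF: four digits, dash, two digits, dash, two digits.
const ISO_DATE_GRAMMAR =
  'root ::= [0-9] [0-9] [0-9] [0-9] "-" [0-9] [0-9] "-" [0-9] [0-9]';

// Regular expression accepting the same strings as the grammar.
const ISO_DATE_PATTERN = /^[0-9]{4}-[0-9]{2}-[0-9]{2}$/;

/** True when the text has the YYYY-MM-DD shape (not a calendar check). */
function matchesIsoDateShape(text: string): boolean {
  return ISO_DATE_PATTERN.test(text);
}
```

`ISO_DATE_GRAMMAR` would be passed as the `grammar` option in the same way as the email grammar above.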

LoRA Adapters

Load LoRA adapters alongside a base model for fine-tuned behavior:

import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
  loraAdapters: [
    { path: 'https://your-cdn.com/adapters/my-lora.gguf', scale: 1.0 },
  ],
});

Set loraInitWithoutApply: true to load adapters without applying them immediately (for manual control).

Model Preloading

Preload models during app initialization:

import { preloadModel, isModelCached } from '@localmode/wllama';

const modelId = 'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf';

if (!(await isModelCached(modelId))) {
  await preloadModel(modelId, {
    onProgress: (progress) => {
      updateLoadingBar(progress.progress ?? 0);
    },
  });
}

Model Management

List and Clear Cached Models

import { listCachedModels, clearAllModelCache } from '@localmode/wllama';

// List all cached models
const models = await listCachedModels();
console.log(`${models.length} models cached`);

// Clear all cached models at once
await clearAllModelCache();

Delete a Single Cached Model

import { deleteModelCache } from '@localmode/wllama';

await deleteModelCache('bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf');

Re-download a Cached Model

Use refreshModel() to delete a corrupted cache and re-download:

import { refreshModel } from '@localmode/wllama';

await refreshModel('SmolLM2-135M-Instruct-Q4_K_M', {
  onProgress: (p) => console.log(`${p.progress}%`),
});

CORS Multi-Threading

wllama uses SharedArrayBuffer for multi-threaded WASM execution, which requires the page to be cross-origin isolated. Serve it with these headers:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

Without these headers, wllama automatically falls back to single-threaded mode (~2-4x slower).

import { isCrossOriginIsolated } from '@localmode/wllama';

if (isCrossOriginIsolated()) {
  console.log('Multi-threading enabled');
} else {
  console.log('Single-thread fallback (add CORS headers for 2-4x speed)');
}
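
If you prefer to set `numThreads` explicitly rather than rely on auto-detection, a heuristic along these lines is one option. `pickThreadCount` is a hypothetical helper; leaving one core free and capping at four threads are assumptions, not provider defaults.

```typescript
/**
 * Choose a thread count: 1 without cross-origin isolation, otherwise
 * the core count minus one (kept free for the UI), capped at maxThreads.
 */
function pickThreadCount(isolated: boolean, cores: number, maxThreads = 4): number {
  // SharedArrayBuffer is unavailable without isolation, so stay single-threaded.
  if (!isolated) return 1;

  // Guard against missing or invalid core counts.
  if (!Number.isFinite(cores) || cores < 1) return 1;

  const usable = Math.max(1, Math.floor(cores) - 1);
  return Math.min(usable, maxThreads);
}
```

In a browser this could be called as `pickThreadCount(isCrossOriginIsolated(), navigator.hardwareConcurrency)` and the result passed to `createWllama({ numThreads })`.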

Next.js CORS Headers

Add to your next.config.js:

// next.config.js (merge with your existing config)
module.exports = {
  async headers() {
    return [{
      source: '/(.*)',
      headers: [
        { key: 'Cross-Origin-Opener-Policy', value: 'same-origin' },
        { key: 'Cross-Origin-Embedder-Policy', value: 'require-corp' },
      ],
    }];
  },
};

WebLLM Fallback Pattern

Use wllama as a fallback when WebGPU is not available:

import { webllm } from '@localmode/webllm';
import { wllama } from '@localmode/wllama';

let model;
try {
  model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
} catch (error) {
  console.warn('WebLLM unavailable, falling back to wllama:', error);
  model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
}
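
Model construction may not throw when WebGPU is missing, since failures can surface later during loading. An explicit capability check before choosing a provider is therefore a useful complement. The sketch below uses the standard `navigator.gpu.requestAdapter()` API; `hasWebGPU` and the `NavigatorLike` type are illustrative, and the navigator is passed in so the function can be exercised outside a browser.

```typescript
// Minimal structural types so the sketch does not depend on WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}

interface NavigatorLike {
  gpu?: GpuLike;
}

/** Resolve to true only when a WebGPU adapter can actually be obtained. */
async function hasWebGPU(nav: NavigatorLike | undefined): Promise<boolean> {
  if (!nav || !nav.gpu) return false;
  try {
    // requestAdapter() resolves to null when no suitable adapter exists.
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null && adapter !== undefined;
  } catch {
    return false;
  }
}
```

In the fallback pattern above, the result of `await hasWebGPU(navigator)` could decide between `webllm.languageModel(...)` and `wllama.languageModel(...)`.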

Browser Support

Browser	Support
Chrome 57+	Yes
Edge 16+	Yes
Firefox 52+	Yes
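Safari 11+	Yes (with @wllama/wllama-compat)
iOS Safari	Yes (with @wllama/wllama-compat)

wllama works in all modern browsers because it requires only WebAssembly support, whereas WebLLM requires WebGPU. Safari and iOS additionally need the compatibility package described under Safari Compatibility.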

Safari Compatibility

wllama v3 requires the WebAssembly Memory64 proposal, which is not yet supported in Safari or iOS browsers. To support Safari/iOS, install the optional @wllama/wllama-compat package, which provides a compatibility shim:

pnpm install @wllama/wllama-compat

npm install @wllama/wllama-compat

yarn add @wllama/wllama-compat

bun add @wllama/wllama-compat

Safari / iOS

Without @wllama/wllama-compat, wllama will fail to load on Safari and iOS. The compat package is an optional dependency -- only needed if your app targets those browsers.

wllama vs WebLLM

Feature	@localmode/wllama	@localmode/webllm
Runtime	llama.cpp WASM + optional WebGPU	MLC WebGPU
Browser Support	All modern browsers	WebGPU-capable only
Models	30 curated + 160K+ GGUF	32 curated MLC models
Embeddings	3 GGUF embedding models	--
Reranking	2 cross-encoder models	--
Reasoning	DeepSeek R1 (1.5B, 7B)	--
Tool Calling	Yes (via providerOptions)	Yes
Vision	Holo2 4B/8B, Gemma 4 E2B/E4B (mmprojUrl)	Phi 3.5 Vision
Performance	Good (CPU), faster with WebGPU	Native GPU speed
GPU Required	No (optional WebGPU)	Yes
Model Format	GGUF (standard)	MLC (pre-compiled)

Error Handling

import { generateText, ModelLoadError, GenerationError } from '@localmode/core';

try {
  const { text } = await generateText({ model, prompt: 'Hello' });
} catch (error) {
  if (error instanceof ModelLoadError) {
    console.error('Failed to load model:', error.hint);
  } else if (error instanceof GenerationError) {
    console.error('Generation failed:', error.hint);
  }
}

App	Description	Links
GGUF Explorer	Browse, load, and chat with GGUF models via wllama	Demo · Source
LLM Chat	Chat with wllama GGUF models alongside the WebLLM, ONNX, and LiteRT backends	Demo · Source
