LocalMode
wllama

Overview

wllama provider for browser LLM inference via llama.cpp WASM. Run any GGUF model without WebGPU.

@localmode/wllama

Run any GGUF model in the browser using llama.cpp compiled to WebAssembly. Access 160,000+ models from HuggingFace without WebGPU.

See it in action

Try GGUF Explorer and LLM Chat for working demos.

Features

  • Universal Browser Support -- Works in Chrome, Firefox, Safari, and Edge (WASM only, no WebGPU needed)
  • 160K+ Models -- Run any GGUF model from HuggingFace
  • Embedding Models -- wllama.embedding() for GGUF embedding models (nomic-embed, mxbai-embed, bge-small)
  • WebGPU Acceleration -- Optional GPU offload via useWebGPU and nGpuLayers for faster inference
  • Tool Calling -- OAI-compatible tool calling via providerOptions.wllama.tools
  • Vision / Multimodal -- Image input for vision-language models (Holo2 4B/8B, Gemma 4 E2B/E4B) via mmprojUrl
  • Jinja Chat Templates -- Native template parsing for accurate prompt formatting (default on)
  • GGUF Inspection -- Read model metadata before downloading via ~4KB Range requests
  • Compatibility Check -- Estimate if a model will run on the current device
  • Multi-Threading -- Auto-detects cross-origin isolation (COOP/COEP headers) for 2-4x faster inference
  • True Streaming -- Token-by-token streaming via createChatCompletion({ stream: true })
  • Structured Output -- JSON mode and JSON schema via providerOptions.wllama.response_format
  • Reranking -- Cross-encoder reranking via wllama.reranker() (Jina, BGE)
  • Reasoning Mode -- DeepSeek-R1 thinking models with configurable reasoning budgets
  • Grammar Sampling -- Constrained generation via GBNF grammars
  • Performance Tuning -- KV cache quantization, flash attention, speculative decoding
  • LoRA Adapters -- Load fine-tuned LoRA adapters alongside base models
  • Safari Compatibility -- Optional @wllama/wllama-compat for Safari/iOS support

Installation

pnpm install @localmode/wllama @localmode/core
npm install @localmode/wllama @localmode/core
yarn add @localmode/wllama @localmode/core
bun add @localmode/wllama @localmode/core

Quick Start

import { streamText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

const result = await streamText({
  model,
  prompt: 'Explain quantum computing in simple terms.',
});

let fullText = '';
for await (const chunk of result.stream) {
  fullText += chunk.text;
  // Update your UI with each chunk
}

Model Selection

Use WLLAMA_MODELS for curated models or pass any HuggingFace GGUF URL:

import { wllama, WLLAMA_MODELS } from '@localmode/wllama';

// Curated catalog entry
const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');

// HuggingFace shorthand (repo:filename)
const model2 = wllama.languageModel(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

// Full URL
const model3 = wllama.languageModel(
  'https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);
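
The shorthand and full-URL forms above point at the same file. As a rough illustration of how they relate, the sketch below expands a `repo:filename` string into a HuggingFace `resolve/main` URL. `resolveGgufUrl` is a hypothetical helper written for this page, not an export of @localmode/wllama, and the provider's own resolution logic may differ (for example, it also accepts catalog keys, which this helper rejects).

```typescript
// Hypothetical helper for illustration; not part of @localmode/wllama.
function resolveGgufUrl(id: string): string {
  // Full URLs pass through unchanged.
  if (/^https?:\/\//.test(id)) return id;

  // Shorthand is "<owner>/<repo>:<filename>"; split on the first colon.
  const sep = id.indexOf(':');
  if (sep <= 0 || sep === id.length - 1) {
    throw new Error(`Expected "repo:filename" or a URL, got "${id}"`);
  }
  const repo = id.slice(0, sep);
  const file = id.slice(sep + 1);
  return `https://huggingface.co/${repo}/resolve/main/${file}`;
}

const resolvedUrl = resolveGgufUrl(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);
console.log(resolvedUrl);
```

The URL check runs before the colon split because `https://` itself contains a colon.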

Recommended Picks

  • Testing / Prototyping: SmolLM2-135M Q4_K_M -- 70MB, instant loading
  • General Purpose: Llama 3.2 1B Q4_K_M -- 750MB, good balance with 128K context
  • Multilingual: Qwen3 1.7B Q4_K_M -- 1.2GB, hybrid thinking with strong multilingual support
  • Reasoning / Coding: Qwen3 4B Q4_K_M -- 2.7GB, excellent reasoning and code generation
  • Thinking / Chain-of-Thought: DeepSeek R1 1.5B Q4_K_M -- 1.1GB, reasoning model with <think> tags
  • Higher Quality: Llama 3.2 3B Q4_K_M -- 1.93GB, excellent quality with 128K context
  • Best Quality: Llama 3.1 8B Q4_K_M -- 4.92GB, requires 8GB+ RAM
  • UI Grounding / Vision: Holo2 4B Q4_K_M -- 2.8GB, vision-language model for browser-agent / GUI navigation
  • Reranking: Jina Reranker v2 Q4_K_M -- 163MB, multilingual cross-encoder reranking

WLLAMA_MODELS Catalog (30 Models)

The curated catalog ships 30 quantized models across multiple categories: language models in four size tiers, vision-language models, Qwen3 models, DeepSeek R1 reasoning models, embedding models, and reranker models:

Tiny (< 500MB)

| Catalog Key | Name | Size | Context | Params | Best For |
| --- | --- | --- | --- | --- | --- |
| SmolLM2-135M-Instruct-Q4_K_M | SmolLM2 135M | 70MB | 8K | 135M | Instant loading, testing |
| SmolLM2-360M-Instruct-Q4_K_M | SmolLM2 360M | 234MB | 8K | 360M | Very small, fast responses |
| Qwen2.5-0.5B-Instruct-Q4_K_M | Qwen 2.5 0.5B | 386MB | 4K | 500M | Tiny with great quality |

Small (500MB -- 1GB)

| Catalog Key | Name | Size | Context | Params | Best For |
| --- | --- | --- | --- | --- | --- |
| Qwen3-0.6B-Q4_K_M | Qwen3 0.6B | 530MB | 40K | 600M | Fast multilingual reasoning, hybrid thinking |
| TinyLlama-1.1B-Chat-Q4_K_M | TinyLlama 1.1B Chat | 670MB | 2K | 1.1B | Classic, fast and reliable |
| Llama-3.2-1B-Instruct-Q4_K_M | Llama 3.2 1B | 750MB | 128K | 1.2B | General purpose, huge context |
| Qwen2.5-1.5B-Instruct-Q4_K_M | Qwen 2.5 1.5B | 986MB | 32K | 1.5B | Multilingual |

Medium (1 -- 2GB)

| Catalog Key | Name | Size | Context | Params | Best For |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-Coder-1.5B-Instruct-Q4_K_M | Qwen 2.5 Coder 1.5B | 1.0GB | 32K | 1.5B | Code-specialized, programming |
| SmolLM2-1.7B-Instruct-Q4_K_M | SmolLM2 1.7B | 1.06GB | 8K | 1.7B | Efficient per-param quality |
| DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M | DeepSeek R1 1.5B | 1.1GB | 128K | 1.5B | Reasoning/thinking, chain-of-thought |
| Qwen3-1.7B-Q4_K_M | Qwen3 1.7B | 1.2GB | 40K | 1.7B | Multilingual reasoning, hybrid thinking |
| Phi-3.5-mini-instruct-Q4_K_M | Phi 3.5 Mini | 1.24GB | 4K | 3.8B | Reasoning, coding |
| Gemma-2-2B-IT-Q4_K_M | Gemma 2 2B IT | 1.3GB | 8K | 2B | Instruction following |
| Llama-3.2-3B-Instruct-Q4_K_M | Llama 3.2 3B | 1.93GB | 128K | 3.2B | High quality, huge context |
| Qwen2.5-3B-Instruct-Q4_K_M | Qwen 2.5 3B | 1.94GB | 32K | 3B | High quality multilingual |

Large (2GB+)

| Catalog Key | Name | Size | Context | Params | Best For |
| --- | --- | --- | --- | --- | --- |
| Phi-4-mini-instruct-Q4_K_M | Phi-4 Mini | 2.3GB | 4K | 3.8B | Strong reasoning and coding |
| Qwen3-4B-Q4_K_M | Qwen3 4B | 2.7GB | 40K | 4B | Excellent multilingual reasoning and code |
| Qwen2.5-Coder-7B-Instruct-Q4_K_M | Qwen 2.5 Coder 7B | 4.5GB | 32K | 7B | Best code generation quality |
| Mistral-7B-Instruct-v0.3-Q4_K_M | Mistral 7B v0.3 | 4.37GB | 32K | 7.2B | Strong general performance |
| DeepSeek-R1-Distill-Qwen-7B-Q4_K_M | DeepSeek R1 7B | 4.7GB | 128K | 7B | Strong reasoning/thinking, 8GB+ RAM |
| Llama-3.1-8B-Instruct-Q4_K_M | Llama 3.1 8B | 4.92GB | 128K | 8B | Best quality (8GB+ RAM) |

Vision-Language (UI grounding)

| Catalog Key | Name | Size | Context | Params | Best For |
| --- | --- | --- | --- | --- | --- |
| Holo2-4B-Q4_K_M | Holo2 4B | 2.8GB | 256K | 4B | UI grounding (vision), browser-agent / GUI navigation |
| Holo2-8B-Q4_K_M | Holo2 8B | 5.1GB | 256K | 8B | Premium UI grounding (vision), 8GB+ RAM required |
| Gemma-4-E2B-IT-Q4_K_M | Gemma 4 E2B IT | 3.46GB | 128K | 5.1B (2.3B eff.) | Google Gemma 4, vision + tool calling |
| Gemma-4-E4B-IT-Q4_K_M | Gemma 4 E4B IT | 5.41GB | 128K | 8B (~4B eff.) | Google Gemma 4, vision + tool calling, 8GB+ RAM |

Embedding Models

| Catalog Key | Name | Size | Dimensions | Best For |
| --- | --- | --- | --- | --- |
| nomic-embed-text-v1.5-Q4_K_M | Nomic Embed Text v1.5 | 78MB | 768 | High-quality semantic search |
| mxbai-embed-large-v1-Q4_K_M | MxBai Embed Large v1 | 197MB | 1024 | Top-quality English embeddings |
| bge-small-en-v1.5-Q8_0 | BGE Small EN v1.5 | 35MB | 384 | Lightweight on-device embeddings |

Reranker Models

| Catalog Key | Name | Size | Context | Best For |
| --- | --- | --- | --- | --- |
| jina-reranker-v2-base-multilingual-Q4_K_M | Jina Reranker v2 | 163MB | 1K | Multilingual cross-encoder reranking |
| bge-reranker-v2-m3-Q4_K_M | BGE Reranker v2 M3 | 218MB | 8K | Multilingual reranking with long context |

Access the catalog programmatically:

import { WLLAMA_MODELS, getModelCategory } from '@localmode/wllama';
import type { WllamaModelId } from '@localmode/wllama';

// Iterate all 30 curated models
for (const [id, info] of Object.entries(WLLAMA_MODELS)) {
  const category = getModelCategory(info.sizeBytes);
  console.log(`[${category}] ${info.name}: ${info.size}, ${info.description}`);
}

// Type-safe catalog key
const modelId: WllamaModelId = 'Llama-3.2-1B-Instruct-Q4_K_M';
const entry = WLLAMA_MODELS[modelId];
console.log(entry.url); // HuggingFace download URL
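
The tier headings above imply size boundaries that can drive model selection on memory-constrained devices. The sketch below mirrors those boundaries and picks the largest model that fits a download budget. `tierFor`, `largestFitting`, and the `CatalogRow` shape are hypothetical names for illustration; the real `getModelCategory` may use different labels or boundaries, and the sample sizes are copied from the tables above.

```typescript
type Tier = 'tiny' | 'small' | 'medium' | 'large';

interface CatalogRow {
  key: string;
  sizeMB: number; // download size in decimal megabytes, as listed in the tables
}

// Tier boundaries taken from the section headings above (assumed decimal units).
function tierFor(sizeMB: number): Tier {
  if (sizeMB < 500) return 'tiny';
  if (sizeMB < 1000) return 'small';
  if (sizeMB < 2000) return 'medium';
  return 'large';
}

// Return the largest row that fits the budget, or undefined if none does.
function largestFitting(rows: CatalogRow[], budgetMB: number): CatalogRow | undefined {
  return rows
    .filter((row) => row.sizeMB <= budgetMB) // filter() copies, so sort() below is safe
    .sort((a, b) => b.sizeMB - a.sizeMB)[0];
}

const sampleRows: CatalogRow[] = [
  { key: 'SmolLM2-135M-Instruct-Q4_K_M', sizeMB: 70 },
  { key: 'Llama-3.2-1B-Instruct-Q4_K_M', sizeMB: 750 },
  { key: 'Llama-3.2-3B-Instruct-Q4_K_M', sizeMB: 1930 },
  { key: 'Llama-3.1-8B-Instruct-Q4_K_M', sizeMB: 4920 },
];

const pick = largestFitting(sampleRows, 2000);
console.log(pick ? `${pick.key} (${tierFor(pick.sizeMB)})` : 'no model fits');
```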

Text Generation

True Streaming

doStream() streams token-by-token via createChatCompletion({ stream: true }), delivering each token as it is generated rather than buffering the full response:

import { streamText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const result = await streamText({
  model: wllama.languageModel('SmolLM2-135M-Instruct-Q4_K_M'),
  prompt: 'Write a poem',
});

let output = '';
for await (const chunk of result.stream) {
  // process.stdout exists only in Node.js; in the browser, append to your UI instead
  output += chunk.text;
}

Non-Streaming

import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const { text, usage } = await generateText({
  model: wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M'),
  prompt: 'What is the capital of France?',
});

console.log(text);
console.log('Tokens used:', usage.totalTokens);

Embedding Models

Generate text embeddings from GGUF embedding models using wllama.embedding():

import { embed, embedMany } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.embedding('nomic-embed-text-v1.5-Q4_K_M');

// Single embedding
const { embedding } = await embed({ model, value: 'Hello world' });
console.log(embedding.length); // 768

// Batch embeddings
const { embeddings } = await embedMany({
  model,
  values: ['First document', 'Second document'],
});

Embedding dimensions are auto-detected from GGUF metadata. You can also use any GGUF embedding model from HuggingFace:

const model = wllama.embedding(
  'nomic-ai/nomic-embed-text-v1.5-GGUF:nomic-embed-text-v1.5.Q4_K_M.gguf'
);
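
Embeddings are usually compared with cosine similarity. The sketch below is a plain implementation that does not depend on the library. It accepts `ArrayLike<number>` because this page does not state whether `embed()` returns a `number[]` or a typed array, so that part is an assumption.

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // A zero vector has no direction; report no similarity instead of NaN.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

console.log(cosineSimilarity([1, 2], [2, 4])); // parallel vectors
console.log(cosineSimilarity([1, 0], [0, 1])); // orthogonal vectors
```

Vectors from different models (for example, 768 versus 1024 dimensions) are not comparable, which is why the function throws on a length mismatch.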

Tool Calling

Models that support tool calling can use OAI-compatible tools via providerOptions.wllama:

import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');

const result = await generateText({
  model,
  prompt: 'What is the weather in Tokyo?',
  providerOptions: {
    wllama: {
      tools: [{
        type: 'function',
        function: {
          name: 'get_weather',
          description: 'Get current weather for a city',
          parameters: {
            type: 'object',
            properties: { city: { type: 'string' } },
            required: ['city'],
          },
        },
      }],
      tool_choice: 'auto',
    },
  },
});

// result.toolCalls contains the tool invocations when the model calls a tool

Models with supportsToolCalling: true in the catalog have been verified for tool calling. Check WLLAMA_MODELS[modelId].supportsToolCalling to confirm support.
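
Once the model requests a tool, the application still has to run it. The sketch below shows one way to dispatch calls to local handlers. The `ToolCallLike` shape (a `function.name` plus JSON-encoded `function.arguments`) is an assumption based on the OpenAI convention; check the actual type of `result.toolCalls` before relying on it. The `get_weather` handler is a stand-in that performs no real lookup.

```typescript
// Assumed OAI-style shape; the real element type of result.toolCalls may differ.
interface ToolCallLike {
  function: { name: string; arguments: string };
}

type ToolHandler = (args: Record<string, unknown>) => string;

// A Map avoids accidental matches on Object.prototype keys such as "toString".
const toolHandlers = new Map<string, ToolHandler>([
  // Stand-in implementation; a real handler would query a weather service.
  ['get_weather', (args) => `Weather lookup requested for ${String(args['city'])}`],
]);

function runToolCalls(calls: ToolCallLike[]): string[] {
  return calls.map((call) => {
    const handler = toolHandlers.get(call.function.name);
    if (!handler) return `Unknown tool: ${call.function.name}`;
    try {
      const parsed: unknown = JSON.parse(call.function.arguments);
      if (typeof parsed !== 'object' || parsed === null) {
        return `Arguments for ${call.function.name} must be a JSON object`;
      }
      return handler(parsed as Record<string, unknown>);
    } catch {
      // Small models sometimes emit malformed JSON; report it instead of crashing.
      return `Invalid JSON arguments for ${call.function.name}`;
    }
  });
}

console.log(runToolCalls([
  { function: { name: 'get_weather', arguments: '{"city":"Tokyo"}' } },
]));
```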

Vision / Multimodal

Vision-language models accept image input alongside text. Catalog entries for vision models (Holo2 4B/8B, Gemma 4 E2B/E4B) include mmprojUrl automatically:

import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('Holo2-4B-Q4_K_M');
console.log(model.supportsVision); // true

// base64ImageData is assumed to hold the base64-encoded PNG bytes (defined elsewhere)
const { text } = await generateText({
  model,
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Describe this screenshot.' },
      { type: 'image', data: base64ImageData, mimeType: 'image/png' },
    ],
  }],
});

For custom vision models, pass mmprojUrl in the model settings:

const model = wllama.languageModel('my-repo/my-vlm-GGUF:model.gguf', {
  mmprojUrl: 'https://huggingface.co/my-repo/my-vlm-GGUF/resolve/main/mmproj-f16.gguf',
});

Images are passed as base64-encoded data in multimodal content parts. The provider converts them to ArrayBuffers internally for wllama v3's vision API.

WebGPU Acceleration

Enable WebGPU to offload transformer layers to the GPU for faster inference:

import { wllama } from '@localmode/wllama';

// Enable GPU offload (falls back to WASM if WebGPU unavailable)
const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
  useWebGPU: true,
});

// Auto-detect WebGPU availability
const model2 = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
  useWebGPU: 'auto',
});

// Fine-grained control: offload specific number of layers (-1 for all)
const model3 = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
  nGpuLayers: 20,
});

console.log(model.gpuAccelerated); // true when WebGPU is active

WebGPU acceleration also works with embedding models:

const embedModel = wllama.embedding('nomic-embed-text-v1.5-Q4_K_M', {
  useWebGPU: 'auto',
});

nGpuLayers takes precedence over useWebGPU. Use -1 to offload all layers to GPU.
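
The precedence rule can be read as a small decision function. The sketch below is an interpretation of the behavior described above, not the provider's actual code: `effectiveGpuLayers` is a hypothetical name, and two details are assumptions (that `useWebGPU: true` or `'auto'` offloads all layers, and that an explicit `nGpuLayers` is ignored when WebGPU is unavailable).

```typescript
interface GpuSettings {
  useWebGPU?: boolean | 'auto';
  nGpuLayers?: number;
}

// Returns the number of layers to offload: -1 means all, 0 means CPU/WASM only.
function effectiveGpuLayers(settings: GpuSettings, webgpuAvailable: boolean): number {
  // Without WebGPU there is nothing to offload to (assumed WASM fallback).
  if (!webgpuAvailable) return 0;
  // An explicit layer count wins over the useWebGPU flag.
  if (settings.nGpuLayers !== undefined) return settings.nGpuLayers;
  // true and 'auto' are both truthy; offloading every layer is an assumption.
  return settings.useWebGPU ? -1 : 0;
}

console.log(effectiveGpuLayers({ useWebGPU: false, nGpuLayers: 20 }, true));
console.log(effectiveGpuLayers({ useWebGPU: 'auto' }, true));
console.log(effectiveGpuLayers({ useWebGPU: true }, false));
```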

Jinja Chat Templates

wllama v3 uses the model's built-in Jinja chat template for accurate prompt formatting. This is enabled by default. If a model's template causes errors, wllama automatically falls back to raw completion mode with a console warning.

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
  useJinja: false, // disable Jinja templates (use raw completion)
});

Configuration

Model Options

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
  systemPrompt: 'You are a helpful coding assistant.',
  temperature: 0.7,
  maxTokens: 1024,
  topP: 0.9,
  contextLength: 4096,
});

Custom Provider

import { createWllama } from '@localmode/wllama';

const myWllama = createWllama({
  numThreads: 4,
  onProgress: (progress) => {
    console.log(`Loading: ${progress.progress?.toFixed(1)}%`);
    console.log(`Status: ${progress.text}`);
  },
});

const model = myWllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');

Provider-Specific Sampling

wllama supports additional sampling parameters via providerOptions:

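import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');

const { text } = await generateText({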
  model,
  prompt: 'Hello!',
  providerOptions: {
    wllama: {
      top_k: 40,
      repeat_penalty: 1.1,
      mirostat: 2,
      mirostat_tau: 5.0,
      mirostat_eta: 0.1,
    },
  },
});
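
To make two of these parameters concrete, the sketch below applies a repeat penalty and a top-k filter to a toy logit array. It follows the llama.cpp convention for the penalty (positive logits are divided, non-positive ones multiplied) but is a simplified illustration, not the sampler wllama runs. The function names and toy values are made up for this example.

```typescript
// Penalize tokens that already appeared in the recent context.
function applyRepeatPenalty(logits: number[], recentTokens: number[], penalty: number): number[] {
  const out = logits.slice();
  const seen = new Set<number>();
  for (const token of recentTokens) {
    if (seen.has(token)) continue; // penalize each token once
    seen.add(token);
    const value = out[token];
    if (value === undefined) continue; // token id outside the vocabulary
    // Dividing a positive logit and multiplying a negative one both lower it.
    out[token] = value > 0 ? value / penalty : value * penalty;
  }
  return out;
}

// Keep the k highest logits and mask everything else with -Infinity.
function topKFilter(logits: number[], k: number): number[] {
  const keep = new Set(
    logits
      .map((value, index) => ({ value, index }))
      .sort((a, b) => b.value - a.value)
      .slice(0, k)
      .map((entry) => entry.index)
  );
  return logits.map((value, index) => (keep.has(index) ? value : -Infinity));
}

const penalized = applyRepeatPenalty([2, -2, 1], [0, 1, 0], 2);
console.log(penalized);
console.log(topKFilter([0.5, 3, 1, 2], 2));
```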

Structured Output / JSON Mode

Force the model to output valid JSON using providerOptions.wllama.response_format. Three format types are supported: text (default), json_object, and json_schema:

import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

// JSON object mode — model outputs valid JSON
const { text } = await generateText({
  model: wllama.languageModel('Qwen3-1.7B-Q4_K_M'),
  prompt: 'List 3 colors as JSON array',
  providerOptions: { wllama: { response_format: { type: 'json_object' } } },
});

console.log(JSON.parse(text)); // ["red", "green", "blue"]

// JSON schema mode — constrain output to a specific schema
const { text: structured } = await generateText({
  model: wllama.languageModel('Qwen3-1.7B-Q4_K_M'),
  prompt: 'Describe a person',
  providerOptions: {
    wllama: {
      response_format: {
        type: 'json_schema',
        json_schema: {
          name: 'person',
          schema: {
            type: 'object',
            properties: {
              name: { type: 'string' },
              age: { type: 'number' },
            },
            required: ['name', 'age'],
          },
          strict: true,
        },
      },
    },
  },
});
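
Even with a schema constraint, it is prudent to validate the returned text before using it, since generation can stop early (for example, at the token limit) and leave incomplete JSON. The sketch below is one defensive pattern for the `person` schema above. `isPerson` and `parsePerson` are hypothetical helpers, not library exports.

```typescript
interface Person {
  name: string;
  age: number;
}

// Runtime type guard matching the "person" schema used above.
function isPerson(value: unknown): value is Person {
  if (typeof value !== 'object' || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return typeof candidate['name'] === 'string' && typeof candidate['age'] === 'number';
}

// Returns the parsed object, or null when the text is not valid JSON of that shape.
function parsePerson(text: string): Person | null {
  try {
    const parsed: unknown = JSON.parse(text);
    return isPerson(parsed) ? parsed : null;
  } catch {
    return null;
  }
}

console.log(parsePerson('{"name":"Ada","age":36}'));
console.log(parsePerson('{"name":"Ada","age":')); // truncated output
```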

Reranking

Use wllama.reranker() with cross-encoder models to rerank search results by relevance. The catalog ships two reranker models (Jina Reranker v2 and BGE Reranker v2 M3):

import { rerank } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const { results } = await rerank({
  model: wllama.reranker('jina-reranker-v2-base-multilingual-Q4_K_M'),
  query: 'machine learning',
  documents: ['AI paper', 'cooking recipe', 'deep learning tutorial'],
});

// results sorted by relevance score
console.log(results[0]); // { index: 2, score: 0.95, text: 'deep learning tutorial' }
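
In a retrieval pipeline you typically keep only the best few reranked documents. The sketch below filters by a score threshold and takes the top k. It assumes the `{ index, score, text }` result shape shown in the comment above. `selectTop` is a hypothetical helper, and the sample scores are invented for the example.

```typescript
interface RerankResult {
  index: number; // position of the document in the original input array
  score: number;
  text: string;
}

// Keep at most k results whose score is at least minScore, best first.
function selectTop(results: RerankResult[], k: number, minScore = 0): RerankResult[] {
  return results
    .filter((result) => result.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Invented scores, for illustration only.
const sampleResults: RerankResult[] = [
  { index: 0, score: 0.61, text: 'AI paper' },
  { index: 1, score: 0.02, text: 'cooking recipe' },
  { index: 2, score: 0.95, text: 'deep learning tutorial' },
];

console.log(selectTop(sampleResults, 2, 0.5).map((result) => result.text));
```

The `index` field lets you map each kept result back to the original document object when you passed more than plain strings through your pipeline.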

Reasoning Mode

Enable reasoning mode for DeepSeek-R1 thinking models. The model produces a chain-of-thought in <think> tags before generating the final answer:

import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M', {
  reasoning: true,
  reasoningFormat: 'deepseek',
  reasoningBudgetTokens: 1024,
});

const { text } = await generateText({
  model,
  prompt: 'What is 15% of 240?',
});

Models with supportsReasoning: true in the catalog have been verified for reasoning mode. Check WLLAMA_MODELS[modelId].supportsReasoning to confirm support.
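
If the returned text still contains the raw `<think>` block, you may want to separate the reasoning from the final answer before displaying it. The sketch below does this with a regular expression. Whether the provider already strips or separates the reasoning is not stated here, so treat `splitReasoning` as a hypothetical post-processing step that is only needed when the tags appear in the output.

```typescript
interface ReasoningSplit {
  reasoning: string;
  answer: string;
}

// Extract the first <think>...</think> block and return the remaining text as the answer.
function splitReasoning(text: string): ReasoningSplit {
  const match = /<think>([\s\S]*?)<\/think>/.exec(text);
  if (!match) return { reasoning: '', answer: text.trim() };
  const before = text.slice(0, match.index);
  const after = text.slice(match.index + match[0].length);
  return {
    reasoning: (match[1] ?? '').trim(),
    answer: (before + after).trim(),
  };
}

const split = splitReasoning('<think>15% is 0.15; 0.15 * 240 = 36</think>\nThe answer is 36.');
console.log(split.reasoning);
console.log(split.answer);
```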

Performance Configuration

Fine-tune inference performance with KV cache quantization, flash attention, and speculative decoding:

import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('Qwen3-4B-Q4_K_M', {
  cacheTypeK: 'q4_0',   // Quantize key cache (reduces memory usage)
  cacheTypeV: 'q4_0',   // Quantize value cache
  flashAttention: true,  // Enable flash attention for faster inference
});

Valid cache type values (from least to most aggressive compression): f32, f16, q8_0, q5_1, q5_0, q4_1, q4_0.

When to use KV cache quantization

KV cache quantization (cacheTypeK / cacheTypeV) reduces memory usage at a small quality cost. This is especially useful for large context windows or memory-constrained devices. q4_0 is the most aggressive; q8_0 is a good middle ground.
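
To gauge how much a cache type saves, you can estimate the KV cache size from the model shape. The sketch below uses the standard estimate (layers × context × KV heads × head dimension, once for keys and once for values) with bytes per element derived from the ggml block layouts. The model dimensions are illustrative values, not read from any catalog entry, and real usage includes extra overhead this estimate ignores.

```typescript
type CacheType = 'f32' | 'f16' | 'q8_0' | 'q5_1' | 'q5_0' | 'q4_1' | 'q4_0';

// Bytes per element; quantized types store 32 elements per block plus scale data.
const BYTES_PER_ELEMENT: Record<CacheType, number> = {
  f32: 4,
  f16: 2,
  q8_0: 34 / 32,
  q5_1: 24 / 32,
  q5_0: 22 / 32,
  q4_1: 20 / 32,
  q4_0: 18 / 32,
};

interface ModelShape {
  layers: number;
  kvHeads: number;
  headDim: number;
}

// Estimated KV cache size in bytes for a given context length and cache types.
function kvCacheBytes(shape: ModelShape, contextLength: number, k: CacheType, v: CacheType): number {
  const elementsPerTensor = shape.layers * contextLength * shape.kvHeads * shape.headDim;
  return elementsPerTensor * (BYTES_PER_ELEMENT[k] + BYTES_PER_ELEMENT[v]);
}

// Illustrative dimensions for a small model; substitute values from your GGUF metadata.
const exampleShape: ModelShape = { layers: 16, kvHeads: 8, headDim: 64 };
const toMiB = (bytes: number): string => (bytes / 1024 / 1024).toFixed(0);

console.log(`f16:  ${toMiB(kvCacheBytes(exampleShape, 4096, 'f16', 'f16'))} MiB`);   // f16:  128 MiB
console.log(`q4_0: ${toMiB(kvCacheBytes(exampleShape, 4096, 'q4_0', 'q4_0'))} MiB`); // q4_0: 36 MiB
```

Because the estimate scales linearly with context length, the savings matter most at the long context windows some catalog models advertise.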

Speculative Decoding

Use a small draft model alongside the main model for up to 2-3x faster inference. The actual gain depends on how often the main model accepts the draft model's tokens:

const model = wllama.languageModel('Llama-3.1-8B-Instruct-Q4_K_M', {
  specDraftModel: 'https://huggingface.co/bartowski/SmolLM2-135M-Instruct-GGUF/resolve/main/SmolLM2-135M-Instruct-Q4_K_M.gguf',
  specDraftNgl: -1,    // Offload draft model to GPU
  specDraftNMin: 2,    // Minimum draft tokens
  specDraftNMax: 8,    // Maximum draft tokens
  specDraftPMin: 0.4,  // Minimum probability threshold
});
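
The idea behind speculative decoding is that the draft model proposes a short run of tokens (bounded by settings such as specDraftNMin and specDraftNMax) and the main model checks them in one pass. The sketch below shows a simplified greedy verification step with token ids as plain numbers. It illustrates the general technique only; llama.cpp's implementation uses probability-based acceptance and is more involved, and `verifyDraft` is a hypothetical name.

```typescript
// One verification step of greedy speculative decoding (simplified).
// draft:  tokens proposed by the small model.
// target: the main model's own choice at each drafted position, plus one extra
//         position after the last draft token (so length = draft.length + 1).
function verifyDraft(draft: number[], target: number[]): number[] {
  if (target.length !== draft.length + 1) {
    throw new Error('target must contain exactly one more token than draft');
  }
  const accepted: number[] = [];
  for (let i = 0; i < draft.length; i++) {
    const token = draft[i];
    if (token === undefined || token !== target[i]) break; // first disagreement ends the run
    accepted.push(token);
  }
  // The main model always contributes one token: the correction at the first
  // mismatch, or a bonus token when every draft token matched.
  const next = target[accepted.length];
  if (next !== undefined) accepted.push(next);
  return accepted;
}

console.log(verifyDraft([5, 6, 7], [5, 6, 9, 1])); // two draft tokens kept, then the correction
console.log(verifyDraft([5, 6, 7], [5, 6, 7, 8])); // all draft tokens kept, plus a bonus token
```

Each step yields at least one token, so output matches what the main model would have produced greedily. The speed-up comes from steps where several draft tokens are accepted at once.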

Grammar Sampling

Constrain model output to match a GBNF grammar. This is useful for generating structured data like email addresses, dates, or domain-specific formats:

import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');

const { text } = await generateText({
  model,
  prompt: 'Generate a valid email address',
  providerOptions: {
    wllama: {
      grammar: 'root ::= [a-z]+ "@" [a-z]+ "." [a-z]+',
    },
  },
});

console.log(text); // e.g., "user@example.com"
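
A common use of grammars is restricting the output to a fixed set of labels, for example in classification. The sketch below builds such a GBNF string from a list of choices. `enumGrammar` is a hypothetical helper, and it escapes only backslashes and double quotes, so choices containing newlines or other control characters would need extra handling.

```typescript
// Build a GBNF grammar whose root rule accepts exactly one of the given literals.
function enumGrammar(choices: string[]): string {
  if (choices.length === 0) throw new Error('At least one choice is required');
  const literals = choices.map((choice) => {
    // Escape backslashes first, then double quotes, for use inside a GBNF string literal.
    const escaped = choice.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
    return `"${escaped}"`;
  });
  return `root ::= ${literals.join(' | ')}`;
}

console.log(enumGrammar(['positive', 'negative', 'neutral']));
// root ::= "positive" | "negative" | "neutral"
```

The resulting string can be passed as the `grammar` value in `providerOptions.wllama`, in place of the hand-written grammar shown above.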

LoRA Adapters

Load LoRA adapters alongside a base model for fine-tuned behavior:

import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
  loraAdapters: [
    { path: 'https://your-cdn.com/adapters/my-lora.gguf', scale: 1.0 },
  ],
});

Set loraInitWithoutApply: true to load adapters without applying them immediately (for manual control).

Model Preloading

Preload models during app initialization:

import { preloadModel, isModelCached } from '@localmode/wllama';

const modelId = 'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf';

if (!(await isModelCached(modelId))) {
  await preloadModel(modelId, {
    onProgress: (progress) => {
      updateLoadingBar(progress.progress ?? 0);
    },
  });
}

Model Management

List and Clear Cached Models

import { listCachedModels, clearAllModelCache } from '@localmode/wllama';

// List all cached models
const models = await listCachedModels();
console.log(`${models.length} models cached`);

// Clear all cached models at once
await clearAllModelCache();

Delete a Single Cached Model

import { deleteModelCache } from '@localmode/wllama';

await deleteModelCache('bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf');

Re-download a Cached Model

Use refreshModel() to delete a corrupted cache and re-download:

import { refreshModel } from '@localmode/wllama';

await refreshModel('SmolLM2-135M-Instruct-Q4_K_M', {
  onProgress: (p) => console.log(`${(p.progress ?? 0).toFixed(1)}%`),
});

CORS Multi-Threading

wllama uses SharedArrayBuffer for multi-threaded WASM execution. Browsers expose SharedArrayBuffer only on cross-origin isolated pages, which you enable with these response headers:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

Without these headers, wllama automatically falls back to single-threaded mode (~2-4x slower).

import { isCrossOriginIsolated } from '@localmode/wllama';

if (isCrossOriginIsolated()) {
  console.log('Multi-threading enabled');
} else {
  console.log('Single-thread fallback (add CORS headers for 2-4x speed)');
}
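
If you set `numThreads` yourself through `createWllama`, you can derive it from the same signals. The sketch below takes the isolation flag and core count as inputs so it can run outside a browser. `pickThreadCount` is a hypothetical helper, and the default cap of 4 is an assumption that mirrors the `numThreads: 4` example on this page rather than a documented limit.

```typescript
interface ThreadEnv {
  crossOriginIsolated?: boolean; // globalThis.crossOriginIsolated in a browser
  hardwareConcurrency?: number;  // navigator.hardwareConcurrency in a browser
}

// Choose a thread count: 1 without isolation, otherwise cores capped at maxThreads.
function pickThreadCount(env: ThreadEnv, maxThreads = 4): number {
  // SharedArrayBuffer (and therefore threading) needs cross-origin isolation.
  if (!env.crossOriginIsolated) return 1;
  const cores = env.hardwareConcurrency ?? 1;
  return Math.max(1, Math.min(cores, maxThreads));
}

console.log(pickThreadCount({ crossOriginIsolated: true, hardwareConcurrency: 8 }));
console.log(pickThreadCount({ crossOriginIsolated: false, hardwareConcurrency: 8 }));
```

In a browser you would build the input from `globalThis.crossOriginIsolated` and `navigator.hardwareConcurrency`, both standard web platform properties.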

Next.js CORS Headers

Add to your next.config.js:

// next.config.js
module.exports = {
  // Send the isolation headers on every route
  async headers() {
    return [{
      source: '/(.*)',
      headers: [
        { key: 'Cross-Origin-Opener-Policy', value: 'same-origin' },
        { key: 'Cross-Origin-Embedder-Policy', value: 'require-corp' },
      ],
    }];
  },
};

WebLLM Fallback Pattern

Use wllama as a fallback when WebGPU is not available:

import { webllm } from '@localmode/webllm';
import { wllama } from '@localmode/wllama';

let model;
try {
  model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
} catch (error) {
  console.warn('WebLLM unavailable, falling back to wllama:', error);
  model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
}
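
The try/catch above only helps if the WebLLM factory throws at creation time. A complementary approach is to check for WebGPU up front. The sketch below decides on a backend from a navigator-like object. `chooseBackend` and `NavigatorLike` are hypothetical names, and the presence of `navigator.gpu` does not guarantee a usable adapter (`requestAdapter()` can still resolve to null), so treat this as a first-pass check.

```typescript
type Backend = 'webllm' | 'wllama';

// Minimal structural type so the function can be exercised without a browser.
interface NavigatorLike {
  gpu?: unknown;
}

// Prefer WebLLM when the WebGPU entry point exists, otherwise use wllama.
function chooseBackend(nav: NavigatorLike | undefined): Backend {
  const hasWebGPU = nav !== undefined && nav.gpu !== undefined && nav.gpu !== null;
  return hasWebGPU ? 'webllm' : 'wllama';
}

console.log(chooseBackend({ gpu: {} })); // WebGPU entry point present
console.log(chooseBackend({}));          // no WebGPU
```

In application code you would call `chooseBackend(navigator)` and then create the model with the matching provider and model ID, as in the example above.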

Browser Support

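| Browser | Support |
| --- | --- |
| Chrome 57+ | Yes |
| Edge 16+ | Yes |
| Firefox 52+ | Yes |
| Safari 11+ | Yes (requires @wllama/wllama-compat) |
| iOS Safari | Yes (requires @wllama/wllama-compat) |

wllama needs only WebAssembly support, unlike WebLLM, which requires WebGPU, so it runs in all modern browsers. Safari and iOS additionally need the compatibility package described under Safari Compatibility.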

Safari Compatibility

wllama v3 requires the WebAssembly Memory64 proposal, which is not yet supported in Safari or iOS browsers. To support Safari/iOS, install the optional @wllama/wllama-compat package, which provides a compatibility shim:

pnpm install @wllama/wllama-compat
npm install @wllama/wllama-compat
yarn add @wllama/wllama-compat
bun add @wllama/wllama-compat

Safari / iOS

Without @wllama/wllama-compat, wllama will fail to load on Safari and iOS. The compat package is an optional dependency -- only needed if your app targets those browsers.

wllama vs WebLLM

| Feature | @localmode/wllama | @localmode/webllm |
| --- | --- | --- |
| Runtime | llama.cpp WASM + optional WebGPU | MLC WebGPU |
| Browser Support | All modern browsers | WebGPU-capable only |
| Models | 30 curated + 160K+ GGUF | 32 curated MLC models |
| Embeddings | 3 GGUF embedding models | -- |
| Reranking | 2 cross-encoder models | -- |
| Reasoning | DeepSeek R1 (1.5B, 7B) | -- |
| Tool Calling | Yes (via providerOptions) | Yes |
| Vision | Holo2 4B/8B, Gemma 4 E2B/E4B (mmprojUrl) | Phi 3.5 Vision |
| Performance | Good (CPU), faster with WebGPU | Native GPU speed |
| GPU Required | No (optional WebGPU) | Yes |
| Model Format | GGUF (standard) | MLC (pre-compiled) |

Error Handling

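import { generateText, ModelLoadError, GenerationError } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');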

try {
  const { text } = await generateText({ model, prompt: 'Hello' });
} catch (error) {
  if (error instanceof ModelLoadError) {
    console.error('Failed to load model:', error.hint);
  } else if (error instanceof GenerationError) {
    console.error('Generation failed:', error.hint);
  }
}

Next Steps

Showcase Apps

| App | Description | Links |
| --- | --- | --- |
| GGUF Explorer | Browse, load, and chat with GGUF models via wllama | Demo · Source |
| LLM Chat | Chat with wllama GGUF models alongside the WebLLM, ONNX, and LiteRT backends | Demo · Source |
