Overview
wllama provider for browser LLM inference via llama.cpp WASM. Run any GGUF model without WebGPU.
@localmode/wllama
Run any GGUF model in the browser using llama.cpp compiled to WebAssembly. Access 160,000+ models from HuggingFace without WebGPU.
See it in action
Try GGUF Explorer and LLM Chat for working demos.
Features
- Universal Browser Support -- Works in Chrome, Firefox, Safari, and Edge (WASM only, no WebGPU needed)
- 160K+ Models -- Run any GGUF model from HuggingFace
- Embedding Models -- wllama.embedding() for GGUF embedding models (nomic-embed, mxbai-embed, bge-small)
- WebGPU Acceleration -- Optional GPU offload via useWebGPU and nGpuLayers for faster inference
- Tool Calling -- OAI-compatible tool calling via providerOptions.wllama.tools
- Vision / Multimodal -- Image input for vision-language models (Holo2 4B/8B, Gemma 4 E2B/E4B) via mmprojUrl
- Jinja Chat Templates -- Native template parsing for accurate prompt formatting (default on)
- GGUF Inspection -- Read model metadata before downloading via ~4KB Range requests
- Compatibility Check -- Estimate if a model will run on the current device
- Multi-Threading -- Auto-detects CORS isolation for 2-4x faster inference
- True Streaming -- Token-by-token streaming via createChatCompletion({ stream: true })
- Structured Output -- JSON mode and JSON schema via providerOptions.wllama.response_format
- Reranking -- Cross-encoder reranking via wllama.reranker() (Jina, BGE)
- Reasoning Mode -- DeepSeek-R1 thinking models with configurable reasoning budgets
- Grammar Sampling -- Constrained generation via GBNF grammars
- Performance Tuning -- KV cache quantization, flash attention, speculative decoding
- LoRA Adapters -- Load fine-tuned LoRA adapters alongside base models
- Safari Compatibility -- Optional @wllama/wllama-compat for Safari/iOS support
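GGUF inspection through a small Range request is possible because a GGUF file starts with a short fixed header. The sketch below illustrates the idea and is not the provider's own implementation: parseGgufHeader and fetchGgufHeader are hypothetical names, and the layout assumed is GGUF version 2 or later (4-byte magic "GGUF", little-endian uint32 version, uint64 tensor count, uint64 metadata key/value count).

```typescript
// Hypothetical helpers for illustration; not part of @localmode/wllama.
interface GgufHeader {
  version: number;
  tensorCount: number;
  metadataKvCount: number;
}

// Read a little-endian uint64 as a JS number (fine for counts below 2^53).
function readUint64(view: DataView, offset: number): number {
  const lo = view.getUint32(offset, true);
  const hi = view.getUint32(offset + 4, true);
  return hi * 0x100000000 + lo;
}

// Parse the fixed 24-byte prefix of a GGUF (v2+) file.
function parseGgufHeader(buf: ArrayBuffer): GgufHeader {
  if (buf.byteLength < 24) throw new Error('Need at least 24 bytes');
  const view = new DataView(buf);
  const magic = String.fromCharCode(
    view.getUint8(0), view.getUint8(1), view.getUint8(2), view.getUint8(3)
  );
  if (magic !== 'GGUF') throw new Error(`Not a GGUF file (magic: ${magic})`);
  return {
    version: view.getUint32(4, true),
    tensorCount: readUint64(view, 8),
    metadataKvCount: readUint64(view, 16),
  };
}

// Request only the first 4 KB. A server that ignores Range will send the
// whole file, so check for a 206 status in real code if that matters.
async function fetchGgufHeader(url: string): Promise<GgufHeader> {
  const res = await fetch(url, { headers: { Range: 'bytes=0-4095' } });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return parseGgufHeader(await res.arrayBuffer());
}
```

The metadata key/value pairs (architecture, context length, and so on) follow this header; decoding them needs the full GGUF type table and is left out here.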
Installation
pnpm install @localmode/wllama @localmode/core
npm install @localmode/wllama @localmode/core
yarn add @localmode/wllama @localmode/core
bun add @localmode/wllama @localmode/core
Quick Start
import { streamText } from '@localmode/core';
import { wllama } from '@localmode/wllama';
const model = wllama.languageModel(
'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);
const result = await streamText({
model,
prompt: 'Explain quantum computing in simple terms.',
});
let fullText = '';
for await (const chunk of result.stream) {
fullText += chunk.text;
// Update your UI with each chunk
}
Model Selection
Use WLLAMA_MODELS for curated models or pass any HuggingFace GGUF URL:
import { wllama, WLLAMA_MODELS } from '@localmode/wllama';
// Curated catalog entry
const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
// HuggingFace shorthand (repo:filename)
const model2 = wllama.languageModel(
'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);
// Full URL
const model3 = wllama.languageModel(
'https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);
Recommended Picks
- Testing / Prototyping: SmolLM2-135M Q4_K_M -- 70MB, instant loading
- General Purpose: Llama 3.2 1B Q4_K_M -- 750MB, good balance with 128K context
- Multilingual: Qwen3 1.7B Q4_K_M -- 1.2GB, hybrid thinking with strong multilingual support
- Reasoning / Coding: Qwen3 4B Q4_K_M -- 2.7GB, excellent reasoning and code generation
- Thinking / Chain-of-Thought: DeepSeek R1 1.5B Q4_K_M -- 1.1GB, reasoning model with <think> tags
- Higher Quality: Llama 3.2 3B Q4_K_M -- 1.93GB, excellent quality with 128K context
- Best Quality: Llama 3.1 8B Q4_K_M -- 4.92GB, requires 8GB+ RAM
- UI Grounding / Vision: Holo2 4B Q4_K_M -- 2.8GB, vision-language model for browser-agent / GUI navigation
- Reranking: Jina Reranker v2 Q4_K_M -- 163MB, multilingual cross-encoder reranking
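As a rough illustration of how the general-purpose picks above could be chosen at runtime, the sketch below returns the largest pick whose download size fits a budget. pickLargestThatFits is a hypothetical helper, not a library function; the catalog keys and sizes are copied from this page, and treating download size as the only constraint is a simplifying assumption.

```typescript
// Hypothetical helper; keys and sizes are taken from the picks listed above.
const GENERAL_PICKS = [
  { key: 'SmolLM2-135M-Instruct-Q4_K_M', sizeMB: 70 },
  { key: 'Llama-3.2-1B-Instruct-Q4_K_M', sizeMB: 750 },
  { key: 'Llama-3.2-3B-Instruct-Q4_K_M', sizeMB: 1930 },
  { key: 'Llama-3.1-8B-Instruct-Q4_K_M', sizeMB: 4920 },
];

// Return the largest pick that fits the budget, or undefined if none does.
function pickLargestThatFits(budgetMB: number): string | undefined {
  let best: string | undefined;
  for (const pick of GENERAL_PICKS) {
    // The list is sorted by ascending size, so later matches are larger.
    if (pick.sizeMB <= budgetMB) best = pick.key;
  }
  return best;
}
```

The returned key can be passed to wllama.languageModel(). Real memory use is higher than the file size because of the KV cache and runtime overhead, so leave headroom when choosing a budget.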
WLLAMA_MODELS Catalog (30 Models)
The curated catalog ships 30 quantized models across multiple categories: language models in four size tiers, vision-language models, Qwen3 models, DeepSeek R1 reasoning models, embedding models, and reranker models:
Tiny (< 500MB)
| Catalog Key | Name | Size | Context | Params | Best For |
|---|---|---|---|---|---|
| SmolLM2-135M-Instruct-Q4_K_M | SmolLM2 135M | 70MB | 8K | 135M | Instant loading, testing |
| SmolLM2-360M-Instruct-Q4_K_M | SmolLM2 360M | 234MB | 8K | 360M | Very small, fast responses |
| Qwen2.5-0.5B-Instruct-Q4_K_M | Qwen 2.5 0.5B | 386MB | 4K | 500M | Tiny with great quality |
Small (500MB -- 1GB)
| Catalog Key | Name | Size | Context | Params | Best For |
|---|---|---|---|---|---|
| Qwen3-0.6B-Q4_K_M | Qwen3 0.6B | 530MB | 40K | 600M | Fast multilingual reasoning, hybrid thinking |
| TinyLlama-1.1B-Chat-Q4_K_M | TinyLlama 1.1B Chat | 670MB | 2K | 1.1B | Classic, fast and reliable |
| Llama-3.2-1B-Instruct-Q4_K_M | Llama 3.2 1B | 750MB | 128K | 1.2B | General purpose, huge context |
| Qwen2.5-1.5B-Instruct-Q4_K_M | Qwen 2.5 1.5B | 986MB | 32K | 1.5B | Multilingual |
Medium (1 -- 2GB)
| Catalog Key | Name | Size | Context | Params | Best For |
|---|---|---|---|---|---|
| Qwen2.5-Coder-1.5B-Instruct-Q4_K_M | Qwen 2.5 Coder 1.5B | 1.0GB | 32K | 1.5B | Code-specialized, programming |
| SmolLM2-1.7B-Instruct-Q4_K_M | SmolLM2 1.7B | 1.06GB | 8K | 1.7B | Efficient per-param quality |
| DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M | DeepSeek R1 1.5B | 1.1GB | 128K | 1.5B | Reasoning/thinking, chain-of-thought |
| Qwen3-1.7B-Q4_K_M | Qwen3 1.7B | 1.2GB | 40K | 1.7B | Multilingual reasoning, hybrid thinking |
| Phi-3.5-mini-instruct-Q4_K_M | Phi 3.5 Mini | 1.24GB | 4K | 3.8B | Reasoning, coding |
| Gemma-2-2B-IT-Q4_K_M | Gemma 2 2B IT | 1.3GB | 8K | 2B | Instruction following |
| Llama-3.2-3B-Instruct-Q4_K_M | Llama 3.2 3B | 1.93GB | 128K | 3.2B | High quality, huge context |
| Qwen2.5-3B-Instruct-Q4_K_M | Qwen 2.5 3B | 1.94GB | 32K | 3B | High quality multilingual |
Large (2GB+)
| Catalog Key | Name | Size | Context | Params | Best For |
|---|---|---|---|---|---|
| Phi-4-mini-instruct-Q4_K_M | Phi-4 Mini | 2.3GB | 4K | 3.8B | Strong reasoning and coding |
| Qwen3-4B-Q4_K_M | Qwen3 4B | 2.7GB | 40K | 4B | Excellent multilingual reasoning and code |
| Qwen2.5-Coder-7B-Instruct-Q4_K_M | Qwen 2.5 Coder 7B | 4.5GB | 32K | 7B | Best code generation quality |
| Mistral-7B-Instruct-v0.3-Q4_K_M | Mistral 7B v0.3 | 4.37GB | 32K | 7.2B | Strong general performance |
| DeepSeek-R1-Distill-Qwen-7B-Q4_K_M | DeepSeek R1 7B | 4.7GB | 128K | 7B | Strong reasoning/thinking, 8GB+ RAM |
| Llama-3.1-8B-Instruct-Q4_K_M | Llama 3.1 8B | 4.92GB | 128K | 8B | Best quality (8GB+ RAM) |
Vision-Language (UI grounding)
| Catalog Key | Name | Size | Context | Params | Best For |
|---|---|---|---|---|---|
| Holo2-4B-Q4_K_M | Holo2 4B | 2.8GB | 256K | 4B | UI grounding (vision), browser-agent / GUI navigation |
| Holo2-8B-Q4_K_M | Holo2 8B | 5.1GB | 256K | 8B | Premium UI grounding (vision), 8GB+ RAM required |
| Gemma-4-E2B-IT-Q4_K_M | Gemma 4 E2B IT | 3.46GB | 128K | 5.1B (2.3B eff.) | Google Gemma 4, vision + tool calling |
| Gemma-4-E4B-IT-Q4_K_M | Gemma 4 E4B IT | 5.41GB | 128K | 8B (~4B eff.) | Google Gemma 4, vision + tool calling, 8GB+ RAM |
Embedding Models
| Catalog Key | Name | Size | Dimensions | Best For |
|---|---|---|---|---|
| nomic-embed-text-v1.5-Q4_K_M | Nomic Embed Text v1.5 | 78MB | 768 | High-quality semantic search |
| mxbai-embed-large-v1-Q4_K_M | MxBai Embed Large v1 | 197MB | 1024 | Top-quality English embeddings |
| bge-small-en-v1.5-Q8_0 | BGE Small EN v1.5 | 35MB | 384 | Lightweight on-device embeddings |
Reranker Models
| Catalog Key | Name | Size | Context | Best For |
|---|---|---|---|---|
| jina-reranker-v2-base-multilingual-Q4_K_M | Jina Reranker v2 | 163MB | 1K | Multilingual cross-encoder reranking |
| bge-reranker-v2-m3-Q4_K_M | BGE Reranker v2 M3 | 218MB | 8K | Multilingual reranking with long context |
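The four size tiers used for the language-model tables above can be written as a small function. This is a sketch of the thresholds as printed on this page, assuming decimal units (1MB = 1,000,000 bytes); sizeTier is a hypothetical name, and the library's own getModelCategory() may use different boundaries or labels.

```typescript
type SizeTier = 'tiny' | 'small' | 'medium' | 'large';

const MB = 1e6; // assumption: decimal megabytes, matching the sizes in the tables

// Map a file size in bytes to the tier headings used above:
// Tiny (< 500MB), Small (500MB to 1GB), Medium (1 to 2GB), Large (2GB+).
function sizeTier(sizeBytes: number): SizeTier {
  if (sizeBytes < 500 * MB) return 'tiny';
  if (sizeBytes < 1000 * MB) return 'small';
  if (sizeBytes < 2000 * MB) return 'medium';
  return 'large';
}
```

Vision, embedding, and reranker models sit in their own tables regardless of size, so this function only mirrors the tiering of the plain language models.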
Access the catalog programmatically:
import { WLLAMA_MODELS, getModelCategory } from '@localmode/wllama';
import type { WllamaModelId } from '@localmode/wllama';
// Iterate all 30 curated models
for (const [id, info] of Object.entries(WLLAMA_MODELS)) {
const category = getModelCategory(info.sizeBytes);
console.log(`[${category}] ${info.name}: ${info.size}, ${info.description}`);
}
// Type-safe catalog key
const modelId: WllamaModelId = 'Llama-3.2-1B-Instruct-Q4_K_M';
const entry = WLLAMA_MODELS[modelId];
console.log(entry.url); // HuggingFace download URL
Text Generation
True Streaming
doStream() streams token-by-token via createChatCompletion({ stream: true }), delivering each token as it is generated rather than buffering the full response:
import { streamText } from '@localmode/core';
import { wllama } from '@localmode/wllama';
const result = await streamText({
model: wllama.languageModel('SmolLM2-135M-Instruct-Q4_K_M'),
prompt: 'Write a poem',
});
for await (const chunk of result.stream) {
document.body.append(chunk.text); // browsers have no process.stdout; append each token to the page
}
Non-Streaming
import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';
const { text, usage } = await generateText({
model: wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M'),
prompt: 'What is the capital of France?',
});
console.log(text);
console.log('Tokens used:', usage.totalTokens);
Embedding Models
Generate text embeddings from GGUF embedding models using wllama.embedding():
import { embed, embedMany } from '@localmode/core';
import { wllama } from '@localmode/wllama';
const model = wllama.embedding('nomic-embed-text-v1.5-Q4_K_M');
// Single embedding
const { embedding } = await embed({ model, value: 'Hello world' });
console.log(embedding.length); // 768
// Batch embeddings
const { embeddings } = await embedMany({
model,
values: ['First document', 'Second document'],
});
Embedding dimensions are auto-detected from GGUF metadata. You can also use any GGUF embedding model from HuggingFace:
const model = wllama.embedding(
'nomic-ai/nomic-embed-text-v1.5-GGUF:nomic-embed-text-v1.5.Q4_K_M.gguf'
);
Tool Calling
Models that support tool calling can use OAI-compatible tools via providerOptions.wllama:
import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';
const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
const result = await generateText({
model,
prompt: 'What is the weather in Tokyo?',
providerOptions: {
wllama: {
tools: [{
type: 'function',
function: {
name: 'get_weather',
description: 'Get current weather for a city',
parameters: {
type: 'object',
properties: { city: { type: 'string' } },
required: ['city'],
},
},
}],
tool_choice: 'auto',
},
},
});
// result.toolCalls contains the tool invocations when the model calls a tool
Models with supportsToolCalling: true in the catalog have been verified for tool calling. Check WLLAMA_MODELS[modelId].supportsToolCalling to confirm support.
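Once the model returns tool invocations, the application still has to execute them. The following sketch shows one way to dispatch calls to local handlers. It assumes each entry follows the OpenAI tool-call shape (a function name plus a JSON-encoded arguments string); the exact shape of result.toolCalls is not documented here, so check it before relying on this. The handlers registry, runToolCalls, and the stubbed weather data are all hypothetical.

```typescript
// Assumed OpenAI-style tool call; verify against result.toolCalls in your version.
interface ToolCall {
  id: string;
  type: 'function';
  function: { name: string; arguments: string }; // arguments is a JSON string
}

type ToolHandler = (args: Record<string, unknown>) => unknown | Promise<unknown>;

// Hypothetical registry keyed by tool name.
const handlers: Record<string, ToolHandler> = {
  // Stub implementation; a real handler would call a weather service.
  get_weather: (args) => ({ city: args.city, forecast: 'stubbed' }),
};

// Run every tool call in order and collect the outputs.
async function runToolCalls(calls: ToolCall[]) {
  const results: { id: string; name: string; output: unknown }[] = [];
  for (const call of calls) {
    const name = call.function.name;
    const handler = handlers[name];
    if (!handler) throw new Error(`Unknown tool: ${name}`);
    const args = JSON.parse(call.function.arguments) as Record<string, unknown>;
    results.push({ id: call.id, name, output: await handler(args) });
  }
  return results;
}
```

Rejecting unknown tool names, rather than ignoring them, makes it obvious when a small model hallucinates a tool that was never declared.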
Vision / Multimodal
Vision-language models accept image input alongside text. Catalog entries for vision models (Holo2 4B/8B, Gemma 4 E2B/E4B) include mmprojUrl automatically:
import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';
const model = wllama.languageModel('Holo2-4B-Q4_K_M');
console.log(model.supportsVision); // true
const { text } = await generateText({
model,
messages: [{
role: 'user',
content: [
{ type: 'text', text: 'Describe this screenshot.' },
{ type: 'image', data: base64ImageData, mimeType: 'image/png' },
],
}],
});
For custom vision models, pass mmprojUrl in the model settings:
const model = wllama.languageModel('my-repo/my-vlm-GGUF:model.gguf', {
mmprojUrl: 'https://huggingface.co/my-repo/my-vlm-GGUF/resolve/main/mmproj-f16.gguf',
});
Images are passed as base64-encoded data in multimodal content parts. The provider converts them to ArrayBuffers internally for wllama v3's vision API.
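To make the base64-to-binary step concrete, here is a minimal sketch of that kind of conversion. It is not the provider's internal code; base64ToBytes is a hypothetical helper, and stripping a data URL prefix is an extra convenience assumed here for inputs that come from FileReader or canvas.toDataURL().

```typescript
// Hypothetical helper: decode base64 (optionally a data URL) into raw bytes.
function base64ToBytes(input: string): Uint8Array {
  // Strip an optional prefix such as "data:image/png;base64,".
  const base64 = input.replace(/^data:[^,]*,/, '');
  const binary = atob(base64); // one character per decoded byte
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// The underlying ArrayBuffer is available as bytes.buffer.
const demoBytes = base64ToBytes('data:image/png;base64,AQID');
console.log(Array.from(demoBytes)); // [1, 2, 3]
```

For the data field of an image content part, pass the bare base64 string as shown in the example above; this helper only illustrates what happens to it afterwards.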
WebGPU Acceleration
Enable WebGPU to offload transformer layers to the GPU for faster inference:
import { wllama } from '@localmode/wllama';
// Enable GPU offload (falls back to WASM if WebGPU unavailable)
const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
useWebGPU: true,
});
// Auto-detect WebGPU availability
const model2 = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
useWebGPU: 'auto',
});
// Fine-grained control: offload specific number of layers (-1 for all)
const model3 = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
nGpuLayers: 20,
});
console.log(model.gpuAccelerated); // true when WebGPU is active
WebGPU acceleration also works with embedding models:
const embedModel = wllama.embedding('nomic-embed-text-v1.5-Q4_K_M', {
useWebGPU: 'auto',
});
nGpuLayers takes precedence over useWebGPU. Use -1 to offload all layers to GPU.
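One way to read the precedence rule is as a small resolution function. The sketch below is an interpretation, not the provider's logic: resolveGpuLayers is a hypothetical name, and two behaviors are assumptions made for illustration (useWebGPU maps to "offload all layers", and an explicit nGpuLayers is also ignored when WebGPU is unavailable).

```typescript
interface GpuOptions {
  useWebGPU?: boolean | 'auto';
  nGpuLayers?: number;
}

// Returns the number of layers to offload: -1 = all, 0 = none (pure WASM).
function resolveGpuLayers(opts: GpuOptions, webgpuAvailable: boolean): number {
  // Assumption: without WebGPU everything runs on WASM, whatever was requested.
  if (!webgpuAvailable) return 0;
  // An explicit layer count wins over useWebGPU.
  if (opts.nGpuLayers !== undefined) return opts.nGpuLayers;
  // Assumption: useWebGPU true or 'auto' means offloading every layer.
  if (opts.useWebGPU === true || opts.useWebGPU === 'auto') return -1;
  return 0;
}

console.log(resolveGpuLayers({ useWebGPU: true, nGpuLayers: 20 }, true)); // 20
console.log(resolveGpuLayers({ useWebGPU: 'auto' }, false)); // 0
```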
Jinja Chat Templates
wllama v3 uses the model's built-in Jinja chat template for accurate prompt formatting. This is enabled by default. If a model's template causes errors, wllama automatically falls back to raw completion mode with a console warning.
const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
useJinja: false, // disable Jinja templates (use raw completion)
});
Configuration
Model Options
const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
systemPrompt: 'You are a helpful coding assistant.',
temperature: 0.7,
maxTokens: 1024,
topP: 0.9,
contextLength: 4096,
});
Custom Provider
import { createWllama } from '@localmode/wllama';
const myWllama = createWllama({
numThreads: 4,
onProgress: (progress) => {
console.log(`Loading: ${progress.progress?.toFixed(1)}%`);
console.log(`Status: ${progress.text}`);
},
});
const model = myWllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
Provider-Specific Sampling
wllama supports additional sampling parameters via providerOptions:
const { text } = await generateText({
model,
prompt: 'Hello!',
providerOptions: {
wllama: {
top_k: 40,
repeat_penalty: 1.1,
mirostat: 2,
mirostat_tau: 5.0,
mirostat_eta: 0.1,
},
},
});
Structured Output / JSON Mode
Force the model to output valid JSON using providerOptions.wllama.response_format. Three format types are supported: text (default), json_object, and json_schema:
import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';
// JSON object mode — model outputs valid JSON
const { text } = await generateText({
model: wllama.languageModel('Qwen3-1.7B-Q4_K_M'),
prompt: 'List 3 colors as JSON array',
providerOptions: { wllama: { response_format: { type: 'json_object' } } },
});
console.log(JSON.parse(text)); // ["red", "green", "blue"]
// JSON schema mode — constrain output to a specific schema
const { text: structured } = await generateText({
model: wllama.languageModel('Qwen3-1.7B-Q4_K_M'),
prompt: 'Describe a person',
providerOptions: {
wllama: {
response_format: {
type: 'json_schema',
json_schema: {
name: 'person',
schema: {
type: 'object',
properties: {
name: { type: 'string' },
age: { type: 'number' },
},
required: ['name', 'age'],
},
strict: true,
},
},
},
},
});
Reranking
Use wllama.reranker() with cross-encoder models to rerank search results by relevance. The catalog ships two reranker models (Jina Reranker v2 and BGE Reranker v2 M3):
import { rerank } from '@localmode/core';
import { wllama } from '@localmode/wllama';
const { results } = await rerank({
model: wllama.reranker('jina-reranker-v2-base-multilingual-Q4_K_M'),
query: 'machine learning',
documents: ['AI paper', 'cooking recipe', 'deep learning tutorial'],
});
// results sorted by relevance score
console.log(results[0]); // { index: 2, score: 0.95, text: 'deep learning tutorial' }
Reasoning Mode
Enable reasoning mode for DeepSeek-R1 thinking models. The model produces a chain-of-thought in <think> tags before generating the final answer:
import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';
const model = wllama.languageModel('DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M', {
reasoning: true,
reasoningFormat: 'deepseek',
reasoningBudgetTokens: 1024,
});
const { text } = await generateText({
model,
prompt: 'What is 15% of 240?',
});
Models with supportsReasoning: true in the catalog have been verified for reasoning mode. Check WLLAMA_MODELS[modelId].supportsReasoning to confirm support.
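If the returned text still contains the <think> block, the chain-of-thought can be separated from the final answer with a small string helper. Whether the provider already strips the tags depends on the reasoning settings and is not specified here, so treat this as a sketch for the unstripped case; splitReasoning is a hypothetical helper.

```typescript
// Hypothetical helper: split "<think>...</think>answer" into its two parts.
function splitReasoning(raw: string): { reasoning: string; answer: string } {
  const match = raw.match(/<think>([\s\S]*?)<\/think>/);
  if (!match) {
    // No reasoning block present: the whole text is the answer.
    return { reasoning: '', answer: raw.trim() };
  }
  return {
    reasoning: match[1].trim(),
    // Drop the first <think> block and keep whatever surrounds it.
    answer: raw.replace(match[0], '').trim(),
  };
}

const parts = splitReasoning('<think>15% is 0.15; 0.15 * 240 = 36</think>\nThe answer is 36.');
console.log(parts.answer); // "The answer is 36."
```

Keeping the reasoning separate lets a UI show it in a collapsible panel while only the answer is passed on to downstream code.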
Performance Configuration
Fine-tune inference performance with KV cache quantization, flash attention, and speculative decoding:
import { wllama } from '@localmode/wllama';
const model = wllama.languageModel('Qwen3-4B-Q4_K_M', {
cacheTypeK: 'q4_0', // Quantize key cache (reduces memory usage)
cacheTypeV: 'q4_0', // Quantize value cache
flashAttention: true, // Enable flash attention for faster inference
});
Valid cache type values (from least to most aggressive compression): f32, f16, q8_0, q5_1, q5_0, q4_1, q4_0.
When to use KV cache quantization
KV cache quantization (cacheTypeK / cacheTypeV) reduces memory usage at a small quality cost. This is especially useful for large context windows or memory-constrained devices. q4_0 is the most aggressive; q8_0 is a good middle ground.
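A back-of-the-envelope estimate shows how much memory the cache type saves. The sketch below uses the standard ggml block layouts (32 values per block, plus a 2-byte scale and, for the _1 types, a 2-byte minimum) to compute bytes per element. The model shape is an illustrative assumption for a 1B-class model with grouped-query attention, not a value read from the catalog, and kvCacheBytes is a hypothetical helper.

```typescript
// Bytes per cached element for each cache type (ggml blocks hold 32 values).
const BYTES_PER_ELEMENT = {
  f32: 4,
  f16: 2,
  q8_0: 34 / 32,
  q5_1: 24 / 32,
  q5_0: 22 / 32,
  q4_1: 20 / 32,
  q4_0: 18 / 32,
};
type CacheType = keyof typeof BYTES_PER_ELEMENT;

interface ModelShape {
  nLayers: number;
  nKvHeads: number;
  headDim: number;
}

// Size of the K cache plus the V cache for a full context window.
function kvCacheBytes(
  shape: ModelShape,
  contextLength: number,
  typeK: CacheType,
  typeV: CacheType
): number {
  const elementsPerSide =
    shape.nLayers * contextLength * shape.nKvHeads * shape.headDim;
  return elementsPerSide * (BYTES_PER_ELEMENT[typeK] + BYTES_PER_ELEMENT[typeV]);
}

// Assumed, illustrative shape; read the real values from GGUF metadata.
const kvShape: ModelShape = { nLayers: 16, nKvHeads: 8, headDim: 64 };
const MiB = 1024 * 1024;
console.log(kvCacheBytes(kvShape, 4096, 'f16', 'f16') / MiB); // 128
console.log(kvCacheBytes(kvShape, 4096, 'q4_0', 'q4_0') / MiB); // 36
```

Under these assumptions, q4_0 cuts the cache from 128 MiB to 36 MiB at a 4096-token context, and the saving grows linearly with contextLength.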
Speculative Decoding
Use a small draft model alongside the main model for 2-3x faster inference:
const model = wllama.languageModel('Llama-3.1-8B-Instruct-Q4_K_M', {
specDraftModel: 'https://huggingface.co/bartowski/SmolLM2-135M-Instruct-GGUF/resolve/main/SmolLM2-135M-Instruct-Q4_K_M.gguf',
specDraftNgl: -1, // Offload draft model to GPU
specDraftNMin: 2, // Minimum draft tokens
specDraftNMax: 8, // Maximum draft tokens
specDraftPMin: 0.4, // Minimum probability threshold
});
Grammar Sampling
Constrain model output to match a GBNF grammar. This is useful for generating structured data like email addresses, dates, or domain-specific formats:
import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';
const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
const { text } = await generateText({
model,
prompt: 'Generate a valid email address',
providerOptions: {
wllama: {
grammar: 'root ::= [a-z]+ "@" [a-z]+ "." [a-z]+',
},
},
});
console.log(text); // e.g., "user@example.com"
LoRA Adapters
Load LoRA adapters alongside a base model for fine-tuned behavior:
import { wllama } from '@localmode/wllama';
const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M', {
loraAdapters: [
{ path: 'https://your-cdn.com/adapters/my-lora.gguf', scale: 1.0 },
],
});
Set loraInitWithoutApply: true to load adapters without applying them immediately (for manual control).
Model Preloading
Preload models during app initialization:
import { preloadModel, isModelCached } from '@localmode/wllama';
const modelId = 'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf';
if (!(await isModelCached(modelId))) {
await preloadModel(modelId, {
onProgress: (progress) => {
updateLoadingBar(progress.progress ?? 0);
},
});
}
Model Management
List and Clear Cached Models
import { listCachedModels, clearAllModelCache } from '@localmode/wllama';
// List all cached models
const models = await listCachedModels();
console.log(`${models.length} models cached`);
// Clear all cached models at once
await clearAllModelCache();
Delete a Single Cached Model
import { deleteModelCache } from '@localmode/wllama';
await deleteModelCache('bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf');
Re-download a Cached Model
Use refreshModel() to delete a corrupted cache and re-download:
import { refreshModel } from '@localmode/wllama';
await refreshModel('SmolLM2-135M-Instruct-Q4_K_M', {
onProgress: (p) => console.log(`${(p.progress ?? 0).toFixed(1)}%`),
});
CORS Multi-Threading
wllama uses SharedArrayBuffer for multi-threaded WASM execution, which requires the page to be cross-origin isolated. Serve it with these two headers:
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
Without these headers, wllama automatically falls back to single-threaded mode (~2-4x slower).
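For servers other than Next.js, the same two headers can be applied in whatever response hook the server offers. The sketch below is framework-agnostic: applyIsolationHeaders is a hypothetical helper that works with any response object exposing setHeader(name, value), which includes Node's http.ServerResponse and Express responses.

```typescript
// The two headers listed above.
const ISOLATION_HEADERS: Record<string, string> = {
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Cross-Origin-Embedder-Policy': 'require-corp',
};

// Minimal structural type; matches Node's ServerResponse and Express's res.
interface HeaderSink {
  setHeader(name: string, value: string): void;
}

// Hypothetical helper: call this before sending any HTML response.
function applyIsolationHeaders(res: HeaderSink): void {
  for (const [name, value] of Object.entries(ISOLATION_HEADERS)) {
    res.setHeader(name, value);
  }
}
```

With require-corp in place, cross-origin resources such as CDN scripts or images must themselves opt in (via CORS or a Cross-Origin-Resource-Policy header), otherwise the browser will block them.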
import { isCrossOriginIsolated } from '@localmode/wllama';
if (isCrossOriginIsolated()) {
console.log('Multi-threading enabled');
} else {
console.log('Single-thread fallback (add CORS headers for 2-4x speed)');
}
Next.js CORS Headers
Add to your next.config.js:
// next.config.js
module.exports = {
  async headers() {
    return [{
      source: '/(.*)',
      headers: [
        { key: 'Cross-Origin-Opener-Policy', value: 'same-origin' },
        { key: 'Cross-Origin-Embedder-Policy', value: 'require-corp' },
      ],
    }];
  },
};
WebLLM Fallback Pattern
Use wllama as a fallback when WebGPU is not available:
import { webllm } from '@localmode/webllm';
import { wllama } from '@localmode/wllama';
let model;
try {
model = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
} catch (error) {
console.warn('WebLLM unavailable, falling back to wllama:', error);
model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
}
Browser Support
| Browser | Support |
|---|---|
| Chrome 57+ | Yes |
| Edge 16+ | Yes |
| Firefox 52+ | Yes |
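| Safari 11+ | Yes (requires @wllama/wllama-compat) |
| iOS Safari | Yes (requires @wllama/wllama-compat) |
wllama works in all modern browsers because it only requires WebAssembly support, unlike WebLLM, which requires WebGPU. Safari and iOS additionally need the optional @wllama/wllama-compat package described under Safari Compatibility.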
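Because WebLLM depends on WebGPU while wllama only needs WebAssembly, the choice between them can be made up front by feature detection. The sketch below uses the standard navigator.gpu.requestAdapter() browser API; hasWebGPU and pickProviderName are hypothetical helpers, and the navigator object is passed in so the logic can also be exercised outside a browser.

```typescript
// Minimal structural types so the check does not depend on WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// True only if the API exists and actually yields an adapter.
async function hasWebGPU(nav: NavigatorLike | undefined): Promise<boolean> {
  if (!nav || !nav.gpu) return false;
  try {
    // requestAdapter() resolves to null when no suitable GPU is found.
    return (await nav.gpu.requestAdapter()) != null;
  } catch {
    return false;
  }
}

// Hypothetical helper: decide which provider to construct.
async function pickProviderName(
  nav: NavigatorLike | undefined
): Promise<'webllm' | 'wllama'> {
  return (await hasWebGPU(nav)) ? 'webllm' : 'wllama';
}
```

In a browser, call pickProviderName(navigator as NavigatorLike) and then construct the model with webllm.languageModel() or wllama.languageModel() accordingly.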
Safari Compatibility
wllama v3 requires the WebAssembly Memory64 proposal, which is not yet supported in Safari or iOS browsers. To support Safari/iOS, install the optional @wllama/wllama-compat package, which provides a compatibility shim:
pnpm install @wllama/wllama-compat
npm install @wllama/wllama-compat
yarn add @wllama/wllama-compat
bun add @wllama/wllama-compat
Safari / iOS
Without @wllama/wllama-compat, wllama will fail to load on Safari and iOS. The compat package is an optional dependency -- only needed if your app targets those browsers.
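To find out at runtime whether the current browser supports Memory64, a tiny WebAssembly module that declares a 64-bit memory can be passed to WebAssembly.validate(). This is a general detection sketch, not part of either package: supportsMemory64 is a hypothetical helper, and how the result is used to load the compat package depends on that package's API, which is not covered here.

```typescript
// Smallest module with a 64-bit memory: the 8-byte wasm header followed by
// a memory section (id 5) holding one memory with limits flag 0x04 (i64 index).
const MEMORY64_PROBE = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0, // "\0asm" + version 1
  5, 3, 1, 4, 1,               // memory section: 1 memory, flag 0x04, min 1 page
]);

// Hypothetical helper: true if the engine accepts 64-bit memories.
function supportsMemory64(): boolean {
  try {
    return WebAssembly.validate(MEMORY64_PROBE);
  } catch {
    return false;
  }
}

console.log('Memory64 supported:', supportsMemory64());
```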
wllama vs WebLLM
| Feature | @localmode/wllama | @localmode/webllm |
|---|---|---|
| Runtime | llama.cpp WASM + optional WebGPU | MLC WebGPU |
| Browser Support | All modern browsers | WebGPU-capable only |
| Models | 30 curated + 160K+ GGUF | 32 curated MLC models |
| Embeddings | 3 GGUF embedding models | -- |
| Reranking | 2 cross-encoder models | -- |
| Reasoning | DeepSeek R1 (1.5B, 7B) | -- |
| Tool Calling | Yes (via providerOptions) | Yes |
| Vision | Holo2 4B/8B, Gemma 4 E2B/E4B (mmprojUrl) | Phi 3.5 Vision |
| Performance | Good (CPU), faster with WebGPU | Native GPU speed |
| GPU Required | No (optional WebGPU) | Yes |
| Model Format | GGUF (standard) | MLC (pre-compiled) |
Error Handling
import { generateText, ModelLoadError, GenerationError } from '@localmode/core';
try {
const { text } = await generateText({ model, prompt: 'Hello' });
} catch (error) {
if (error instanceof ModelLoadError) {
console.error('Failed to load model:', error.hint);
} else if (error instanceof GenerationError) {
console.error('Generation failed:', error.hint);
}
}
Next Steps
GGUF Models
Browse, inspect, and check compatibility of GGUF models.
Text Generation
Learn about streaming and generation options.
WebLLM Provider
WebGPU-accelerated alternative for compatible browsers.