GGUF Models

Browse, inspect, and check browser compatibility of GGUF models before downloading.

GGUF Model Browsing and Compatibility

GGUF is the standard format for quantized LLMs, with 160,000+ models available on HuggingFace. The @localmode/wllama package provides tools to inspect a model's metadata and check browser compatibility before you download a multi-GB file.

See it in action

Try GGUF Explorer for a working demo.

What is GGUF?

GGUF (GPT-Generated Unified Format) is a binary format for storing quantized LLMs. A single file holds the model weights, architecture metadata, tokenizer data, and configuration. The format is used by llama.cpp and supports many quantization levels (Q2_K through F32).
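To make the single-file layout concrete, here is a minimal sketch of reading the fixed-size prefix that starts a GGUF file. It assumes the version 2/3 layout (little-endian: 4-byte magic, 32-bit version, then 64-bit tensor and metadata key/value counts). The names `readGGUFHeader` and `GGUFHeader` are illustrative and are not part of @localmode/wllama.

```typescript
// Illustrative only: not the @localmode/wllama parser.
// Assumes the GGUF v2/v3 layout (little-endian, 64-bit counts).
interface GGUFHeader {
  version: number;
  tensorCount: number;
  metadataKVCount: number;
}

// Read an unsigned 64-bit little-endian integer as a JS number.
// Exact for values below 2^53, which is ample for these counts.
function readUint64LE(view: DataView, offset: number): number {
  const low = view.getUint32(offset, true);
  const high = view.getUint32(offset + 4, true);
  return high * 4294967296 + low;
}

function readGGUFHeader(buffer: ArrayBuffer): GGUFHeader {
  if (buffer.byteLength < 24) {
    throw new Error('Need at least 24 bytes to read a GGUF header');
  }
  const view = new DataView(buffer);
  // Bytes 0-3 hold the ASCII magic "GGUF".
  const magic = String.fromCharCode(
    view.getUint8(0),
    view.getUint8(1),
    view.getUint8(2),
    view.getUint8(3)
  );
  if (magic !== 'GGUF') {
    throw new Error('Not a GGUF file: unexpected magic bytes');
  }
  return {
    version: view.getUint32(4, true),
    tensorCount: readUint64LE(view, 8),
    metadataKVCount: readUint64LE(view, 16),
  };
}
```

The metadata key/value pairs (architecture, context length, tokenizer settings, and so on) follow this prefix, which is why a small read from the start of the file is enough to describe the model.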

Inspect Models Before Download

Use parseGGUFMetadata() to read a model's metadata via HTTP Range requests. This downloads only ~4KB of header data, not the full model file.

import { parseGGUFMetadata } from '@localmode/wllama';

const metadata = await parseGGUFMetadata(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

console.log(metadata.architecture);   // 'llama'
console.log(metadata.contextLength);  // 131072
console.log(metadata.quantization);   // 'Q4_K_M'
console.log(metadata.parameterCount); // ~1.24 billion
console.log(metadata.vocabSize);      // 128256
console.log(metadata.layerCount);     // 16
console.log(metadata.headCount);      // 32
console.log(metadata.modelName);      // 'Llama 3.2 1B Instruct'
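The Range-request idea can be sketched independently of the library. The example below is an assumption about the general technique, not a description of how parseGGUFMetadata() is implemented: it asks a server for only the first bytes of a file and rejects responses that ignore the Range header. The names `rangeHeaderForPrefix`, `fetchPrefix`, and `RangeFetch` are hypothetical, and the fetch function is passed in so that the sketch stays easy to stub.

```typescript
// Hypothetical helper names; the fetch implementation is injected by the caller.
type RangeFetch = (
  url: string,
  init: { headers: Record<string, string> }
) => Promise<{ status: number; arrayBuffer(): Promise<ArrayBuffer> }>;

// Build the Range header value for the first `byteCount` bytes.
// HTTP byte ranges are inclusive, so 4096 bytes is "bytes=0-4095".
function rangeHeaderForPrefix(byteCount: number): string {
  if (!Number.isInteger(byteCount) || byteCount <= 0) {
    throw new Error('byteCount must be a positive integer');
  }
  return `bytes=0-${byteCount - 1}`;
}

async function fetchPrefix(
  url: string,
  byteCount: number,
  fetchImpl: RangeFetch
): Promise<ArrayBuffer> {
  const response = await fetchImpl(url, {
    headers: { Range: rangeHeaderForPrefix(byteCount) },
  });
  // 206 Partial Content means the server honored the Range header.
  // A plain 200 would mean it is sending the whole file instead.
  if (response.status !== 206) {
    throw new Error(`Expected 206 Partial Content, got ${response.status}`);
  }
  return response.arrayBuffer();
}
```

In a browser you would pass the global `fetch` as `fetchImpl`, for example `fetchPrefix(url, 4096, fetch)`. Checking for status 206 guards against hosts that silently ignore Range and start streaming a multi-GB file.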

URL Formats

parseGGUFMetadata() accepts multiple URL formats:

// HuggingFace shorthand (repo:filename)
await parseGGUFMetadata('bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf');

// Full HuggingFace URL
await parseGGUFMetadata('https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf');

// Any CDN that supports Range requests
await parseGGUFMetadata('https://your-cdn.com/models/custom-model.gguf');
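The shorthand and the full HuggingFace URL shown above refer to the same file. As a rough sketch of that mapping, the hypothetical helper below (`resolveModelUrl` is not a library export) expands `repo:filename` into a `resolve/main` URL and passes full URLs through unchanged. It assumes the file lives on the `main` branch, as in the example URL.

```typescript
// Hypothetical helper: expands "owner/repo:filename" into a HuggingFace URL.
// Assumes the file is on the "main" branch, matching the example above.
function resolveModelUrl(ref: string): string {
  // Full http(s) URLs are returned as-is.
  if (/^https?:\/\//i.test(ref)) {
    return ref;
  }
  const separator = ref.indexOf(':');
  if (separator === -1) {
    throw new Error(`Expected "repo:filename" or a full URL, got "${ref}"`);
  }
  const repo = ref.slice(0, separator);
  const filename = ref.slice(separator + 1);
  if (repo.indexOf('/') === -1 || filename.length === 0) {
    throw new Error(`Malformed shorthand: "${ref}"`);
  }
  return `https://huggingface.co/${repo}/resolve/main/${filename}`;
}
```

A helper like this is mainly useful when you want to show the user a clickable link or hand the same model reference to code that only understands full URLs.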

Check Browser Compatibility

Use checkGGUFBrowserCompatFromURL() to check if a model can run on the current device:

import { checkGGUFBrowserCompatFromURL } from '@localmode/wllama';

const result = await checkGGUFBrowserCompatFromURL(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

console.log(result.canRun);             // true
console.log(result.estimatedRAMHuman);   // '900 MB'
console.log(result.deviceRAMHuman);      // '8 GB'
console.log(result.estimatedSpeed);      // '~30-50 tok/s multi-thread'
console.log(result.hasCORS);             // true (multi-threading available)
console.log(result.warnings);            // []
console.log(result.recommendations);     // []

// Full metadata is also attached
console.log(result.metadata.architecture); // 'llama'
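One way to surface this result in a UI is to turn it into a few human-readable lines. The sketch below uses only the fields printed above. The `CompatSummaryInput` interface and `describeCompat` function are illustrative, and the assumptions that `warnings` and `recommendations` are arrays of strings and that `hasCORS: false` implies single-threaded execution are inferences from the example output, not confirmed API details.

```typescript
// Illustrative shape covering only the fields shown in the example above.
// Assumption: warnings and recommendations are arrays of strings.
interface CompatSummaryInput {
  canRun: boolean;
  estimatedRAMHuman: string;
  deviceRAMHuman: string;
  estimatedSpeed: string;
  hasCORS: boolean;
  warnings: string[];
  recommendations: string[];
}

function describeCompat(result: CompatSummaryInput): string[] {
  const lines: string[] = [];
  if (result.canRun) {
    lines.push(
      `Can run: needs about ${result.estimatedRAMHuman} of ${result.deviceRAMHuman} device RAM`
    );
    lines.push(`Expected speed: ${result.estimatedSpeed}`);
  } else {
    lines.push(
      `Cannot run: needs about ${result.estimatedRAMHuman}, device has ${result.deviceRAMHuman}`
    );
  }
  // Assumption: without hasCORS, multi-threading is unavailable.
  if (!result.hasCORS) {
    lines.push('Multi-threading unavailable; expect single-thread speed');
  }
  for (const warning of result.warnings) {
    lines.push(`Warning: ${warning}`);
  }
  for (const tip of result.recommendations) {
    lines.push(`Tip: ${tip}`);
  }
  return lines;
}
```

Keeping the formatting in a pure function like this makes it easy to unit test the messages without loading a model.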

Two-Step Workflow

You can also separate parsing from compatibility checking:

import { parseGGUFMetadata, checkGGUFBrowserCompat } from '@localmode/wllama';

// Step 1: Parse metadata (light HTTP Range request)
const modelUrl = 'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf';
const metadata = await parseGGUFMetadata(modelUrl);

// Step 2: Check compatibility (no network, instant)
const compat = await checkGGUFBrowserCompat(metadata);

Complete Workflow: Browse, Inspect, Check, Run

import {
  parseGGUFMetadata,
  checkGGUFBrowserCompat,
  wllama,
} from '@localmode/wllama';
import { streamText } from '@localmode/core';

// 1. User selects a model from HuggingFace
const modelUrl = 'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf';

// 2. Inspect metadata (~4KB download)
const metadata = await parseGGUFMetadata(modelUrl);
showModelInfo(metadata); // Display to user

// 3. Check compatibility (instant, no download)
const compat = await checkGGUFBrowserCompat(metadata);

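// showModelInfo, showWarnings, showRecommendations, and updateProgressBar
// are placeholders for your app's own UI code.
if (!compat.canRun) {
  // Explain why the model cannot run and suggest alternatives
  showWarnings(compat.warnings);
  showRecommendations(compat.recommendations);
} else {
  // 4. Download and run
  const model = wllama.languageModel(modelUrl, {
    onProgress: (p) => updateProgressBar(p.progress ?? 0),
  });

  const result = await streamText({ model, prompt: 'Hello!' });
}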

The following quantized models are curated for browser use and are compatible with wllama v3. The catalog includes 25 language models (including Qwen3 and DeepSeek R1 reasoning models), 3 embedding models, and 2 reranker models:

Tiny (< 500MB)

| Model | Size | Context | Parameters | Best For |
|---|---|---|---|---|
| SmolLM2-135M Q4_K_M | 70MB | 8K | 135M | Instant loading, testing |
| SmolLM2-360M Q4_K_M | 234MB | 8K | 360M | Very small, fast responses |
| Qwen 2.5 0.5B Q4_K_M | 386MB | 4K | 500M | Tiny with great quality |

Small (500MB -- 1GB)

| Model | Size | Context | Parameters | Best For |
|---|---|---|---|---|
| Qwen3 0.6B Q4_K_M | 530MB | 40K | 600M | Fast multilingual reasoning, hybrid thinking |
| TinyLlama 1.1B Chat Q4_K_M | 670MB | 2K | 1.1B | Classic, fast and reliable |
| Llama 3.2 1B Q4_K_M | 750MB | 128K | 1.2B | General purpose, huge context |
| Qwen 2.5 1.5B Q4_K_M | 986MB | 32K | 1.5B | Multilingual |

Medium (1 -- 2GB)

| Model | Size | Context | Parameters | Best For |
|---|---|---|---|---|
| Qwen 2.5 Coder 1.5B Q4_K_M | 1.0GB | 32K | 1.5B | Code-specialized, programming |
| SmolLM2 1.7B Q4_K_M | 1.06GB | 8K | 1.7B | Efficient, great per-param quality |
| DeepSeek R1 1.5B Q4_K_M | 1.1GB | 128K | 1.5B | Reasoning/thinking, chain-of-thought |
| Qwen3 1.7B Q4_K_M | 1.2GB | 40K | 1.7B | Multilingual reasoning, hybrid thinking |
| Phi 3.5 Mini Q4_K_M | 1.24GB | 4K | 3.8B | Reasoning, coding |
| Gemma 2 2B IT Q4_K_M | 1.3GB | 8K | 2B | Instruction following |
| Llama 3.2 3B Q4_K_M | 1.93GB | 128K | 3.2B | High quality, huge context |
| Qwen 2.5 3B Q4_K_M | 1.94GB | 32K | 3B | High quality multilingual |

Large (2 -- 5GB)

| Model | Size | Context | Parameters | Best For |
|---|---|---|---|---|
| Phi-4 Mini Q4_K_M | 2.3GB | 4K | 3.8B | Strong reasoning and coding |
| Qwen3 4B Q4_K_M | 2.7GB | 40K | 4B | Excellent multilingual reasoning and code |
| Qwen 2.5 Coder 7B Q4_K_M | 4.5GB | 32K | 7B | Best code generation quality |
| Mistral 7B v0.3 Q4_K_M | 4.37GB | 32K | 7.2B | Strong general performance |
| DeepSeek R1 7B Q4_K_M | 4.7GB | 128K | 7B | Strong reasoning/thinking, 8GB+ RAM |
| Llama 3.1 8B Q4_K_M | 4.92GB | 128K | 8B | Best quality (8GB+ RAM) |

Vision-Language (UI grounding)

| Model | Size | Context | Parameters | Best For |
|---|---|---|---|---|
| Holo2 4B Q4_K_M | 2.8GB | 256K | 4B | UI grounding, browser-agent / GUI navigation |
| Holo2 8B Q4_K_M | 5.1GB | 256K | 8B | Premium UI grounding, 8GB+ RAM required |
| Gemma 4 E2B IT Q4_K_M | 3.46GB | 128K | 5.1B (2.3B eff.) | Google Gemma 4, vision + tool calling |
| Gemma 4 E4B IT Q4_K_M | 5.41GB | 128K | 8B (~4B eff.) | Google Gemma 4, vision + tool calling, 8GB+ RAM |

Embedding Models

| Model | Size | Dimensions | Parameters | Best For |
|---|---|---|---|---|
| Nomic Embed Text v1.5 Q4_K_M | 78MB | 768 | 137M | High-quality semantic search |
| MxBai Embed Large v1 Q4_K_M | 197MB | 1024 | 335M | Top-quality English embeddings |
| BGE Small EN v1.5 Q8_0 | 35MB | 384 | 33M | Lightweight on-device embeddings |

Reranker Models

| Model | Size | Context | Parameters | Best For |
|---|---|---|---|---|
| Jina Reranker v2 Q4_K_M | 163MB | 1K | 278M | Multilingual cross-encoder reranking |
| BGE Reranker v2 M3 Q4_K_M | 218MB | 8K | 568M | Multilingual reranking, long context |

Access them programmatically:

import { WLLAMA_MODELS } from '@localmode/wllama';

for (const [id, info] of Object.entries(WLLAMA_MODELS)) {
  console.log(`${info.name}: ${info.size}, ${info.quantization}, ${info.description}`);
}
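A common follow-up is to filter the catalog by download size. The sketch below runs on a small inline sample rather than on `WLLAMA_MODELS` itself, because it assumes that `size` is a human-readable string such as `'750MB'` or `'1.93GB'`; check the actual type before adapting it. The entry IDs, `CatalogEntry`, `sizeToMB`, and `modelsUnder` are all hypothetical names.

```typescript
// Assumption: `size` is a string like '750MB' or '1.93GB' (decimal units).
interface CatalogEntry {
  name: string;
  size: string;
}

// Convert a size string to megabytes, treating 1 GB as 1000 MB.
function sizeToMB(size: string): number {
  const match = /^([\d.]+)\s*(MB|GB)$/i.exec(size.trim());
  if (!match) {
    throw new Error(`Unrecognized size: "${size}"`);
  }
  const value = parseFloat(match[1]);
  return match[2].toUpperCase() === 'GB' ? value * 1000 : value;
}

// Return the IDs of entries at or under the limit, smallest first.
function modelsUnder(
  catalog: Record<string, CatalogEntry>,
  maxMB: number
): string[] {
  return Object.keys(catalog)
    .filter((id) => sizeToMB(catalog[id].size) <= maxMB)
    .sort((a, b) => sizeToMB(catalog[a].size) - sizeToMB(catalog[b].size));
}

// Sample data taken from the tables above; the IDs are made up for this sketch.
const sampleCatalog: Record<string, CatalogEntry> = {
  'llama-3.2-1b': { name: 'Llama 3.2 1B Q4_K_M', size: '750MB' },
  'smollm2-135m': { name: 'SmolLM2-135M Q4_K_M', size: '70MB' },
  'llama-3.2-3b': { name: 'Llama 3.2 3B Q4_K_M', size: '1.93GB' },
  'qwen2.5-0.5b': { name: 'Qwen 2.5 0.5B Q4_K_M', size: '386MB' },
};

console.log(modelsUnder(sampleCatalog, 500));
```

With the sample above, only the two entries under 500MB are returned, ordered from smallest to largest.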

Quantization Comparison

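| Format | Bits/Weight | Size (7B) | Quality | Speed |
|---|---|---|---|---|
| Q2_K | ~2.5 | ~2.5GB | Low | Fastest |
| Q3_K_M | ~3.5 | ~3.3GB | Fair | Fast |
| Q4_K_M | ~4.5 | ~4.1GB | Good | Good |
| Q5_K_M | ~5.5 | ~4.8GB | Very Good | Moderate |
| Q6_K | ~6.5 | ~5.5GB | Excellent | Slower |
| Q8_0 | ~8.5 | ~7.2GB | Near-FP16 | Slow |
| F16 | 16 | ~14GB | Full | Slowest |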

Recommended Quantization

Q4_K_M is the sweet spot for browser use -- it provides good quality while keeping model sizes manageable. Use Q5_K_M for higher quality when your device has enough RAM.
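The sizes in the comparison table follow roughly from parameters × bits per weight ÷ 8. The sketch below applies that back-of-envelope formula and picks the highest-quality format whose estimated file size fits a budget. The bits-per-weight values come from the table; the function names and the idea of a size budget are illustrative. Real files deviate somewhat from this estimate, and runtime memory use is higher than file size, so treat the result as a starting point.

```typescript
// Approximate bits per weight, taken from the comparison table,
// ordered from lowest to highest quality.
const BITS_PER_WEIGHT: Array<[string, number]> = [
  ['Q2_K', 2.5],
  ['Q3_K_M', 3.5],
  ['Q4_K_M', 4.5],
  ['Q5_K_M', 5.5],
  ['Q6_K', 6.5],
  ['Q8_0', 8.5],
  ['F16', 16],
];

// Rough file size in decimal gigabytes: params * bits / 8 bytes.
function estimateFileSizeGB(
  parameterCount: number,
  bitsPerWeight: number
): number {
  return (parameterCount * bitsPerWeight) / 8 / 1e9;
}

// Highest-quality format whose estimate fits the budget, or null if none fits.
function bestQuantizationFor(
  parameterCount: number,
  budgetGB: number
): string | null {
  let best: string | null = null;
  for (const [format, bits] of BITS_PER_WEIGHT) {
    if (estimateFileSizeGB(parameterCount, bits) <= budgetGB) {
      best = format;
    }
  }
  return best;
}

console.log(estimateFileSizeGB(7e9, 16));    // 14
console.log(bestQuantizationFor(7e9, 4.5));  // 'Q4_K_M'
```

For a 7B model the estimate for Q4_K_M is about 3.9GB, close to the ~4.1GB in the table, while Q5_K_M comes out at about 4.8GB and no longer fits a 4.5GB budget.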

Finding GGUF Models on HuggingFace

Search HuggingFace for GGUF models:

  • Browse GGUF models
  • Filter by model family (Llama, Mistral, Qwen, Phi, etc.)
  • Look for repositories with a "-GGUF" suffix (e.g., bartowski/Llama-3.2-1B-Instruct-GGUF)
  • Choose Q4_K_M files for the best size/quality balance

Popular sources of GGUF repositories:

  • bartowski -- High-quality quantizations of popular models
  • TheBloke -- Extensive GGUF catalog
  • Qwen -- Official Qwen GGUF releases
  • meta-llama -- Official Meta Llama releases (often converted by community)
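GGUF repositories usually contain one file per quantization level, so choosing Q4_K_M comes down to picking the right filename. As a small sketch, the hypothetical helper below selects the first `.gguf` file whose name contains the requested quantization tag. It assumes the tag appears in the filename, as it does in the bartowski example; the file list in the usage note is illustrative.

```typescript
// Hypothetical helper: pick the .gguf file for a given quantization tag.
// Assumes the tag (e.g. 'Q4_K_M') appears in the filename.
function findQuantFile(
  filenames: string[],
  quantization: string = 'Q4_K_M'
): string | undefined {
  const needle = quantization.toLowerCase();
  for (const name of filenames) {
    const lower = name.toLowerCase();
    const isGGUF = lower.slice(-5) === '.gguf';
    if (isGGUF && lower.indexOf(needle) !== -1) {
      return name;
    }
  }
  return undefined;
}
```

For example, given `['README.md', 'Llama-3.2-1B-Instruct-Q8_0.gguf', 'Llama-3.2-1B-Instruct-Q4_K_M.gguf']`, calling `findQuantFile(files)` returns the Q4_K_M file, which can then be combined with the repository name into the `repo:filename` shorthand.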

GGUFMetadata Fields

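The examples on this page use the following fields: architecture, contextLength, quantization, parameterCount, vocabSize, layerCount, headCount, and modelName.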

GGUFBrowserCompat Fields

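The examples on this page use the following fields: canRun, estimatedRAMHuman, deviceRAMHuman, estimatedSpeed, hasCORS, warnings, recommendations, and metadata (the parsed GGUFMetadata).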

Showcase Apps

| App | Description | Links |
|---|---|---|
| GGUF Explorer | Browse 160K+ GGUF models with metadata inspection and chat | Demo · Source |
