GGUF Models
Browse, inspect, and check browser compatibility of GGUF models before downloading.
GGUF Model Browsing and Compatibility
The GGUF format is the de facto standard for quantized LLMs, with 160,000+ models on HuggingFace. The @localmode/wllama package provides tools to inspect model metadata and check browser compatibility before you download multi-GB files.
See it in action
Try GGUF Explorer for a working demo.
What is GGUF?
GGUF (GPT-Generated Unified Format) is a binary format for storing quantized LLMs. It stores model weights, architecture metadata, tokenizer data, and configuration in a single file. The format is used by llama.cpp and supports a range of precisions, from heavily quantized Q2_K up to unquantized F32.
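To make the single-file layout concrete, here is a minimal sketch that reads the fixed-size fields at the start of a GGUF file (version 2 or later): a 4-byte ASCII magic, a 32-bit version, and two 64-bit counts, all little-endian. `readGGUFHeader` and `readUint64LE` are hypothetical helpers written for illustration and are not part of @localmode/wllama.

```typescript
// Fixed-size GGUF header fields (GGUF v2 and later), all little-endian.
interface GGUFHeader {
  version: number;
  tensorCount: number;
  metadataKVCount: number;
}

// Read a 64-bit unsigned integer as a JS number.
// Exact only while the value stays below 2^53, which is plenty for counts.
function readUint64LE(view: DataView, offset: number): number {
  const low = view.getUint32(offset, true);
  const high = view.getUint32(offset + 4, true);
  return high * 4294967296 + low;
}

// Hypothetical helper for illustration; not the package's parser.
function readGGUFHeader(buffer: ArrayBuffer): GGUFHeader {
  if (buffer.byteLength < 24) {
    throw new Error('Need at least 24 bytes for the fixed GGUF header');
  }
  const view = new DataView(buffer);
  // Bytes 0-3: ASCII magic "GGUF"
  const magic = String.fromCharCode(
    view.getUint8(0), view.getUint8(1), view.getUint8(2), view.getUint8(3)
  );
  if (magic !== 'GGUF') {
    throw new Error(`Not a GGUF file (magic was "${magic}")`);
  }
  return {
    version: view.getUint32(4, true),          // bytes 4-7
    tensorCount: readUint64LE(view, 8),        // bytes 8-15
    metadataKVCount: readUint64LE(view, 16),   // bytes 16-23
  };
}
```

The key-value metadata (architecture, context length, tokenizer data, and so on) follows these 24 bytes, which is why a small read from the start of the file is enough to begin inspecting a model.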
Inspect Models Before Download
Use parseGGUFMetadata() to read a model's metadata via HTTP Range requests. This downloads only ~4KB of header data, not the full model file.
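For background, a header-only read of this kind relies on the HTTP `Range` header. The sketch below shows the general idea under stated assumptions: `fetchFirstBytes` and the `FetchLike` type are hypothetical names introduced here, and this is not how the package is known to implement its request.

```typescript
// Minimal fetch-like signature so the sketch can run without a network.
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> }
) => Promise<{ status: number; arrayBuffer(): Promise<ArrayBuffer> }>;

// Hypothetical helper; not part of @localmode/wllama.
async function fetchFirstBytes(
  url: string,
  byteCount: number,
  fetchImpl: FetchLike
): Promise<ArrayBuffer> {
  if (byteCount <= 0) {
    throw new Error('byteCount must be positive');
  }
  // Range is inclusive on both ends, so 4096 bytes is "bytes=0-4095".
  const response = await fetchImpl(url, {
    headers: { Range: `bytes=0-${byteCount - 1}` },
  });
  // 206 Partial Content means the server honored the range.
  // A 200 means it ignored the header and is sending the whole file.
  if (response.status !== 206) {
    throw new Error(`Range request not honored (HTTP ${response.status})`);
  }
  return response.arrayBuffer();
}
```

In a browser you could pass `(url, init) => fetch(url, init)` as `fetchImpl`. Rejecting anything other than 206 is a deliberate choice here: it avoids accidentally pulling a multi-GB file from a server that ignores `Range`. The package's own example follows.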
import { parseGGUFMetadata } from '@localmode/wllama';
const metadata = await parseGGUFMetadata(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);
console.log(metadata.architecture); // 'llama'
console.log(metadata.contextLength); // 131072
console.log(metadata.quantization); // 'Q4_K_M'
console.log(metadata.parameterCount); // ~1.24 billion
console.log(metadata.vocabSize); // 128256
console.log(metadata.layerCount); // 16
console.log(metadata.headCount); // 32
console.log(metadata.modelName); // 'Llama 3.2 1B Instruct'
URL Formats
parseGGUFMetadata() accepts multiple URL formats:
// HuggingFace shorthand (repo:filename)
await parseGGUFMetadata('bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf');
// Full HuggingFace URL
await parseGGUFMetadata('https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf');
// Any CDN that supports Range requests
await parseGGUFMetadata('https://your-cdn.com/models/custom-model.gguf');
Check Browser Compatibility
Use checkGGUFBrowserCompatFromURL() to check if a model can run on the current device:
import { checkGGUFBrowserCompatFromURL } from '@localmode/wllama';
const result = await checkGGUFBrowserCompatFromURL(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);
console.log(result.canRun); // true
console.log(result.estimatedRAMHuman); // '900 MB'
console.log(result.deviceRAMHuman); // '8 GB'
console.log(result.estimatedSpeed); // '~30-50 tok/s multi-thread'
console.log(result.hasCORS); // true (multi-threading available)
console.log(result.warnings); // []
console.log(result.recommendations); // []
// Full metadata is also attached
console.log(result.metadata.architecture); // 'llama'
Two-Step Workflow
You can also separate parsing from compatibility checking:
import { parseGGUFMetadata, checkGGUFBrowserCompat } from '@localmode/wllama';
const modelUrl = 'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf';
// Step 1: Parse metadata (light HTTP Range request)
const metadata = await parseGGUFMetadata(modelUrl);
// Step 2: Check compatibility (no network, instant)
const compat = await checkGGUFBrowserCompat(metadata);
Complete Workflow: Browse, Inspect, Check, Run
import {
  parseGGUFMetadata,
  checkGGUFBrowserCompat,
  wllama,
} from '@localmode/wllama';
import { streamText } from '@localmode/core';
// showModelInfo, showWarnings, showRecommendations, and updateProgressBar
// stand in for your app's own UI helpers.
// 1. User selects a model from HuggingFace
const modelUrl = 'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf';
async function inspectAndRun(url: string) {
  // 2. Inspect metadata (~4KB download)
  const metadata = await parseGGUFMetadata(url);
  showModelInfo(metadata); // Display to user
  // 3. Check compatibility (instant, no download)
  const compat = await checkGGUFBrowserCompat(metadata);
  if (!compat.canRun) {
    showWarnings(compat.warnings);
    showRecommendations(compat.recommendations);
    return; // Stop before downloading a model that cannot run
  }
  // 4. Download and run
  const model = wllama.languageModel(url, {
    onProgress: (p) => updateProgressBar(p.progress ?? 0),
  });
  return streamText({ model, prompt: 'Hello!' });
}
const result = await inspectAndRun(modelUrl);
Recommended Models (30 Curated)
These quantized models are curated for browser use (compatible with wllama v3). The catalog includes 25 language models (including Qwen3 and DeepSeek R1 reasoning models), 3 embedding models, and 2 reranker models:
Tiny (< 500MB)
| Model | Size | Context | Parameters | Best For |
|---|---|---|---|---|
| SmolLM2-135M Q4_K_M | 70MB | 8K | 135M | Instant loading, testing |
| SmolLM2-360M Q4_K_M | 234MB | 8K | 360M | Very small, fast responses |
| Qwen 2.5 0.5B Q4_K_M | 386MB | 4K | 500M | Tiny with great quality |
Small (500MB -- 1GB)
| Model | Size | Context | Parameters | Best For |
|---|---|---|---|---|
| Qwen3 0.6B Q4_K_M | 530MB | 40K | 600M | Fast multilingual reasoning, hybrid thinking |
| TinyLlama 1.1B Chat Q4_K_M | 670MB | 2K | 1.1B | Classic, fast and reliable |
| Llama 3.2 1B Q4_K_M | 750MB | 128K | 1.2B | General purpose, huge context |
| Qwen 2.5 1.5B Q4_K_M | 986MB | 32K | 1.5B | Multilingual |
Medium (1 -- 2GB)
| Model | Size | Context | Parameters | Best For |
|---|---|---|---|---|
| Qwen 2.5 Coder 1.5B Q4_K_M | 1.0GB | 32K | 1.5B | Code-specialized, programming |
| SmolLM2 1.7B Q4_K_M | 1.06GB | 8K | 1.7B | Efficient, great per-param quality |
| DeepSeek R1 1.5B Q4_K_M | 1.1GB | 128K | 1.5B | Reasoning/thinking, chain-of-thought |
| Qwen3 1.7B Q4_K_M | 1.2GB | 40K | 1.7B | Multilingual reasoning, hybrid thinking |
| Phi 3.5 Mini Q4_K_M | 1.24GB | 4K | 3.8B | Reasoning, coding |
| Gemma 2 2B IT Q4_K_M | 1.3GB | 8K | 2B | Instruction following |
| Llama 3.2 3B Q4_K_M | 1.93GB | 128K | 3.2B | High quality, huge context |
| Qwen 2.5 3B Q4_K_M | 1.94GB | 32K | 3B | High quality multilingual |
Large (2 -- 5GB)
| Model | Size | Context | Parameters | Best For |
|---|---|---|---|---|
| Phi-4 Mini Q4_K_M | 2.3GB | 4K | 3.8B | Strong reasoning and coding |
| Qwen3 4B Q4_K_M | 2.7GB | 40K | 4B | Excellent multilingual reasoning and code |
| Qwen 2.5 Coder 7B Q4_K_M | 4.5GB | 32K | 7B | Best code generation quality |
| Mistral 7B v0.3 Q4_K_M | 4.37GB | 32K | 7.2B | Strong general performance |
| DeepSeek R1 7B Q4_K_M | 4.7GB | 128K | 7B | Strong reasoning/thinking, 8GB+ RAM |
| Llama 3.1 8B Q4_K_M | 4.92GB | 128K | 8B | Best quality (8GB+ RAM) |
Vision-Language (UI grounding)
| Model | Size | Context | Parameters | Best For |
|---|---|---|---|---|
| Holo2 4B Q4_K_M | 2.8GB | 256K | 4B | UI grounding, browser-agent / GUI navigation |
| Holo2 8B Q4_K_M | 5.1GB | 256K | 8B | Premium UI grounding, 8GB+ RAM required |
| Gemma 4 E2B IT Q4_K_M | 3.46GB | 128K | 5.1B (2.3B eff.) | Google Gemma 4, vision + tool calling |
| Gemma 4 E4B IT Q4_K_M | 5.41GB | 128K | 8B (~4B eff.) | Google Gemma 4, vision + tool calling, 8GB+ RAM |
Embedding Models
| Model | Size | Dimensions | Parameters | Best For |
|---|---|---|---|---|
| Nomic Embed Text v1.5 Q4_K_M | 78MB | 768 | 137M | High-quality semantic search |
| MxBai Embed Large v1 Q4_K_M | 197MB | 1024 | 335M | Top-quality English embeddings |
| BGE Small EN v1.5 Q8_0 | 35MB | 384 | 33M | Lightweight on-device embeddings |
Reranker Models
| Model | Size | Context | Parameters | Best For |
|---|---|---|---|---|
| Jina Reranker v2 Q4_K_M | 163MB | 1K | 278M | Multilingual cross-encoder reranking |
| BGE Reranker v2 M3 Q4_K_M | 218MB | 8K | 568M | Multilingual reranking, long context |
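To choose from these tables by download budget, the size labels can be parsed and filtered. The following is a sketch under stated assumptions: `parseSizeLabel`, `modelsWithinBudget`, and `sampleRows` are hypothetical names, the rows are copied from the tables above rather than read from the package catalog, and decimal units (1GB = 10^9 bytes) are assumed.

```typescript
// Parse size labels as written in the tables above, e.g. "70MB" or "1.06GB".
function parseSizeLabel(label: string): number {
  const match = /^([\d.]+)\s*(MB|GB)$/i.exec(label.trim());
  if (!match) {
    throw new Error(`Unrecognized size label: ${label}`);
  }
  const value = parseFloat(match[1]);
  const unit = match[2].toUpperCase();
  return value * (unit === 'GB' ? 1e9 : 1e6); // decimal units are an assumption
}

interface CatalogRow {
  name: string;
  size: string;
}

// A few rows copied from the tables above; a local sample, not WLLAMA_MODELS.
const sampleRows: CatalogRow[] = [
  { name: 'SmolLM2-135M Q4_K_M', size: '70MB' },
  { name: 'Llama 3.2 1B Q4_K_M', size: '750MB' },
  { name: 'SmolLM2 1.7B Q4_K_M', size: '1.06GB' },
  { name: 'Llama 3.1 8B Q4_K_M', size: '4.92GB' },
];

// Keep rows that fit the budget, largest first.
function modelsWithinBudget(rows: CatalogRow[], budgetBytes: number): CatalogRow[] {
  return rows
    .filter((row) => parseSizeLabel(row.size) <= budgetBytes)
    .sort((a, b) => parseSizeLabel(b.size) - parseSizeLabel(a.size));
}

console.log(modelsWithinBudget(sampleRows, 1e9).map((row) => row.name));
```

Sorting largest first puts the most capable model that still fits the budget at the top of the list. The format of the `size` field in the real catalog is not shown on this page, so check it before reusing the parser.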
Access them programmatically:
import { WLLAMA_MODELS } from '@localmode/wllama';
for (const [id, info] of Object.entries(WLLAMA_MODELS)) {
  console.log(`${info.name}: ${info.size}, ${info.quantization}, ${info.description}`);
}
Quantization Comparison
| Format | Bits/Weight | Size (7B) | Quality | Speed |
|---|---|---|---|---|
| Q2_K | ~2.5 | ~2.5GB | Low | Fastest |
| Q3_K_M | ~3.5 | ~3.3GB | Fair | Fast |
| Q4_K_M | ~4.5 | ~4.1GB | Good | Good |
| Q5_K_M | ~5.5 | ~4.8GB | Very Good | Moderate |
| Q6_K | ~6.5 | ~5.5GB | Excellent | Slower |
| Q8_0 | ~8.5 | ~7.2GB | Near-FP16 | Slow |
| F16 | 16 | ~14GB | Full | Slowest |
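The Size column follows roughly from the Bits/Weight column: bytes ≈ parameters × bits per weight ÷ 8. Here is a small sketch of that arithmetic; `estimateQuantizedSizeBytes` and `toGB` are hypothetical helpers, and the bit widths are the approximate values from the table.

```typescript
// Approximate bits per weight, copied from the comparison table above.
const BITS_PER_WEIGHT: Record<string, number> = {
  Q2_K: 2.5,
  Q3_K_M: 3.5,
  Q4_K_M: 4.5,
  Q5_K_M: 5.5,
  Q6_K: 6.5,
  Q8_0: 8.5,
  F16: 16,
};

// Hypothetical helper: size in bytes is about parameters x bits per weight / 8.
function estimateQuantizedSizeBytes(parameterCount: number, format: string): number {
  const bits = BITS_PER_WEIGHT[format];
  if (bits === undefined) {
    throw new Error(`Unknown format: ${format}`);
  }
  return (parameterCount * bits) / 8;
}

// Format a byte count in decimal gigabytes with one decimal place.
function toGB(bytes: number): string {
  return `${(bytes / 1e9).toFixed(1)}GB`;
}

console.log(toGB(estimateQuantizedSizeBytes(7e9, 'Q4_K_M'))); // "3.9GB"
console.log(toGB(estimateQuantizedSizeBytes(7e9, 'F16')));    // "14.0GB"
```

The estimate for a 7B model at Q4_K_M lands near the ~4.1GB in the table. Real files differ somewhat because mixed quantization schemes do not store every tensor at the same bit width, so treat the result as a rough guide.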
Recommended Quantization
Q4_K_M is the sweet spot for browser use -- it provides good quality while keeping model sizes manageable. Use Q5_K_M for higher quality when your device has enough RAM.
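That preference can be written as a small rule: try Q5_K_M first and fall back to Q4_K_M when RAM is tight. The sketch below is illustrative only. `chooseQuantization` is a hypothetical helper, and the 1.2 overhead factor is an assumed allowance for the KV cache and runtime buffers, not a value taken from the package.

```typescript
type Quant = 'Q4_K_M' | 'Q5_K_M';

// Assumed headroom for KV cache and runtime buffers; tune for your app.
const RAM_OVERHEAD_FACTOR = 1.2;

// Hypothetical helper: pick the highest-quality format that fits in RAM.
function chooseQuantization(
  fileSizes: Record<Quant, number>,
  availableRamBytes: number
): Quant | null {
  // Prefer the higher-quality format, then fall back to the smaller one.
  const preference: Quant[] = ['Q5_K_M', 'Q4_K_M'];
  for (const quant of preference) {
    if (fileSizes[quant] * RAM_OVERHEAD_FACTOR <= availableRamBytes) {
      return quant;
    }
  }
  return null; // Neither format fits; consider a smaller model.
}

// File sizes for a 7B model, taken from the comparison table above.
const sizes7B: Record<Quant, number> = { Q4_K_M: 4.1e9, Q5_K_M: 4.8e9 };
console.log(chooseQuantization(sizes7B, 8e9)); // "Q5_K_M"
console.log(chooseQuantization(sizes7B, 5e9)); // "Q4_K_M"
```

In Chromium-based browsers, `navigator.deviceMemory` reports approximate device RAM in GB and could supply `availableRamBytes`. For an actual decision, the `canRun` result from the compatibility check described earlier is the more direct signal.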
Finding GGUF Models on HuggingFace
Search HuggingFace for GGUF models:
- Browse GGUF models
- Filter by model family (Llama, Mistral, Qwen, Phi, etc.)
- Look for repositories with a "-GGUF" suffix (e.g., bartowski/Llama-3.2-1B-Instruct-GGUF)
- Choose Q4_K_M files for the best size/quality balance
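Once you have a repository and its file list, two small helpers cover the last two steps: picking the Q4_K_M file and forming the `repo:filename` shorthand or full URL shown earlier on this page. This is a sketch under stated assumptions: `findQuantFile` and `resolveHuggingFaceShorthand` are hypothetical names, and the `resolve/main` path assumes the file lives on the repository's `main` branch.

```typescript
// Pick the first .gguf filename that carries the requested quantization tag.
function findQuantFile(filenames: string[], quant: string): string | undefined {
  const tag = quant.toLowerCase();
  return filenames.find((name) => {
    const lower = name.toLowerCase();
    return lower.endsWith('.gguf') && lower.includes(tag);
  });
}

// Expand "repo:filename" shorthand into a full HuggingFace download URL.
function resolveHuggingFaceShorthand(shorthand: string): string {
  const separator = shorthand.indexOf(':');
  if (separator <= 0 || separator === shorthand.length - 1) {
    throw new Error(`Expected "repo:filename", got "${shorthand}"`);
  }
  const repo = shorthand.slice(0, separator);
  const file = shorthand.slice(separator + 1);
  // Assumes the file is on the "main" branch.
  return `https://huggingface.co/${repo}/resolve/main/${file}`;
}

const files = [
  'README.md',
  'Llama-3.2-1B-Instruct-Q8_0.gguf',
  'Llama-3.2-1B-Instruct-Q4_K_M.gguf',
];
const chosen = findQuantFile(files, 'Q4_K_M');
if (chosen) {
  console.log(resolveHuggingFaceShorthand(`bartowski/Llama-3.2-1B-Instruct-GGUF:${chosen}`));
}
```

Matching on the filename is a convention rather than a guarantee, so confirm the quantization from the parsed metadata before committing to a download.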
Popular GGUF Repositories
- bartowski -- High-quality quantizations of popular models
- TheBloke -- Extensive GGUF catalog
- Qwen -- Official Qwen GGUF releases
- meta-llama -- Official Meta Llama releases (often converted by community)
GGUFMetadata Fields
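The examples on this page read these fields: architecture, contextLength, quantization, parameterCount, vocabSize, layerCount, headCount, and modelName.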
GGUFBrowserCompat Fields
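The examples on this page read these fields: canRun, estimatedRAMHuman, deviceRAMHuman, estimatedSpeed, hasCORS, warnings, recommendations, and metadata.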
Next Steps
wllama Overview
Installation, usage, and configuration guide.
Text Generation
Learn about streaming and generation options.