GGUF Models
Browse, inspect, and check browser compatibility of GGUF models before downloading.
GGUF Model Browsing and Compatibility
The GGUF format is the standard for quantized LLMs, with 135,000+ models available on HuggingFace. The @localmode/wllama package provides tools to inspect model metadata and check browser compatibility before downloading multi-GB files.
See it in action
Try GGUF Explorer for a working demo.
What is GGUF?
GGUF (GPT-Generated Unified Format) is a binary format for storing quantized LLMs. It packs model weights, architecture metadata, tokenizer data, and configuration into a single file. The format is used by llama.cpp and supports a range of quantization levels (Q2_K through F32).
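Under the hood, every GGUF file starts with a fixed header: the 4-byte magic "GGUF" followed by a little-endian uint32 format version (currently 3). A minimal sketch of reading that header (a standalone helper for illustration, not part of @localmode/wllama):

```typescript
// Reads the magic and version from the first 8 bytes of a GGUF file.
function readGGUFHeader(buf: ArrayBuffer): { magic: string; version: number } {
  const view = new DataView(buf);
  const magic = String.fromCharCode(
    view.getUint8(0), view.getUint8(1), view.getUint8(2), view.getUint8(3),
  );
  if (magic !== 'GGUF') throw new Error(`Not a GGUF file (magic: ${magic})`);
  return { magic, version: view.getUint32(4, true) }; // version is little-endian
}

// Demo with a hand-built header: "GGUF" + version 3.
const bytes = new Uint8Array([0x47, 0x47, 0x55, 0x46, 3, 0, 0, 0]);
console.log(readGGUFHeader(bytes.buffer)); // { magic: 'GGUF', version: 3 }
```

This is why a few KB fetched from the start of the file are enough to identify a model without downloading it.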
Inspect Models Before Download
Use parseGGUFMetadata() to read a model's metadata via HTTP Range requests. This downloads only ~4KB of header data, not the full model file.
```typescript
import { parseGGUFMetadata } from '@localmode/wllama';

const metadata = await parseGGUFMetadata(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

console.log(metadata.architecture);   // 'llama'
console.log(metadata.contextLength);  // 131072
console.log(metadata.quantization);   // 'Q4_K_M'
console.log(metadata.parameterCount); // ~1.24 billion
console.log(metadata.vocabSize);      // 128256
console.log(metadata.layerCount);     // 16
console.log(metadata.headCount);      // 32
console.log(metadata.modelName);      // 'Llama 3.2 1B Instruct'
```

URL Formats
parseGGUFMetadata() accepts multiple URL formats:
```typescript
// HuggingFace shorthand (repo:filename)
await parseGGUFMetadata('bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf');

// Full HuggingFace URL
await parseGGUFMetadata('https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf');

// Any CDN that supports Range requests
await parseGGUFMetadata('https://your-cdn.com/models/custom-model.gguf');
```

Check Browser Compatibility
Use checkGGUFBrowserCompatFromURL() to check if a model can run on the current device:
```typescript
import { checkGGUFBrowserCompatFromURL } from '@localmode/wllama';

const result = await checkGGUFBrowserCompatFromURL(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

console.log(result.canRun);            // true
console.log(result.estimatedRAMHuman); // '900 MB'
console.log(result.deviceRAMHuman);    // '8 GB'
console.log(result.estimatedSpeed);    // '~30-50 tok/s multi-thread'
console.log(result.hasCORS);           // true (multi-threading available)
console.log(result.warnings);          // []
console.log(result.recommendations);   // []

// Full metadata is also attached
console.log(result.metadata.architecture); // 'llama'
```

Two-Step Workflow
You can also separate parsing from compatibility checking:
```typescript
import { parseGGUFMetadata, checkGGUFBrowserCompat } from '@localmode/wllama';

// Step 1: Parse metadata (light HTTP Range request)
const metadata = await parseGGUFMetadata(modelUrl);

// Step 2: Check compatibility (no network, instant)
const compat = await checkGGUFBrowserCompat(metadata);
```

Complete Workflow: Browse, Inspect, Check, Run
```typescript
import {
  parseGGUFMetadata,
  checkGGUFBrowserCompat,
  wllama,
} from '@localmode/wllama';
import { streamText } from '@localmode/core';

// 1. User selects a model from HuggingFace
const modelUrl = 'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf';

// 2. Inspect metadata (~4KB download)
const metadata = await parseGGUFMetadata(modelUrl);
showModelInfo(metadata); // Display to user

// 3. Check compatibility (instant, no download)
const compat = await checkGGUFBrowserCompat(metadata);
if (!compat.canRun) {
  showWarnings(compat.warnings);
  showRecommendations(compat.recommendations);
  return;
}

// 4. Download and run
const model = wllama.languageModel(modelUrl, {
  onProgress: (p) => updateProgressBar(p.progress ?? 0),
});
const result = await streamText({ model, prompt: 'Hello!' });
```

Recommended Models
These Q4_K_M quantized models are curated for browser use (compatible with wllama v2.3.7 / llama.cpp b7179):
Tiny (< 500MB)
| Model | Size | Context | Parameters | Best For |
|---|---|---|---|---|
| SmolLM2-135M Q4_K_M | 70MB | 8K | 135M | Instant loading, testing |
| SmolLM2-360M Q4_K_M | 234MB | 8K | 360M | Very small, fast responses |
| Qwen 2.5 0.5B Q4_K_M | 386MB | 4K | 500M | Tiny with great quality |
Small (500MB -- 1GB)
| Model | Size | Context | Parameters | Best For |
|---|---|---|---|---|
| TinyLlama 1.1B Chat Q4_K_M | 670MB | 2K | 1.1B | Classic, fast and reliable |
| Llama 3.2 1B Q4_K_M | 750MB | 128K | 1.2B | General purpose, huge context |
| Qwen 2.5 1.5B Q4_K_M | 986MB | 32K | 1.5B | Multilingual |
Medium (1 -- 2GB)
| Model | Size | Context | Parameters | Best For |
|---|---|---|---|---|
| Qwen 2.5 Coder 1.5B Q4_K_M | 1.0GB | 32K | 1.5B | Code-specialized, programming |
| SmolLM2 1.7B Q4_K_M | 1.06GB | 8K | 1.7B | Efficient, great per-param quality |
| Phi 3.5 Mini Q4_K_M | 1.24GB | 4K | 3.8B | Reasoning, coding |
| Gemma 2 2B IT Q4_K_M | 1.3GB | 8K | 2B | Instruction following |
| Llama 3.2 3B Q4_K_M | 1.93GB | 128K | 3.2B | High quality, huge context |
| Qwen 2.5 3B Q4_K_M | 1.94GB | 32K | 3B | High quality multilingual |
Large (2 -- 5GB)
| Model | Size | Context | Parameters | Best For |
|---|---|---|---|---|
| Phi-4 Mini Q4_K_M | 2.3GB | 4K | 3.8B | Strong reasoning and coding |
| Qwen 2.5 Coder 7B Q4_K_M | 4.5GB | 32K | 7B | Best code generation quality |
| Mistral 7B v0.3 Q4_K_M | 4.37GB | 32K | 7.2B | Strong general performance |
| Llama 3.1 8B Q4_K_M | 4.92GB | 128K | 8B | Best quality (8GB+ RAM) |
Access them programmatically:
```typescript
import { WLLAMA_MODELS } from '@localmode/wllama';

for (const [id, info] of Object.entries(WLLAMA_MODELS)) {
  console.log(`${info.name}: ${info.size}, ${info.quantization}, ${info.description}`);
}
```

Quantization Comparison
| Format | Bits/Weight | Size (7B) | Quality | Speed |
|---|---|---|---|---|
| Q2_K | ~2.5 | ~2.5GB | Low | Fastest |
| Q3_K_M | ~3.5 | ~3.3GB | Fair | Fast |
| Q4_K_M | ~4.5 | ~4.1GB | Good | Good |
| Q5_K_M | ~5.5 | ~4.8GB | Very Good | Moderate |
| Q6_K | ~6.5 | ~5.5GB | Excellent | Slower |
| Q8_0 | ~8.5 | ~7.2GB | Near-FP16 | Slow |
| F16 | 16 | ~14GB | Full | Slowest |
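The file sizes in the table follow directly from bits per weight: roughly parameters × bits / 8 bytes, plus some overhead for embeddings and metadata. A quick sanity check against the table's figures:

```typescript
// Rough on-disk size estimate from parameter count and bits per weight.
function estimateSizeGB(params: number, bitsPerWeight: number): number {
  return (params * bitsPerWeight) / 8 / 1e9;
}

// 7B parameters at Q4_K_M's ~4.5 bits/weight:
console.log(estimateSizeGB(7e9, 4.5).toFixed(1)); // '3.9' -- close to the ~4.1GB listed
```

The small gap between the estimate and the listed size is the non-quantized tensors (embeddings, norms) and file metadata.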
Recommended Quantization
Q4_K_M is the sweet spot for browser use -- it provides good quality while keeping model sizes manageable. Use Q5_K_M for higher quality when your device has enough RAM.
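A sketch of applying this rule programmatically; the 1.5x headroom factor is an assumption to leave room for KV cache and runtime overhead, not a value from @localmode/wllama:

```typescript
// Prefer Q5_K_M when the device comfortably fits it, else fall back to Q4_K_M.
function pickQuantization(modelSizeGB: Record<string, number>, deviceRAMGB: number): string {
  const headroom = 1.5; // assumed overhead factor for KV cache and runtime
  if (modelSizeGB['Q5_K_M'] * headroom <= deviceRAMGB) return 'Q5_K_M';
  return 'Q4_K_M';
}

// 7B model, using the sizes from the table above:
console.log(pickQuantization({ Q4_K_M: 4.1, Q5_K_M: 4.8 }, 16)); // 'Q5_K_M'
console.log(pickQuantization({ Q4_K_M: 4.1, Q5_K_M: 4.8 }, 6));  // 'Q4_K_M'
```

In practice, prefer checkGGUFBrowserCompat() for the final decision, since it accounts for the actual device.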
Finding GGUF Models on HuggingFace
Search HuggingFace for GGUF models:
- Browse GGUF models
- Filter by model family (Llama, Mistral, Qwen, Phi, etc.)
- Look for repositories with a "-GGUF" suffix (e.g., bartowski/Llama-3.2-1B-Instruct-GGUF)
- Choose Q4_K_M files for the best size/quality balance
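Searches can also be scripted against the public HuggingFace Hub API. The endpoint and the filter=gguf tag below are assumptions based on the Hub API, not part of @localmode/wllama, so verify them against the Hub documentation before relying on them:

```typescript
// Builds a Hub API query URL for GGUF-tagged repositories matching a search term.
function buildGGUFSearchURL(query: string, limit = 10): string {
  const params = new URLSearchParams({ search: query, filter: 'gguf', limit: String(limit) });
  return `https://huggingface.co/api/models?${params}`;
}

// Pass the result to fetch(); each entry in the JSON response has an id
// like 'bartowski/Llama-3.2-1B-Instruct-GGUF'.
console.log(buildGGUFSearchURL('Llama-3.2-1B-Instruct'));
// https://huggingface.co/api/models?search=Llama-3.2-1B-Instruct&filter=gguf&limit=10
```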
Popular GGUF Repositories
- bartowski -- High-quality quantizations of popular models
- TheBloke -- Extensive GGUF catalog
- Qwen -- Official Qwen GGUF releases
- meta-llama -- Official Meta Llama releases (often converted by community)
GGUFMetadata Fields

| Prop | Type |
|---|---|
| architecture | string |
| contextLength | number |
| quantization | string |
| parameterCount | number |
| vocabSize | number |
| layerCount | number |
| headCount | number |
| modelName | string |

GGUFBrowserCompat Fields

| Prop | Type |
|---|---|
| canRun | boolean |
| estimatedRAMHuman | string |
| deviceRAMHuman | string |
| estimatedSpeed | string |
| hasCORS | boolean |
| warnings | string[] |
| recommendations | string[] |
| metadata | GGUFMetadata |
Next Steps
wllama Overview
Installation, usage, and configuration guide.
Text Generation
Learn about streaming and generation options.