
GGUF Models

Browse, inspect, and check browser compatibility of GGUF models before downloading.

GGUF Model Browsing and Compatibility

The GGUF format is the standard for quantized LLM models, with 135,000+ models on HuggingFace. The @localmode/wllama package provides tools to inspect model metadata and check browser compatibility before downloading multi-GB files.

See it in action

Try GGUF Explorer for a working demo.

What is GGUF?

GGUF (GPT-Generated Unified Format) is a binary format for storing quantized LLM models. It stores model weights, architecture metadata, tokenizer data, and configuration in a single file. The format is used by llama.cpp and supports many quantization levels (Q2_K through F32).
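The format is easy to recognize from its first bytes: per the GGUF spec, a file starts with the 4-byte ASCII magic "GGUF" followed by a little-endian uint32 version. A minimal detection sketch (illustrative; `parseGGUFMetadata()` does the full parse for you):

```typescript
// Sketch: detect a GGUF file from its first bytes (per the GGUF spec:
// 4-byte magic "GGUF", then a little-endian uint32 version).
function readGGUFHeader(bytes: Uint8Array): { isGGUF: boolean; version?: number } {
  if (bytes.length < 8) return { isGGUF: false };
  const magic = String.fromCharCode(bytes[0], bytes[1], bytes[2], bytes[3]);
  if (magic !== 'GGUF') return { isGGUF: false };
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return { isGGUF: true, version: view.getUint32(4, true) };
}
```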

Inspect Models Before Download

Use parseGGUFMetadata() to read a model's metadata via HTTP Range requests. This downloads only ~4KB of header data, not the full model file.

import { parseGGUFMetadata } from '@localmode/wllama';

const metadata = await parseGGUFMetadata(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

console.log(metadata.architecture);   // 'llama'
console.log(metadata.contextLength);  // 131072
console.log(metadata.quantization);   // 'Q4_K_M'
console.log(metadata.parameterCount); // ~1.24 billion
console.log(metadata.vocabSize);      // 128256
console.log(metadata.layerCount);     // 16
console.log(metadata.headCount);      // 32
console.log(metadata.modelName);      // 'Llama 3.2 1B Instruct'
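Under the hood, a header-only fetch comes down to an HTTP Range request. A minimal sketch of what that request looks like (illustrative; `parseGGUFMetadata()` handles this for you, and the 4KB budget here is an assumption):

```typescript
// Sketch: fetch only the first few KB of a remote file via an HTTP Range header.
const HEADER_BYTES = 4096; // hypothetical header budget

function rangeHeader(byteCount: number): Record<string, string> {
  // HTTP Range is inclusive, so the last byte index is byteCount - 1.
  return { Range: `bytes=0-${byteCount - 1}` };
}

// const res = await fetch(modelUrl, { headers: rangeHeader(HEADER_BYTES) });
// res.status === 206 (Partial Content) when the server honors the range.
```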

URL Formats

parseGGUFMetadata() accepts multiple URL formats:

// HuggingFace shorthand (repo:filename)
await parseGGUFMetadata('bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf');

// Full HuggingFace URL
await parseGGUFMetadata('https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf');

// Any CDN that supports Range requests
await parseGGUFMetadata('https://your-cdn.com/models/custom-model.gguf');
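Comparing the first two forms above, the shorthand maps onto the full HuggingFace resolve URL. A sketch of the likely expansion (the actual parsing lives inside `parseGGUFMetadata()`; resolving against the `main` branch is an assumption):

```typescript
// Sketch: expand "repo:filename" shorthand into a full HuggingFace URL.
function expandShorthand(model: string): string {
  if (model.startsWith('http')) return model; // already a full URL
  const sep = model.lastIndexOf(':');
  if (sep === -1) throw new Error('expected "repo:filename" or a full URL');
  const repo = model.slice(0, sep);
  const file = model.slice(sep + 1);
  return `https://huggingface.co/${repo}/resolve/main/${file}`;
}
```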

Check Browser Compatibility

Use checkGGUFBrowserCompatFromURL() to check if a model can run on the current device:

import { checkGGUFBrowserCompatFromURL } from '@localmode/wllama';

const result = await checkGGUFBrowserCompatFromURL(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

console.log(result.canRun);             // true
console.log(result.estimatedRAMHuman);   // '900 MB'
console.log(result.deviceRAMHuman);      // '8 GB'
console.log(result.estimatedSpeed);      // '~30-50 tok/s multi-thread'
console.log(result.hasCORS);             // true (multi-threading available)
console.log(result.warnings);            // []
console.log(result.recommendations);     // []

// Full metadata is also attached
console.log(result.metadata.architecture); // 'llama'
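For intuition about where a RAM estimate like '900 MB' comes from: the model weights plus the KV cache, which grows with context length. A back-of-the-envelope sketch (the package's real estimator may weigh more factors; the F16 cache and lack of grouped-query attention here are simplifying assumptions):

```typescript
// Back-of-the-envelope RAM estimate: model weights + KV cache.
// KV cache = 2 (K and V) * layers * contextLength * embeddingDim * bytesPerValue.
// Assumes an F16 cache (2 bytes/value) and no grouped-query attention.
function estimateRAMBytes(
  fileSizeBytes: number,
  layers: number,
  contextLength: number,
  embeddingDim: number,
): number {
  const kvCacheBytes = 2 * layers * contextLength * embeddingDim * 2;
  return fileSizeBytes + kvCacheBytes;
}

// A 750MB 1B model (16 layers, dim 2048) at a 4K context adds ~512MB of cache.
```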

Two-Step Workflow

You can also separate parsing from compatibility checking:

import { parseGGUFMetadata, checkGGUFBrowserCompat } from '@localmode/wllama';

// Step 1: Parse metadata (light HTTP Range request)
const metadata = await parseGGUFMetadata(modelUrl);

// Step 2: Check compatibility (no network, instant)
const compat = await checkGGUFBrowserCompat(metadata);
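Separating the two steps is handy when screening several candidate models at once. A sketch of that pattern; the parse/check functions are passed in as parameters so the logic is testable in isolation, and in app code you would pass `parseGGUFMetadata` and `checkGGUFBrowserCompat` from `@localmode/wllama`:

```typescript
// Sketch: screen a list of candidate model URLs and keep only the runnable ones.
async function filterRunnable(
  urls: string[],
  parse: (url: string) => Promise<unknown>,
  check: (meta: unknown) => Promise<{ canRun: boolean }>,
): Promise<string[]> {
  const runnable: string[] = [];
  for (const url of urls) {
    const meta = await parse(url);    // ~4KB Range request each
    const compat = await check(meta); // instant, no network
    if (compat.canRun) runnable.push(url);
  }
  return runnable;
}
```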

Complete Workflow: Browse, Inspect, Check, Run

import {
  parseGGUFMetadata,
  checkGGUFBrowserCompat,
  wllama,
} from '@localmode/wllama';
import { streamText } from '@localmode/core';

// 1. User selects a model from HuggingFace
const modelUrl = 'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf';

// 2. Inspect metadata (~4KB download)
const metadata = await parseGGUFMetadata(modelUrl);
showModelInfo(metadata); // Display to user

// 3. Check compatibility (instant, no download)
const compat = await checkGGUFBrowserCompat(metadata);

if (!compat.canRun) {
  showWarnings(compat.warnings);
  showRecommendations(compat.recommendations);
  return;
}

// 4. Download and run
const model = wllama.languageModel(modelUrl, {
  onProgress: (p) => updateProgressBar(p.progress ?? 0),
});

const result = await streamText({ model, prompt: 'Hello!' });

Curated Models

These Q4_K_M quantized models are curated for browser use (compatible with wllama v2.3.7 / llama.cpp b7179):

Tiny (< 500MB)

| Model | Size | Context | Parameters | Best For |
| --- | --- | --- | --- | --- |
| SmolLM2-135M Q4_K_M | 70MB | 8K | 135M | Instant loading, testing |
| SmolLM2-360M Q4_K_M | 234MB | 8K | 360M | Very small, fast responses |
| Qwen 2.5 0.5B Q4_K_M | 386MB | 4K | 500M | Tiny with great quality |

Small (500MB -- 1GB)

| Model | Size | Context | Parameters | Best For |
| --- | --- | --- | --- | --- |
| TinyLlama 1.1B Chat Q4_K_M | 670MB | 2K | 1.1B | Classic, fast and reliable |
| Llama 3.2 1B Q4_K_M | 750MB | 128K | 1.2B | General purpose, huge context |
| Qwen 2.5 1.5B Q4_K_M | 986MB | 32K | 1.5B | Multilingual |

Medium (1 -- 2GB)

| Model | Size | Context | Parameters | Best For |
| --- | --- | --- | --- | --- |
| Qwen 2.5 Coder 1.5B Q4_K_M | 1.0GB | 32K | 1.5B | Code-specialized, programming |
| SmolLM2 1.7B Q4_K_M | 1.06GB | 8K | 1.7B | Efficient, great per-param quality |
| Phi 3.5 Mini Q4_K_M | 1.24GB | 4K | 3.8B | Reasoning, coding |
| Gemma 2 2B IT Q4_K_M | 1.3GB | 8K | 2B | Instruction following |
| Llama 3.2 3B Q4_K_M | 1.93GB | 128K | 3.2B | High quality, huge context |
| Qwen 2.5 3B Q4_K_M | 1.94GB | 32K | 3B | High quality multilingual |

Large (2 -- 5GB)

| Model | Size | Context | Parameters | Best For |
| --- | --- | --- | --- | --- |
| Phi-4 Mini Q4_K_M | 2.3GB | 4K | 3.8B | Strong reasoning and coding |
| Qwen 2.5 Coder 7B Q4_K_M | 4.5GB | 32K | 7B | Best code generation quality |
| Mistral 7B v0.3 Q4_K_M | 4.37GB | 32K | 7.2B | Strong general performance |
| Llama 3.1 8B Q4_K_M | 4.92GB | 128K | 8B | Best quality (8GB+ RAM) |

Access them programmatically:

import { WLLAMA_MODELS } from '@localmode/wllama';

for (const [id, info] of Object.entries(WLLAMA_MODELS)) {
  console.log(`${info.name}: ${info.size}, ${info.quantization}, ${info.description}`);
}
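To filter the catalog against the device's available RAM, the human-readable size strings can be converted to bytes first. A sketch, assuming sizes look like "750MB" / "1.3GB" as in the tables above (the exact shape of `WLLAMA_MODELS` entries beyond `name`, `size`, `quantization`, and `description` is not shown here):

```typescript
// Sketch: parse human-readable size strings like "750MB" or "1.3GB" into bytes
// so the catalog can be filtered against available RAM.
function parseSizeToBytes(size: string): number {
  const m = /^([\d.]+)\s*(MB|GB)$/i.exec(size.trim());
  if (!m) throw new Error(`unrecognized size: ${size}`);
  const value = parseFloat(m[1]);
  return m[2].toUpperCase() === 'GB' ? value * 1e9 : value * 1e6;
}

// e.g. keep only models under 2GB:
// const small = Object.values(WLLAMA_MODELS).filter((m) => parseSizeToBytes(m.size) < 2e9);
```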

Quantization Comparison

| Format | Bits/Weight | Size (7B) | Quality | Speed |
| --- | --- | --- | --- | --- |
| Q2_K | ~2.5 | ~2.5GB | Low | Fastest |
| Q3_K_M | ~3.5 | ~3.3GB | Fair | Fast |
| Q4_K_M | ~4.5 | ~4.1GB | Good | Good |
| Q5_K_M | ~5.5 | ~4.8GB | Very Good | Moderate |
| Q6_K | ~6.5 | ~5.5GB | Excellent | Slower |
| Q8_0 | ~8.5 | ~7.2GB | Near-FP16 | Slow |
| F16 | 16 | ~14GB | Full | Slowest |

Recommended Quantization

Q4_K_M is the sweet spot for browser use -- it provides good quality while keeping model sizes manageable. Use Q5_K_M for higher quality when your device has enough RAM.
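The size column above follows directly from bits per weight: size ≈ parameters × bits / 8, plus overhead for embeddings and metadata. A quick sketch of that arithmetic:

```typescript
// Rough file-size estimate from parameter count and effective bits per weight.
// Actual GGUF files run somewhat larger (embeddings, metadata, mixed tensor types).
function estimateFileSizeGB(parameters: number, bitsPerWeight: number): number {
  return (parameters * bitsPerWeight) / 8 / 1e9;
}

// A 7B model at Q4_K_M (~4.5 bits/weight) comes out near 3.9 GB,
// close to the ~4.1GB in the table once overhead is added.
```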

Finding GGUF Models on HuggingFace

Search HuggingFace for GGUF models:

  • Browse GGUF models
  • Filter by model family (Llama, Mistral, Qwen, Phi, etc.)
  • Look for repositories with "-GGUF" suffix (e.g., bartowski/Llama-3.2-1B-Instruct-GGUF)
  • Choose Q4_K_M files for the best size/quality balance

Trusted uploaders:

  • bartowski -- High-quality quantizations of popular models
  • TheBloke -- Extensive GGUF catalog
  • Qwen -- Official Qwen GGUF releases
  • meta-llama -- Official Meta Llama releases (often converted by community)

GGUFMetadata Fields

Fields as used in the examples above (see the package's exported types for the authoritative list):

| Prop | Type |
| --- | --- |
| modelName | string |
| architecture | string |
| quantization | string |
| parameterCount | number |
| contextLength | number |
| vocabSize | number |
| layerCount | number |
| headCount | number |

GGUFBrowserCompat Fields

| Prop | Type |
| --- | --- |
| canRun | boolean |
| estimatedRAMHuman | string |
| deviceRAMHuman | string |
| estimatedSpeed | string |
| hasCORS | boolean |
| warnings | string[] |
| recommendations | string[] |
| metadata | GGUFMetadata |
Next Steps

Showcase Apps

| App | Description | Links |
| --- | --- | --- |
| GGUF Explorer | Browse 135K+ GGUF models with metadata inspection and chat | Demo · Source |
