What is GGUF and how many GGUF models are available for browser inference?

GGUF (GGML Universal File) is a self-contained format for quantized LLMs created by the developer of llama.cpp. A single GGUF file bundles model weights, architecture metadata, and tokenizer configuration. As of May 2026, HuggingFace hosts over 178,000 GGUF models, and @localmode/wllama can load any of them in the browser, subject to the memory available on the device.

Why does @localmode/wllama use WebAssembly instead of WebGPU?

WebAssembly has been supported in every major browser since 2017 (Chrome 57+, Firefox 52+, Safari 11+), while WebGPU is still not universally available (82% coverage in mid-2026). WASM runs at roughly 40-50% of WebGPU speed on CPU but works everywhere. Optional WebGPU acceleration is available when the browser supports it.

Can I inspect a GGUF model's metadata without downloading the full file?

Yes. The parseGGUFMetadata() function reads just the file header (roughly 4 KB) via an HTTP Range request. It returns architecture, context length, quantization type, parameter count, vocabulary size, layer count, and model name -- letting you check device compatibility before committing to a multi-gigabyte download.

How does @localmode/wllama compare to @localmode/webllm?

Wllama offers 30 curated models plus access to the 160,000+ GGUF models on HuggingFace, along with universal browser support, 3 embedding models, tool calling, and vision via mmprojUrl. WebLLM offers 32 curated MLC-compiled models that require WebGPU and run at native GPU speed. Wllama trades some speed for far broader model selection and compatibility.

GGUF Models in the Browser: 160,000+ Models Via llama.cpp WASM

If you follow r/LocalLLaMA or spend any time on HuggingFace, you have seen GGUF files everywhere. They are the standard format for running quantized LLMs locally, and there are now over 160,000 GGUF models on HuggingFace alone. Until recently, running them meant installing llama.cpp on your machine, fiddling with command-line flags, and hoping your GPU was compatible.

What if you could point your browser at any GGUF file on HuggingFace, inspect its metadata without downloading the full model, check whether your device can handle it, and then stream a conversation -- all without leaving a browser tab?

That is exactly what @localmode/wllama does. It wraps wllama (a WebAssembly binding for llama.cpp by ngxson) into the same LanguageModel interface used by @localmode/webllm and @localmode/transformers. The same streamText() and generateText() calls. The same hooks. The same middleware. But instead of requiring WebGPU, it runs on plain WASM -- meaning it works in every modern browser, including Firefox and Safari.

This post walks through the full stack: what GGUF is, why WASM matters, how to inspect and run models, and what tradeoffs to expect.

What Is GGUF?

GGUF stands for GGML Universal File. It was created by Georgi Gerganov (ggerganov), the developer behind llama.cpp, as a successor to the older GGML format. The format was introduced in August 2023 and has since become the dominant standard for distributing quantized LLMs.

Unlike tensor-only formats like safetensors, a single GGUF file bundles everything needed to run a model:

Model weights (quantized to various bit widths)
Architecture metadata (layer count, head count, embedding dimensions)
Tokenizer vocabulary and configuration
Context length, quantization type, and author information

This self-contained design is what makes the "inspect before download" workflow possible. The metadata is stored in the file header, so you can read it with a tiny HTTP Range request without downloading gigabytes of weights.
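
To make the header-first layout concrete, here is a minimal sketch of reading only the fixed-size start of a GGUF file. It relies on the published GGUF layout (a 4-byte "GGUF" magic, then a little-endian uint32 version, a uint64 tensor count, and a uint64 metadata key-value count). The function names and the 24-byte request are illustrative assumptions, not part of @localmode/wllama.

```typescript
// Hypothetical helpers for illustration; not part of @localmode/wllama.
interface GGUFFixedHeader {
  version: number;
  tensorCount: number;
  metadataKVCount: number;
}

// magic (4) + version (4) + tensor count (8) + metadata KV count (8)
const GGUF_FIXED_HEADER_BYTES = 24;

// Read a little-endian uint64 as a JS number (exact up to 2^53, plenty for counts).
function readUint64LE(view: DataView, offset: number): number {
  const low = view.getUint32(offset, true);
  const high = view.getUint32(offset + 4, true);
  return high * 4294967296 + low;
}

function parseGGUFFixedHeader(bytes: Uint8Array): GGUFFixedHeader {
  if (bytes.length < GGUF_FIXED_HEADER_BYTES) {
    throw new Error('Need at least 24 bytes to read the GGUF fixed header');
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // The file must start with the ASCII bytes "GGUF".
  const magic = String.fromCharCode(bytes[0], bytes[1], bytes[2], bytes[3]);
  if (magic !== 'GGUF') {
    throw new Error('Not a GGUF file (bad magic)');
  }
  return {
    version: view.getUint32(4, true),
    tensorCount: readUint64LE(view, 8),
    metadataKVCount: readUint64LE(view, 16),
  };
}

// Fetch only the first 24 bytes using an HTTP Range request.
async function fetchGGUFFixedHeader(url: string): Promise<GGUFFixedHeader> {
  const response = await fetch(url, {
    headers: { Range: `bytes=0-${GGUF_FIXED_HEADER_BYTES - 1}` },
  });
  // 206 Partial Content means the server honored the Range header.
  if (response.status !== 206) {
    throw new Error(`Server did not honor the Range request (status ${response.status})`);
  }
  return parseGGUFFixedHeader(new Uint8Array(await response.arrayBuffer()));
}
```

The key-value metadata follows these 24 bytes and varies in length, so a full parser has to read further into the file than this sketch does.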

The GGUF ecosystem on HuggingFace has grown rapidly. As of May 2026, there are over 178,000 GGUF models on the platform, published by quantizers like bartowski (high-quality quantizations of popular models), TheBloke (one of the earliest and most extensive GGUF catalogs), and unsloth (dynamic quantizations with state-of-the-art accuracy). If a model exists, someone has probably published a GGUF of it.

Why WASM Matters: Universal Browser Support

WebGPU is fast -- but it is not everywhere. As of early 2026, WebGPU is supported in Chrome 113+, Edge 113+, and Safari 26+. Firefox has shipped WebGPU on Windows (Firefox 141+) and on macOS ARM64 (Firefox 145+), but it remains off by default in many configurations. That means a meaningful fraction of your users cannot run WebGPU-based inference at all.

WebAssembly, on the other hand, has been supported in every major browser since 2017: Chrome 57+, Edge 16+, Firefox 52+, Safari 11+, and every iOS browser. If your users have a browser released in the last eight years, they can run WASM.

The @localmode/wllama package compiles llama.cpp to WASM via wllama, ngxson's WebAssembly binding that has gained significant adoption -- Firefox now officially uses wllama as one of the inference engines for its Link Preview feature.

Feature	@localmode/wllama (WASM)	@localmode/webllm (WebGPU)
Browser support	All modern browsers	WebGPU-capable only
Available models	30 curated + 160K+ GGUF models	~32 curated MLC models
Embeddings	3 GGUF embedding models	--
Model format	GGUF (universal standard)	MLC (pre-compiled)
GPU required	No (optional WebGPU acceleration)	Yes
Tool calling	Yes (via providerOptions)	Yes
Vision	Holo2 4B/8B, Gemma 4 E2B/E4B (mmprojUrl)	Phi 3.5 Vision
Inference speed	~40-50% of WebGPU (CPU), faster with WebGPU	Native GPU speed
Multi-threading	With CORS headers	Automatic

The tradeoff is clear: WASM is slower, but it works everywhere and gives you access to vastly more models. For many applications -- summarization, question answering, writing assistance, code generation -- the latency difference is acceptable, especially when the alternative is "does not work at all" for users without WebGPU.

Inspecting Models Before Downloading

One of the most useful features in @localmode/wllama is the GGUF metadata parser. Before committing to a multi-gigabyte download, you can read the model's header with a single HTTP Range request that fetches roughly 4KB of data.

import { parseGGUFMetadata } from '@localmode/wllama';

const metadata = await parseGGUFMetadata(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

console.log(metadata.architecture);   // 'llama'
console.log(metadata.contextLength);  // 131072
console.log(metadata.quantization);   // 'Q4_K_M'
console.log(metadata.parameterCount); // ~1.24 billion
console.log(metadata.vocabSize);      // 128256
console.log(metadata.layerCount);     // 16
console.log(metadata.headCount);      // 32
console.log(metadata.modelName);      // 'Llama 3.2 1B Instruct'

The parser accepts three URL formats: full HuggingFace URLs, the repo/name:filename.gguf shorthand, and any CDN URL that supports Range requests. Under the hood, it uses the @huggingface/gguf library to parse the binary header, then maps the raw general.file_type integer to a human-readable quantization string (Q4_K_M, Q5_K_M, Q8_0, and so on) using a comprehensive lookup table covering 25 known GGUF quantization types.
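
As a rough illustration of that mapping step, the sketch below turns a general.file_type integer into a quantization label. The codes shown are a partial subset of llama.cpp's file-type enum; the table name, the helper, and the fallback string are assumptions for illustration, and the package's own 25-entry table may be organized differently.

```typescript
// Partial, illustrative subset of llama.cpp's file-type codes.
// The package's real lookup table is larger and may be organized differently.
const FILE_TYPE_NAMES: Record<number, string> = {
  0: 'F32',
  1: 'F16',
  2: 'Q4_0',
  3: 'Q4_1',
  7: 'Q8_0',
  8: 'Q5_0',
  9: 'Q5_1',
  10: 'Q2_K',
  11: 'Q3_K_S',
  12: 'Q3_K_M',
  13: 'Q3_K_L',
  14: 'Q4_K_S',
  15: 'Q4_K_M',
  16: 'Q5_K_S',
  17: 'Q5_K_M',
  18: 'Q6_K',
};

// Hypothetical helper: returns a labeled placeholder for codes not in the table.
function fileTypeToQuantization(fileType: number): string {
  return FILE_TYPE_NAMES[fileType] ?? `unknown (${fileType})`;
}
```

Returning a labeled placeholder instead of throwing keeps the inspect flow usable when a file reports a quantization code the table does not cover.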

Checking Browser Compatibility

Knowing a model's metadata is useful, but the real question is: "Can my device actually run this?" The compatibility checker cross-references the parsed metadata with the current device's capabilities:

import { checkGGUFBrowserCompatFromURL } from '@localmode/wllama';

const result = await checkGGUFBrowserCompatFromURL(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

console.log(result.canRun);             // true
console.log(result.estimatedRAMHuman);   // '900 MB'
console.log(result.deviceRAMHuman);      // '8 GB'
console.log(result.estimatedSpeed);      // '~30-50 tok/s multi-thread'
console.log(result.hasCORS);             // true (multi-threading available)
console.log(result.warnings);            // []
console.log(result.recommendations);     // []

The checker estimates RAM usage (file size multiplied by a 1.2x overhead factor for WASM memory and inference buffers), reads device RAM from navigator.deviceMemory where available, queries storage quota from navigator.storage.estimate(), and detects whether SharedArrayBuffer is available for multi-threaded execution. It generates human-readable warnings and actionable recommendations -- for example, suggesting a Q4_K_M quantization if you are trying to load a Q8_0 that exceeds your RAM, or recommending CORS headers for a 2-4x speed improvement.
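
The RAM heuristic is simple enough to sketch. The version below applies the 1.2x factor from the text and compares the result with a device-memory figure passed in by the caller. Everything here (function names, result shape, decimal units, warning strings) is a hypothetical illustration, not the package's implementation.

```typescript
// Hypothetical sketch of the RAM heuristic described above; not the package's code.
const RAM_OVERHEAD_FACTOR = 1.2; // from the text: file size x 1.2

function estimateRAMBytes(fileSizeBytes: number): number {
  return Math.round(fileSizeBytes * RAM_OVERHEAD_FACTOR);
}

// Format with decimal units (1 GB = 1e9 bytes); the real formatter may differ.
function formatBytes(bytes: number): string {
  return bytes >= 1e9
    ? `${(bytes / 1e9).toFixed(1)} GB`
    : `${Math.round(bytes / 1e6)} MB`;
}

interface FitResult {
  canRun: boolean;
  estimatedRAMHuman: string;
  warnings: string[];
}

// deviceMemoryGB would come from navigator.deviceMemory in browsers that expose it;
// pass undefined where it is unavailable.
function checkFit(fileSizeBytes: number, deviceMemoryGB?: number): FitResult {
  const estimated = estimateRAMBytes(fileSizeBytes);
  const warnings: string[] = [];
  let canRun = true;
  if (deviceMemoryGB === undefined) {
    warnings.push('Device RAM unknown; estimate could not be checked');
  } else if (estimated > deviceMemoryGB * 1e9) {
    canRun = false;
    warnings.push('Estimated RAM exceeds reported device RAM');
  }
  return { canRun, estimatedRAMHuman: formatBytes(estimated), warnings };
}
```

For the 750MB Llama 3.2 1B file, this yields the same '900 MB' estimate shown in the example output. Because navigator.deviceMemory is a coarse, capped value, a failed check is better treated as a warning than as a hard limit.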

This is the same pattern used in the GGUF Explorer showcase app, where users browse the curated catalog, inspect any model, see compatibility results, and then download and chat -- all in one flow.

The Curated Model Catalog

While you can load any of the 160,000+ GGUF models on HuggingFace, @localmode/wllama ships a curated catalog of 30 models (25 language + 3 embedding + 2 reranker) that have been tested and verified to work well with the wllama v3 runtime. All language models are Q4_K_M quantized for the best size-to-quality ratio in browser environments.

Tiny (under 500MB)

Model	Size	Context	Parameters	Best For
SmolLM2 135M	70MB	8K	135M	Instant loading, testing
SmolLM2 360M	234MB	8K	360M	Very small, fast responses
Qwen 2.5 0.5B	386MB	4K	500M	Tiny with great quality

Small (500MB - 1GB)

Model	Size	Context	Parameters	Best For
TinyLlama 1.1B Chat	670MB	2K	1.1B	Classic, fast and reliable
Llama 3.2 1B	750MB	128K	1.2B	General purpose, huge context
Qwen 2.5 1.5B	986MB	32K	1.5B	Multilingual

Medium (1GB - 2GB)

Model	Size	Context	Parameters	Best For
Qwen 2.5 Coder 1.5B	1.0GB	32K	1.5B	Code-specialized
SmolLM2 1.7B	1.06GB	8K	1.7B	Efficient per-param quality
Phi 3.5 Mini	1.24GB	4K	3.8B	Reasoning, coding
Gemma 2 2B IT	1.3GB	8K	2B	Instruction following
Llama 3.2 3B	1.93GB	128K	3.2B	High quality, huge context
Qwen 2.5 3B	1.94GB	32K	3B	High quality multilingual

Large (2GB+)

Model	Size	Context	Parameters	Best For
Phi-4 Mini	2.3GB	4K	3.8B	Strong reasoning and coding
Qwen 2.5 Coder 7B	4.5GB	32K	7B	Best code generation
Mistral 7B v0.3	4.37GB	32K	7.2B	Strong general performance
Llama 3.1 8B	4.92GB	128K	8B	Best quality (8GB+ RAM)

The catalog is accessible programmatically via WLLAMA_MODELS, and each entry includes the HuggingFace download URL, architecture, parameter count, and context length. The getModelCategory() helper classifies any model by size tier.
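
The size tiers used in the tables above can be expressed as a small classifier. This is a hypothetical sketch built only from the tier boundaries listed in this post (in decimal units); the real getModelCategory() may take a different input or use different thresholds.

```typescript
type SizeTier = 'tiny' | 'small' | 'medium' | 'large';

// Hypothetical classifier using the tier boundaries from the tables above.
// Sizes are in bytes, with 1 GB taken as 1e9 bytes.
function tierForSize(sizeBytes: number): SizeTier {
  if (sizeBytes < 500e6) return 'tiny';
  if (sizeBytes < 1e9) return 'small';
  if (sizeBytes < 2e9) return 'medium';
  return 'large';
}
```

Under these boundaries, the 986MB Qwen 2.5 1.5B lands in the small tier while the 1.0GB Qwen 2.5 Coder 1.5B lands in medium, matching the tables.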

Running Models: Same Interface, Different Backend

The key design principle: @localmode/wllama implements the same LanguageModel interface as @localmode/webllm and @localmode/transformers. If you have code that calls streamText() or generateText(), switching to wllama is a one-line change.

Streaming generation

import { streamText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);

const result = await streamText({
  model,
  prompt: 'Explain quantum computing in simple terms.',
});

let fullText = '';
for await (const chunk of result.stream) {
  fullText += chunk.text;
  // Update your UI with each token
}

Non-streaming generation

import { generateText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');

const { text, usage } = await generateText({
  model,
  prompt: 'What is the capital of France?',
});

console.log(text);
console.log('Tokens used:', usage.totalTokens);

Loading any arbitrary GGUF model

You are not limited to the curated catalog. Pass any HuggingFace GGUF URL:

// HuggingFace shorthand (repo:filename)
const model = wllama.languageModel(
  'bartowski/Phi-3.5-mini-instruct-GGUF:Phi-3.5-mini-instruct-Q5_K_M.gguf'
);

// Full URL
const model2 = wllama.languageModel(
  'https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf'
);

The model is downloaded, cached in the browser's Cache API, and loaded into WASM memory on the first call. Subsequent loads skip the download entirely if the model is still cached, so they start much faster.
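
The shorthand and full-URL forms shown above point at the same kind of file. As a sketch of how the shorthand could expand, the hypothetical resolver below follows the URL pattern of the full TheBloke example and assumes the file lives on the repository's main branch. It is not the package's parser.

```typescript
// Hypothetical resolver for illustration; the package's own parsing may differ.
function resolveGGUFUrl(source: string): string {
  // Full URLs pass through unchanged.
  if (source.startsWith('https://') || source.startsWith('http://')) {
    return source;
  }
  // Shorthand: "owner/repo:filename.gguf"
  const separator = source.indexOf(':');
  if (separator === -1) {
    throw new Error(`Expected "owner/repo:filename.gguf", got "${source}"`);
  }
  const repo = source.slice(0, separator);
  const filename = source.slice(separator + 1);
  if (!repo.includes('/') || !filename.endsWith('.gguf')) {
    throw new Error(`Expected "owner/repo:filename.gguf", got "${source}"`);
  }
  // Assumption: the file is on the "main" branch of the repository.
  return `https://huggingface.co/${repo}/resolve/main/${filename}`;
}
```

Curated catalog IDs such as 'Llama-3.2-1B-Instruct-Q4_K_M' are a third input form that this sketch deliberately rejects; resolving those would require a lookup in the catalog.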

Understanding Quantization Types

If you browse GGUF repositories on HuggingFace, you will see filenames like Q4_K_M, Q5_K_M, Q8_0, and F16. These describe how aggressively the model weights have been compressed from their original floating-point precision.

Format	Bits/Weight	Size (7B model)	Quality	Speed	Use Case
Q2_K	~3.2	~2.5GB	Low	Fastest	Extreme compression, low quality
Q3_K_M	~4.0	~3.3GB	Fair	Fast	Tight memory budgets
Q4_K_M	~4.9	~4.1GB	Good	Good	Sweet spot for browser use
Q5_K_M	~5.7	~4.8GB	Very Good	Moderate	Higher quality when RAM allows
Q6_K	~6.5	~5.5GB	Excellent	Slower	Near-lossless quality
Q8_0	~8.5	~7.2GB	Near-FP16	Slow	Maximum quality, high RAM
F16	16	~14GB	Full	Slowest	Full precision, not for browsers

The "K" in Q4_K_M and Q5_K_M refers to the "K-quant" family introduced by llama.cpp, which uses per-block scaling and handles weight outliers better than older methods like Q4_0. The final letter names the variant: "S" (small, more aggressive compression), "M" (medium), or "L" (large, less aggressive).

Practical guidance

Q4_K_M is the recommended default for browser use. It cuts file size by roughly 70% relative to F16 with minimal perceptible quality loss for most tasks. Step up to Q5_K_M if you need better reasoning or coding accuracy and your device has the RAM. Q8_0 is the ceiling for when quality is paramount and memory is not a concern. Avoid Q2_K and Q3_K unless you are severely memory-constrained.
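
The sizes in the table follow from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below is a back-of-the-envelope estimate under that assumption; the function names are illustrative, and real GGUF files differ somewhat because tensors are stored at mixed precisions.

```typescript
// Rough estimate: bytes = parameters x bits-per-weight / 8.
// Real GGUF files differ somewhat because tensors are stored at mixed precisions.
function estimateFileSizeGB(parameterCount: number, bitsPerWeight: number): number {
  return (parameterCount * bitsPerWeight) / 8 / 1e9;
}

// Fractional size reduction relative to 16-bit weights.
function reductionVsF16(bitsPerWeight: number): number {
  return 1 - bitsPerWeight / 16;
}

// A nominal 7B model at Q4_K_M (~4.9 bits per weight).
const q4SizeGB = estimateFileSizeGB(7e9, 4.9); // about 4.3 GB
const q4Reduction = reductionVsF16(4.9);       // about 0.69, i.e. roughly 70% smaller
```

The estimate lands in the same range as the table's ~4.1GB figure, and the reduction matches the "roughly 70%" guidance.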

Multi-Threading and CORS

The wllama WASM runtime supports multi-threaded execution via SharedArrayBuffer, which requires the page to be cross-origin isolated. Without the proper headers, wllama falls back to single-threaded mode, which is roughly 2-4x slower.

To enable multi-threading, serve your pages with these two cross-origin isolation headers (COOP and COEP, often loosely called "CORS headers"):

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

In Next.js, this goes in next.config.js:

// next.config.js
module.exports = {
  // Send the cross-origin isolation headers on every route
  async headers() {
    return [{
      source: '/(.*)',
      headers: [
        { key: 'Cross-Origin-Opener-Policy', value: 'same-origin' },
        { key: 'Cross-Origin-Embedder-Policy', value: 'require-corp' },
      ],
    }];
  },
};

You can check the status programmatically:

import { isCrossOriginIsolated } from '@localmode/wllama';

if (isCrossOriginIsolated()) {
  console.log('Multi-threading enabled');
} else {
  console.log('Single-thread fallback -- add CORS headers for 2-4x speed');
}
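
Once you know whether the page is isolated, the remaining decision is how many threads to use. The sketch below shows one plausible policy: a single thread without isolation, otherwise the number of logical cores up to a cap. The function name and the default cap of 8 are assumptions for illustration; wllama's own heuristics may differ.

```typescript
// Hypothetical thread-count policy; wllama's own heuristics may differ.
function pickThreadCount(isolated: boolean, logicalCores: number, maxThreads = 8): number {
  // Without cross-origin isolation, SharedArrayBuffer is unavailable,
  // so only single-threaded execution is possible.
  if (!isolated) return 1;
  const cores = Math.max(1, Math.floor(logicalCores));
  return Math.min(cores, maxThreads);
}

// In a browser you might call:
//   pickThreadCount(globalThis.crossOriginIsolated === true, navigator.hardwareConcurrency ?? 1)
```

Capping the count avoids oversubscribing machines with many cores, where extra threads tend to add contention rather than throughput.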

Performance: Setting Expectations

Let's be direct about performance. WASM-based inference is slower than WebGPU. Benchmarks consistently show WebGPU delivering 2-5x higher throughput for autoregressive text generation. For a 1B parameter model, you might see 30-50 tokens per second multi-threaded via WASM versus 80-120 tokens per second via WebGPU on the same hardware.

But the comparison misses the point. WASM is not competing with WebGPU -- it is the fallback for when WebGPU is unavailable, and it is the gateway to 160,000+ models that are not available in pre-compiled MLC format. You can even use both together with a fallback pattern:

import { isWebGPUSupported } from '@localmode/core';
import { webllm } from '@localmode/webllm';
import { wllama } from '@localmode/wllama';

const hasGPU = await isWebGPUSupported();

const model = hasGPU
  ? webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC')
  : wllama.languageModel('bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf');

This gives your users WebGPU speed when available and WASM universality when not. Both providers implement the same LanguageModel interface, so the rest of your application code does not change.

The GGUF Explorer: Try It Live

The GGUF Explorer showcase app puts all of this together in a three-tab flow:

Browse -- View the curated catalog of 30 models (25 language + 3 embedding + 2 reranker) grouped by size tier, or paste any HuggingFace GGUF URL
Inspect -- See parsed metadata (architecture, parameters, context length, quantization) and browser compatibility (estimated RAM, device RAM, storage quota, speed estimate, warnings)
Chat -- Download the model and start a streaming conversation

The app demonstrates the complete @localmode/wllama API surface: WLLAMA_MODELS for the catalog, parseGGUFMetadata() for inspection, checkGGUFBrowserCompat() for compatibility, and wllama.languageModel() with streamText() for inference. Every operation runs entirely in the browser.

Getting Started

Install the package:

pnpm install @localmode/wllama @localmode/core

The minimal example is only a few lines:

import { streamText } from '@localmode/core';
import { wllama } from '@localmode/wllama';

const model = wllama.languageModel('Llama-3.2-1B-Instruct-Q4_K_M');
const result = await streamText({ model, prompt: 'Hello!' });

for await (const chunk of result.stream) {
  console.log(chunk.text);
}

For React applications, the same model works with every hook from @localmode/react -- useChat, useGenerateText, and the rest. No additional configuration needed.

Methodology

All code examples, model IDs, API exports, and catalog metadata were verified against packages/wllama/src/ (models.ts, gguf.ts, compat.ts, index.ts) in the LocalMode monorepo. The HuggingFace GGUF model count was verified by fetching huggingface.co/models?library=gguf in May 2026 (178,541 models). WebGPU browser support versions were verified against caniuse.com. GGUF format history and K-quant bit-depth data were cross-checked against the llama.cpp GitHub repository and the GGUF Wikipedia article.

Sources

llama.cpp -- The open-source C/C++ LLM inference framework by ggerganov that defines the GGUF format
GGUF specification on HuggingFace -- Official documentation on the GGUF file format and its metadata structure
GGUF - Wikipedia -- Introduction date (21 August 2023) and format history confirmed here
wllama -- WebAssembly binding for llama.cpp by ngxson, enabling browser-based inference; releases verified at github.com/ngxson/wllama/releases
HuggingFace GGUF model library -- Live GGUF model count: 178,541 as of May 2026
bartowski on HuggingFace -- Prolific GGUF quantizer whose models are used in the curated catalog
llama.cpp quantization discussion -- Community benchmarks comparing Q4_K_M, Q5_K_M, Q8_0, and other quantization types
WebGPU browser support - caniuse.com -- Chrome 113+, Edge 113+, Safari 26+ (partial); Firefox Windows 141+, macOS ARM64 145+
WebAssembly browser support - caniuse.com -- Supported since Chrome 57, Firefox 52, Safari 11, Edge 16 (all 2017)
WebGPU vs WebAssembly benchmarks -- Performance comparisons between WASM and WebGPU for browser-based ML inference
All code examples reference the @localmode/wllama package (wrapping @wllama/wllama ^3.2.3 and @huggingface/gguf ^0.1.14) and the GGUF Explorer showcase app in the LocalMode repository

Try it yourself

Visit localmode.ai to try 30+ AI demo apps running entirely in your browser. No sign-up, no API keys, no data leaves your device.

Read the Getting Started guide to add local AI to your application in under 5 minutes.
