Q: What is the context window for Granite models?

Granite 4.0 350M supports 32,768 tokens and Granite 4.0 1B supports 128,000 tokens. The 1B variant's large context window makes it suitable for long-document tasks.

Q: Does Granite require WebGPU?

No. Granite models run through Transformers.js with WebGPU acceleration where available and fall back to WASM otherwise. WebGPU is preferred for best performance with these ONNX models.

Granite Models in the Browser

Q: What is the smallest Granite model available for browser inference?

Granite 4.0 350M at approximately 350MB (q4f16 quantization) is the smallest ONNX-format LLM available through LocalMode's Transformers.js provider. It supports a 32K token context window and 12 languages.

Q: How many languages does Granite 4.0 support?

Both Granite 4.0 variants support 12 languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Both are released under the Apache 2.0 license.

IBM's Granite 4.0 - ultra-compact 350M and 1B ONNX models for fast browser inference via Transformers.js v4.

Overview

The Granite family is available in LocalMode through the Transformers.js provider. These models target text generation and work with any application built on the LocalMode SDK. Both Granite 4.0 variants support 12 languages (English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese) and are released under the Apache 2.0 license.

Running Granite models locally in the browser eliminates API costs, removes network latency, and keeps all user data on-device. After the initial model download, inference needs no network round trip and works offline. Each model variant targets a different trade-off between size, speed, and quality - choose based on your users' device capabilities and your application's requirements.

Architecture and History

IBM's Granite 4.0 350M is the smallest ONNX-format LLM available through LocalMode's Transformers.js provider. Released on October 28, 2025, it targets on-device and browser deployments with a 32K token context window. Its training on IBM's curated enterprise data gives it an edge in formal, structured outputs. It's particularly useful for generating short-form content: email subject lines, category labels, brief descriptions, and template completions.

The Granite 4.0 1B model is the larger option, with 1.6B actual parameters (the "1B" name refers to the size category) and a 128K token context window - making it suitable for long-document tasks. Both models use a decoder-only dense transformer architecture with grouped-query attention (GQA), SwiGLU activations, and RoPE position embeddings.

As an ONNX model running through Transformers.js v4, Granite works with WebGPU acceleration where available and falls back to WASM otherwise. It uses the same LanguageModel interface as WebLLM and wllama models, so it can be swapped in as a lightweight alternative by changing only the import and model ID.
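The WebGPU-or-WASM decision described above can be previewed in application code, for example to tell users what performance to expect before a download starts. The following is a minimal sketch, not how Transformers.js selects its backend internally: `detectBackend`, `NavigatorLike`, and `GpuLike` are hypothetical names introduced here, and the only real browser API assumed is `navigator.gpu.requestAdapter()` from the WebGPU specification.

```typescript
// Minimal shape of the parts of `navigator` this sketch touches (hypothetical types).
interface GpuLike {
  requestAdapter(): Promise<unknown | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type Backend = 'webgpu' | 'wasm';

// Reports 'webgpu' only when the browser exposes WebGPU AND grants an adapter.
// Any missing API, null adapter, or thrown error is treated as "use WASM".
async function detectBackend(nav: NavigatorLike): Promise<Backend> {
  if (!nav.gpu) return 'wasm';
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? 'webgpu' : 'wasm';
  } catch {
    return 'wasm';
  }
}
```

In a browser this would be called as `detectBackend(navigator)`. Taking the navigator as a parameter keeps the function testable outside the browser.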

Variant Comparison

The following table lists every Granite variant available through LocalMode. Both run on the Transformers.js provider, and each model ID is also the name of its HuggingFace repository.

Model ID	Provider	Size (q4f16)	Speed	Quality	Context	Device
onnx-community/granite-4.0-350m-ONNX-web	Transformers.js	~350MB	Fast	Basic	32,768 tokens	WEBGPU
onnx-community/granite-4.0-1b-ONNX-web	Transformers.js	~1.25GB	Moderate	Good	128,000 tokens	WEBGPU

Size Distribution (q4f16 quantization)

Size Range	Count	Variants
200MB–500MB	1	350M
1GB–2GB	1	1B
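
To make the size figures concrete, the first-load cost can be approximated as size divided by bandwidth. This is a rough sketch: `estimateDownloadSeconds` is a hypothetical helper, the sizes are the approximate q4f16 figures listed above (using 1GB = 1000MB), and the 50 Mbps connection is an assumed example that ignores latency and protocol overhead.

```typescript
// Approximate q4f16 download sizes from the tables above, in megabytes.
const GRANITE_350M_SIZE_MB = 350;
const GRANITE_1B_SIZE_MB = 1250;

// Converts megabytes to megabits (x8) and divides by the link speed in Mbps.
function estimateDownloadSeconds(sizeMB: number, bandwidthMbps: number): number {
  if (bandwidthMbps <= 0) {
    throw new RangeError('bandwidthMbps must be positive');
  }
  return (sizeMB * 8) / bandwidthMbps;
}

// Assumed 50 Mbps connection:
const small = estimateDownloadSeconds(GRANITE_350M_SIZE_MB, 50); // 56 seconds
const large = estimateDownloadSeconds(GRANITE_1B_SIZE_MB, 50); // 200 seconds
```

Under these assumptions the 1B variant takes roughly three and a half times as long to fetch on first load, and that cost is paid only once, before the model is cached.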

How to choose a variant: Start with the smallest model that meets your quality requirements. For prototyping and development, use the 350M variant (smallest size, "Fast" speed tier). For production, test your specific use case against both variants and measure the quality difference against user expectations. If users cannot distinguish "Basic" from "Good" output for your task, the smaller model saves download time and memory.
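
The selection advice above can be expressed as a small decision function. This is an illustrative sketch only: `chooseGraniteVariant` and `DeviceProfile` are hypothetical names, and the 8 GB memory threshold is an assumption chosen for the example, not a LocalMode recommendation. Measure on real devices before fixing a threshold.

```typescript
const GRANITE_350M = 'onnx-community/granite-4.0-350m-ONNX-web';
const GRANITE_1B = 'onnx-community/granite-4.0-1b-ONNX-web';

// Hypothetical description of the user's device.
interface DeviceProfile {
  hasWebGPU: boolean;
  // Approximate RAM in GB, e.g. from navigator.deviceMemory where the browser exposes it.
  deviceMemoryGB?: number;
}

// Picks the 1B variant only when WebGPU is present and memory meets the
// (assumed) threshold; everything else, including unknown memory, gets 350M.
function chooseGraniteVariant(device: DeviceProfile, minMemoryGBFor1B = 8): string {
  const memory = device.deviceMemoryGB ?? 0; // unknown memory is treated as low
  return device.hasWebGPU && memory >= minMemoryGBFor1B ? GRANITE_1B : GRANITE_350M;
}
```

Defaulting to the smaller model when information is missing follows the "start with the smallest model" rule: a wrong guess costs some quality rather than a failed load.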

Provider-Specific Code Examples

All Granite variants use the same LanguageModel interface from @localmode/core. Switching between providers requires changing only the import and model ID - no application logic changes.

Transformers.js

Transformers.js runs ONNX-optimized models via ONNX Runtime Web. WebGPU acceleration where available, WASM fallback otherwise.

import { streamText } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.languageModel('onnx-community/granite-4.0-350m-ONNX-web');

const result = await streamText({
  model,
  prompt: 'Hello, world!',
  maxTokens: 200,
});

Fallback Pattern

For maximum browser compatibility, wrap model loading in a try/catch: attempt the larger model first, and fall back to the smaller 350M variant if it fails to load (e.g. on low-memory devices).

import { transformers } from '@localmode/transformers';

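// Try the preferred model, fall back to a smaller one on failure.
// Assumption: languageModel() throws if the model cannot be loaded. If loading
// is deferred until first use, wrap the first generation call instead.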
let model;
try {
  model = transformers.languageModel('onnx-community/granite-4.0-1b-ONNX-web');
} catch (error) {
  console.warn('Primary model failed, using fallback:', error);
  model = transformers.languageModel('onnx-community/granite-4.0-350m-ONNX-web');
}
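
The same idea can be written as a reusable helper that tries an ordered list of model IDs and also handles loaders that fail asynchronously. This is a sketch under stated assumptions: `loadFirstAvailable` is a hypothetical helper, and the `load` callback stands in for whatever actually creates and warms up the model in your application. It is not a LocalMode API.

```typescript
// Tries each model ID in order and returns the first one whose loader resolves.
// `load` is supplied by the caller (hypothetical); it should reject on failure.
async function loadFirstAvailable<T>(
  modelIds: readonly string[],
  load: (modelId: string) => Promise<T>,
): Promise<{ modelId: string; model: T }> {
  const errors: unknown[] = [];
  for (const modelId of modelIds) {
    try {
      return { modelId, model: await load(modelId) };
    } catch (error) {
      // Remember the failure and move on to the next (smaller) candidate.
      errors.push(error);
    }
  }
  throw new Error(`No model could be loaded (${errors.length} attempts failed)`);
}
```

Listing `onnx-community/granite-4.0-1b-ONNX-web` before `onnx-community/granite-4.0-350m-ONNX-web` reproduces the order used in the try/catch example above.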

When to Use Granite

Granite models are a strong choice when:

You need text generation - Granite is optimized for generation tasks and comes in two size tiers (350M and 1B).
Browser compatibility matters - Granite runs through one provider (Transformers.js), which uses WebGPU where available and falls back to WASM otherwise, covering Chrome, Firefox, Safari, and Edge.
Long context matters - With 32K tokens (350M) and 128K tokens (1B), both variants handle documents far larger than most comparable small models.
Offline functionality is required - All variants work offline after the initial download, cached in IndexedDB via LocalMode's model caching system.
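
For the long-context case, it can help to check whether a document is likely to fit before sending it to a model. The sketch below uses the context lengths stated above. `estimateTokens` and `fitsInContext` are hypothetical helpers, and the four-characters-per-token ratio is a rough heuristic assumed for illustration, not Granite's actual tokenizer behaviour.

```typescript
// Context lengths as listed in this guide.
const CONTEXT_350M = 32768;
const CONTEXT_1B = 128000;

// Rough token estimate; the 4 chars/token default is an assumed heuristic.
function estimateTokens(text: string, charsPerToken = 4): number {
  return Math.ceil(text.length / charsPerToken);
}

// True when the estimated prompt plus the tokens reserved for the reply
// stays within the model's context window.
function fitsInContext(text: string, contextTokens: number, reservedForOutput = 200): boolean {
  return estimateTokens(text) + reservedForOutput <= contextTokens;
}

// A 200,000-character document is estimated at 50,000 tokens.
const longDocument = 'a'.repeat(200_000);
const fits350M = fitsInContext(longDocument, CONTEXT_350M); // false
const fits1B = fitsInContext(longDocument, CONTEXT_1B); // true
```

For precise budgeting, count tokens with the model's own tokenizer instead of the character heuristic.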

HuggingFace Model Cards

Text Generation - task guide
SmolLM2 - model guide
Qwen - model guide

Methodology

Provider availability and model IDs were verified against packages/transformers/src/models.ts in the LocalMode monorepo. Context lengths (32K for 350M, 128K for 1B) were confirmed from the official IBM Granite HuggingFace model cards. Download sizes reflect the q4f16 quantized ONNX data files as listed in the onnx/ directory of each onnx-community repository. The 1B model's actual parameter count (1.6B) was confirmed from the IBM Granite 4.0 1B model card. Actual download size may vary depending on which dtype is loaded at runtime; sizes shown are for the q4f16 (smallest web-optimized) quantization.

Sources

onnx-community/granite-4.0-350m-ONNX-web model card - ONNX file sizes, context length, usage example
onnx-community/granite-4.0-1b-ONNX-web model card - ONNX file sizes, context length, usage example
ibm-granite/granite-4.0-350m model card - parameter count (350M), context length (32K), languages, benchmarks, release date (Oct 28, 2025)
ibm-granite/granite-4.0-1b model card - parameter count (1.6B), context length (128K), languages, benchmarks
LocalMode transformers provider catalog - model IDs and provider availability
Transformers.js documentation

Frequently Asked Questions