What is the difference between CLIP and SigLIP models?

OpenAI's CLIP-ViT-Base-Patch32 (340MB, 512 dimensions) processes images in 32x32 patches and is the faster choice for real-time applications. Google's SigLIP-Base-Patch16-224 (400MB, 768 dimensions) uses finer 16x16 patches for better retrieval quality at the cost of more storage per vector.

What tasks can CLIP and SigLIP models perform in the browser?

They enable cross-modal search by mapping text and images into a shared vector space. You can index photos with visual embeddings, then search them with natural language queries like 'sunset over the ocean' or build visual similarity features.

Do CLIP and SigLIP models require WebGPU?

No. All CLIP and SigLIP variants run on WASM via Transformers.js, so they work in every modern browser without requiring WebGPU support.

CLIP & SigLIP Multimodal Models in the Browser

How large is the smallest CLIP model download?

CLIP-ViT-Base-Patch32 is 340MB for the vision encoder. Transformers.js loads encoders lazily, so text-only users download only the text encoder. The largest variant (CLIP-ViT-Large-Patch14) is approximately 1.2GB.

OpenAI's CLIP and Google's SigLIP are multimodal embedding models that connect text and images in a shared vector space.

Overview

The CLIP & SigLIP Multimodal family is available through Transformers.js in LocalMode, with vision encoder sizes ranging from 340MB to 1.2GB. The primary task for these models is multimodal-embedding, and they can be used with any application built on the LocalMode SDK.

Running CLIP & SigLIP Multimodal models locally in the browser eliminates API costs, removes network latency, and keeps all user data on-device. After the initial model download, inference needs no network round trip and works offline. Each model variant targets a different trade-off between size, speed, and quality - choose based on your users' device capabilities and your application's requirements.

Architecture and History

CLIP (Contrastive Language-Image Pre-training) by OpenAI and SigLIP (Sigmoid Language Image Pre-training) by Google are multimodal embedding models that map both text and images into a shared vector space. This enables cross-modal search: index photos with their visual embeddings, then search them using natural language queries like "sunset over the ocean" or "cat sitting on a laptop."
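The search step described above can be sketched with plain cosine similarity over toy vectors. This is only an illustration: the 3-dimensional embeddings, the `IndexedImage` type, and the `searchImages` helper are hypothetical, and whether LocalMode's VectorDB ranks by cosine similarity is an assumption. Real CLIP and SigLIP embeddings have 512 or 768 dimensions and come from the model, not from hand-written arrays.

```typescript
// Toy cross-modal search: rank image embeddings against a text-query embedding.
// All vectors below are made up for illustration.

type IndexedImage = { id: string; embedding: number[] };

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every indexed image against the query and keep the topK best matches.
function searchImages(
  query: number[],
  index: IndexedImage[],
  topK: number,
): { id: string; score: number }[] {
  return index
    .map((img) => ({ id: img.id, score: cosineSimilarity(query, img.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

const imageIndex: IndexedImage[] = [
  { id: 'beach-sunset.jpg', embedding: [0.9, 0.1, 0.0] },
  { id: 'cat-laptop.jpg', embedding: [0.0, 0.2, 0.9] },
  { id: 'mountain.jpg', embedding: [0.5, 0.8, 0.1] },
];

// Stand-in for the text embedding of "sunset over the ocean".
const queryEmbedding = [1.0, 0.2, 0.0];
const results = searchImages(queryEmbedding, imageIndex, 2);
console.log(results.map((r) => r.id)); // [ 'beach-sunset.jpg', 'mountain.jpg' ]
```

Because text and image embeddings live in the same space, the same ranking function serves both text-to-image and image-to-image queries; only the source of the query vector changes.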

CLIP-ViT-Base-Patch32 (340MB, 512 dimensions) is the faster model - it processes images in 32x32 patches, making it suitable for real-time applications and larger image collections. SigLIP-Base-Patch16-224 (400MB, 768 dimensions) uses finer 16x16 patches and higher-dimensional embeddings for better retrieval quality at the cost of more storage per vector.
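The patch-size and storage trade-off can be made concrete with a little arithmetic. The sketch below assumes a 224x224 input image for both models (the SigLIP model name states 224; for the CLIP base model this is an assumption here) and assumes embeddings are stored as 32-bit floats, which the text above does not specify. The helper names are hypothetical.

```typescript
// Number of patches a ViT-style encoder sees for a square input image.
function patchCount(imageSize: number, patchSize: number): number {
  const perSide = Math.floor(imageSize / patchSize);
  return perSide * perSide;
}

// Bytes needed to store one embedding (assumes Float32 = 4 bytes per value).
function vectorBytes(dimensions: number, bytesPerValue = 4): number {
  return dimensions * bytesPerValue;
}

// Raw embedding storage for a collection, in MiB (ignores any index overhead).
function indexMiB(imageCount: number, dimensions: number): number {
  return (imageCount * vectorBytes(dimensions)) / (1024 * 1024);
}

console.log(patchCount(224, 32)); // 49 patches for 32x32 patching
console.log(patchCount(224, 16)); // 196 patches for 16x16 patching
console.log(vectorBytes(512)); // 2048 bytes per 512-dimensional vector
console.log(vectorBytes(768)); // 3072 bytes per 768-dimensional vector
console.log(indexMiB(10_000, 512)); // 19.53125 MiB for 10,000 images
console.log(indexMiB(10_000, 768)); // 29.296875 MiB for 10,000 images
```

Under these assumptions the 16x16 model processes four times as many patches per image and needs 50% more storage per vector, which is the cost behind the quality gain described above.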

Both models use the embedImage() function from @localmode/core, which accepts images as URLs, Blobs, or File objects. The resulting embeddings are stored in the same VectorDB as text embeddings, enabling unified semantic search across text and images. A photo management app, an e-commerce product search with visual similarity, or a design inspiration tool - all can be built with these models running entirely in the browser.

Variant Comparison

The following table lists every CLIP & SigLIP Multimodal variant available through LocalMode. Each model ID is also the name of the model's repository on HuggingFace.

Model ID	Provider	Size	Speed	Quality	Dimensions	Device
Xenova/clip-vit-base-patch32	Transformers.js	340MB	Fast	Good	512d	WASM
Xenova/siglip-base-patch16-224	Transformers.js	400MB	Medium	High	768d	WASM
Xenova/clip-vit-base-patch16	Transformers.js	340MB	Medium	Good	512d	WASM
Xenova/clip-vit-large-patch14	Transformers.js	~1.2GB	Slow	High	768d	WASM

Size Distribution

Size Range	Count
200MB–500MB	3 variants
1GB–1.5GB	1 variant

How to choose a variant: Start with the smallest model that meets your quality requirements. For prototyping and development, use the fastest variant (smallest size, "Fast" speed tier). For production, test your specific use case against 2–3 variants and measure the quality difference against user expectations. In many applications, users cannot distinguish between "Good" and "High" quality tiers - the smaller model saves download time and memory.
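The "smallest model that meets your quality requirements" rule can be written down as a small selection helper. This is a sketch, not part of LocalMode: the `Variant` type and `pickVariant` function are hypothetical, the data is copied from the comparison table above, and the ~1.2GB entry is rounded to 1200MB for comparison purposes.

```typescript
type Quality = 'Good' | 'High';
type Speed = 'Fast' | 'Medium' | 'Slow';

interface Variant {
  id: string;
  sizeMB: number;
  speed: Speed;
  quality: Quality;
  dimensions: number;
}

// Values taken from the variant comparison table (1.2GB approximated as 1200MB).
const VARIANTS: Variant[] = [
  { id: 'Xenova/clip-vit-base-patch32', sizeMB: 340, speed: 'Fast', quality: 'Good', dimensions: 512 },
  { id: 'Xenova/siglip-base-patch16-224', sizeMB: 400, speed: 'Medium', quality: 'High', dimensions: 768 },
  { id: 'Xenova/clip-vit-base-patch16', sizeMB: 340, speed: 'Medium', quality: 'Good', dimensions: 512 },
  { id: 'Xenova/clip-vit-large-patch14', sizeMB: 1200, speed: 'Slow', quality: 'High', dimensions: 768 },
];

const QUALITY_RANK: Record<Quality, number> = { Good: 0, High: 1 };
const SPEED_RANK: Record<Speed, number> = { Fast: 0, Medium: 1, Slow: 2 };

// Smallest variant within the download budget that meets the quality floor.
// Ties on size are broken in favor of the faster speed tier.
function pickVariant(maxSizeMB: number, minQuality: Quality): Variant | undefined {
  return VARIANTS
    .filter((v) => v.sizeMB <= maxSizeMB && QUALITY_RANK[v.quality] >= QUALITY_RANK[minQuality])
    .sort((a, b) => a.sizeMB - b.sizeMB || SPEED_RANK[a.speed] - SPEED_RANK[b.speed])[0];
}

console.log(pickVariant(500, 'Good')?.id); // Xenova/clip-vit-base-patch32
console.log(pickVariant(500, 'High')?.id); // Xenova/siglip-base-patch16-224
console.log(pickVariant(300, 'Good')); // undefined: nothing fits a 300MB budget
```

A helper like this only narrows the candidates; the quality tiers are coarse labels, so the final choice should still come from testing the shortlisted variants on your own data.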

Provider-Specific Code Examples

All CLIP & SigLIP Multimodal variants use the same MultimodalEmbeddingModel interface from @localmode/core. Switching between variants requires changing only the model ID, and switching providers only the import as well - no application logic changes.

Transformers.js

Transformers.js runs ONNX-optimized models via ONNX Runtime Web. It uses WebGPU acceleration where available and falls back to WASM otherwise.

import { transformers } from '@localmode/transformers';

const model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch32');
// Use the model with the corresponding @localmode/core function

Fallback Pattern

For maximum browser compatibility, wrap model loading in a try/catch: attempt the preferred model first, and fall back to a smaller variant if it fails to load.

import { transformers } from '@localmode/transformers';

// Try the preferred model, fall back to a smaller one on failure
let model;
try {
  model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch16');
} catch (error) {
  console.warn('Primary model failed, using fallback:', error);
  model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch32');
}
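If model loading happens asynchronously (for example, on first use rather than when the model object is created), a synchronous try/catch will not observe the failure. The sketch below shows a generic async variant of the same fallback idea. The `loadWithFallback` helper and the `load` callback are hypothetical and not part of LocalMode; in a real application `load` would create the model and await whatever call actually triggers the download. A fake loader stands in for it here so the example runs on its own.

```typescript
// Try each model ID in order and return the first one that loads successfully.
async function loadWithFallback<T>(
  modelIds: string[],
  load: (id: string) => Promise<T>,
): Promise<{ id: string; model: T }> {
  const errors: string[] = [];
  for (const id of modelIds) {
    try {
      const model = await load(id);
      return { id, model };
    } catch (error) {
      // Record the failure and move on to the next candidate.
      errors.push(`${id}: ${error instanceof Error ? error.message : String(error)}`);
    }
  }
  throw new Error(`All models failed to load:\n${errors.join('\n')}`);
}

// Fake loader for demonstration: the preferred model "fails", the fallback succeeds.
const fakeLoad = async (id: string): Promise<string> => {
  if (id === 'Xenova/clip-vit-base-patch16') {
    throw new Error('simulated download failure');
  }
  return `loaded:${id}`;
};

loadWithFallback(
  ['Xenova/clip-vit-base-patch16', 'Xenova/clip-vit-base-patch32'],
  fakeLoad,
).then(({ id }) => console.log('Using model:', id)); // Using model: Xenova/clip-vit-base-patch32
```

Collecting the per-model error messages makes the final failure easier to diagnose when every candidate is rejected, for example on a device with too little memory for any variant.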

When to Use CLIP & SigLIP Multimodal

CLIP & SigLIP Multimodal models are a strong choice when:

You need multimodal embeddings - the family is built to embed text and images into one shared vector space, with variants across multiple size tiers.
Browser compatibility matters - all variants run through a single provider (Transformers.js) on WASM, which covers Chrome, Firefox, Safari, and Edge.
Size flexibility is important - The 340MB–1.2GB range means you can target everything from mobile devices to high-end desktops with the same model family.


Methodology

Model sizes and embedding dimensions were verified against HuggingFace model repositories for each Xenova ONNX model (files tab, onnx/ folder). Vision encoder sizes listed represent the full-precision vision_model.onnx file as the primary download for image-embedding workloads; Transformers.js loads encoders lazily so text-only users download only the text encoder. Embedding dimensions were confirmed from the Xenova model card code examples showing tensor dims output. LocalMode API patterns were verified against packages/transformers/src/implementations/clip-embedding.ts and packages/core/src/multimodal/. Performance tiers (speed/quality) are LocalMode's curated assessments based on patch size, parameter count, and embedding dimensionality. Always benchmark on your target devices before production deployment.
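For the on-device benchmarking recommended above, a minimal timing harness might look like the following. It is a sketch under stated assumptions: `median` and `benchmark` are hypothetical helpers, `performance.now()` is assumed to be available as a global (true in modern browsers and recent Node.js versions), and the `fakeEmbed` function is a placeholder for a real embedding call.

```typescript
// Median of a list of numbers (does not mutate the input).
function median(values: number[]): number {
  if (values.length === 0) throw new Error('median of empty list');
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Time an async operation: run warmup iterations first (model load, caches),
// then record the duration of each measured run in milliseconds.
async function benchmark(
  fn: () => Promise<unknown>,
  runs = 5,
  warmup = 1,
): Promise<{ medianMs: number; minMs: number; maxMs: number }> {
  if (runs < 1) throw new Error('runs must be at least 1');
  for (let i = 0; i < warmup; i++) await fn();
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await fn();
    samples.push(performance.now() - start);
  }
  return {
    medianMs: median(samples),
    minMs: Math.min(...samples),
    maxMs: Math.max(...samples),
  };
}

// Placeholder workload; replace with a real embedding call on the target device.
const fakeEmbed = () => new Promise<void>((resolve) => setTimeout(resolve, 5));

benchmark(fakeEmbed, 5, 1).then((stats) => console.log(stats));
```

Reporting the median rather than the mean keeps a single slow run (for example, a garbage-collection pause) from skewing the comparison between variants.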

Sources

Xenova/clip-vit-base-patch32 on HuggingFace - ONNX file sizes verified in Files tab
Xenova/siglip-base-patch16-224 on HuggingFace - embedding dims (768) confirmed from model card code example
Xenova/clip-vit-base-patch16 on HuggingFace - embedding dims (512) confirmed from model card
Xenova/clip-vit-large-patch14 on HuggingFace - ONNX file sizes verified in Files tab
openai/clip-vit-base-patch32 on HuggingFace - original model card (architecture reference)
google/siglip-base-patch16-224 on HuggingFace - original SigLIP model card (architecture, 768d confirmed)
LocalMode Transformers multimodal embeddings docs - sizes cross-referenced with official docs
