CLIP & SigLIP Multimodal Models in the Browser

OpenAI's CLIP and Google's SigLIP are multimodal embedding models that connect text and images in a shared vector space.

Overview

The CLIP & SigLIP Multimodal family is available through Transformers.js in LocalMode, with vision encoder sizes ranging from 340MB to roughly 1.2GB. These models target the multimodal-embedding task and can be used with any application built on the LocalMode SDK.

Running CLIP & SigLIP Multimodal models locally in the browser eliminates API costs, removes network latency, and keeps all user data on-device. After the initial model download, inference runs entirely on the device and works offline. Each model variant targets a different trade-off between size, speed, and quality - choose based on your users' device capabilities and your application's requirements.

Architecture and History

CLIP (Contrastive Language-Image Pre-training) by OpenAI and SigLIP (Sigmoid Language Image Pre-training) by Google are multimodal embedding models that map both text and images into a shared vector space. This enables cross-modal search: index photos with their visual embeddings, then search them using natural language queries like "sunset over the ocean" or "cat sitting on a laptop."
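To make the shared vector space concrete, here is a minimal sketch of cross-modal ranking by cosine similarity. The names cosineSimilarity, rankImages, and IndexedImage are hypothetical helpers written for this illustration, not LocalMode APIs, and the toy 3-dimensional vectors stand in for real 512- or 768-dimensional embeddings.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical shape for an indexed photo (not a LocalMode type).
interface IndexedImage {
  id: string;
  embedding: number[];
}

// Rank images by similarity to a text query embedding, best match first.
function rankImages(
  queryEmbedding: number[],
  images: IndexedImage[],
): { id: string; score: number }[] {
  return images
    .map((img) => ({ id: img.id, score: cosineSimilarity(queryEmbedding, img.embedding) }))
    .sort((x, y) => y.score - x.score);
}

// Toy vectors; real embeddings would come from the image and text encoders.
const photos: IndexedImage[] = [
  { id: 'beach.jpg', embedding: [0.9, 0.1, 0.0] },
  { id: 'office-cat.jpg', embedding: [0.1, 0.9, 0.2] },
];
const sunsetQuery = [1.0, 0.0, 0.1]; // stand-in for "sunset over the ocean"

const ranked = rankImages(sunsetQuery, photos);
console.log(ranked[0].id);
```

Because text and image vectors live in the same space, the same similarity function serves text-to-image, image-to-image, and image-to-text lookups.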

CLIP-ViT-Base-Patch32 (340MB, 512 dimensions) is the faster model. It splits each image into 32x32-pixel patches, so it processes fewer patches per image than a patch-16 model, which makes it suitable for real-time applications and larger image collections. SigLIP-Base-Patch16-224 (400MB, 768 dimensions) uses finer 16x16 patches and higher-dimensional embeddings for better retrieval quality at the cost of slower inference and more storage per vector.
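The storage trade-off can be estimated with simple arithmetic. The sketch below assumes each embedding is stored as uncompressed 32-bit floats with no index overhead; that storage format is an assumption for illustration, not a statement about how LocalMode's VectorDB lays out data.

```typescript
// Assumption: one 32-bit float (4 bytes) per dimension, no compression.
const BYTES_PER_FLOAT32 = 4;

function vectorStorageBytes(dimensions: number, vectorCount: number): number {
  return dimensions * BYTES_PER_FLOAT32 * vectorCount;
}

// 10,000 indexed images at each embedding size.
const clipBytes = vectorStorageBytes(512, 10_000); // 20,480,000 bytes
const siglipBytes = vectorStorageBytes(768, 10_000); // 30,720,000 bytes

console.log(siglipBytes / clipBytes); // 1.5
```

Under that assumption, 768-dimensional vectors need 1.5 times the storage of 512-dimensional ones for the same number of images.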

Both models use the embedImage() function from @localmode/core, which accepts images as URLs, Blobs, or File objects. The resulting embeddings are stored in the same VectorDB as text embeddings, enabling unified semantic search across text and images. A photo management app, an e-commerce product search with visual similarity, or a design inspiration tool can all be built with these models running entirely in the browser.
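As a rough illustration of unified search, the sketch below keeps text and image vectors in one in-memory store and optionally filters by kind. TinyVectorStore and its methods are hypothetical and are not the LocalMode VectorDB API. The store scores by dot product, which assumes the vectors are already unit-normalized, and it rejects vectors whose dimension differs from the store's, since 512-dimensional and 768-dimensional embeddings cannot share one index.

```typescript
type EntryKind = 'text' | 'image';

interface Entry {
  id: string;
  kind: EntryKind;
  vector: number[];
}

// Hypothetical in-memory store for illustration only.
class TinyVectorStore {
  private entries: Entry[] = [];
  private dimensions: number;

  constructor(dimensions: number) {
    this.dimensions = dimensions;
  }

  add(entry: Entry): void {
    if (entry.vector.length !== this.dimensions) {
      throw new Error(`Expected ${this.dimensions} dimensions, got ${entry.vector.length}`);
    }
    this.entries.push(entry);
  }

  // Return the ids of the topK best matches, optionally restricted to one kind.
  search(query: number[], topK: number, kind?: EntryKind): string[] {
    if (query.length !== this.dimensions) {
      throw new Error(`Expected ${this.dimensions} dimensions, got ${query.length}`);
    }
    // Dot product equals cosine similarity when vectors are unit-normalized.
    const dot = (v: number[]) => v.reduce((sum, x, i) => sum + x * query[i], 0);
    return this.entries
      .filter((e) => kind === undefined || e.kind === kind)
      .map((e) => ({ id: e.id, score: dot(e.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK)
      .map((r) => r.id);
  }
}

// Toy 2-dimensional unit vectors.
const store = new TinyVectorStore(2);
store.add({ id: 'note-1', kind: 'text', vector: [1, 0] });
store.add({ id: 'photo-1', kind: 'image', vector: [0.8, 0.6] });
store.add({ id: 'photo-2', kind: 'image', vector: [0, 1] });

console.log(store.search([1, 0], 2)); // mixed text and image results
console.log(store.search([1, 0], 2, 'image')); // images only
```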

Variant Comparison

The following table lists every CLIP & SigLIP Multimodal variant available through LocalMode, across all supported providers. Each model ID matches the name of its HuggingFace repository, where the model card is published.

Model ID | Provider | Size | Speed | Quality | Dimensions | Device
Xenova/clip-vit-base-patch32 | Transformers.js | 340MB | Fast | Good | 512 | WASM
Xenova/siglip-base-patch16-224 | Transformers.js | 400MB | Medium | High | 768 | WASM
Xenova/clip-vit-base-patch16 | Transformers.js | 340MB | Medium | Good | 512 | WASM
Xenova/clip-vit-large-patch14 | Transformers.js | ~1.2GB | Slow | High | 768 | WASM

Size Distribution

Size Range | Count
200MB–500MB | 3 variants
1GB–1.5GB | 1 variant

How to choose a variant: Start with the smallest model that meets your quality requirements. For prototyping and development, use the fastest variant (smallest size, "Fast" speed tier). For production, test your specific use case against 2–3 variants and measure the quality difference against user expectations. In many applications, users cannot distinguish between "Good" and "High" quality tiers - the smaller model saves download time and memory.
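The "smallest model that meets your quality requirements" rule can be sketched as a small selection function over the variant table. The Variant type and pickVariant helper are hypothetical, and the 1200 figure is an approximation of the "~1.2GB" listed for the large model.

```typescript
type Quality = 'Good' | 'High';
type Speed = 'Fast' | 'Medium' | 'Slow';

interface Variant {
  id: string;
  sizeMB: number;
  speed: Speed;
  quality: Quality;
}

// Data copied from the variant comparison table.
const variants: Variant[] = [
  { id: 'Xenova/clip-vit-base-patch32', sizeMB: 340, speed: 'Fast', quality: 'Good' },
  { id: 'Xenova/siglip-base-patch16-224', sizeMB: 400, speed: 'Medium', quality: 'High' },
  { id: 'Xenova/clip-vit-base-patch16', sizeMB: 340, speed: 'Medium', quality: 'Good' },
  { id: 'Xenova/clip-vit-large-patch14', sizeMB: 1200, speed: 'Slow', quality: 'High' },
];

const qualityRank: Record<Quality, number> = { Good: 0, High: 1 };
const speedRank: Record<Speed, number> = { Fast: 0, Medium: 1, Slow: 2 };

// Pick the smallest variant that meets the quality floor and size budget,
// breaking ties in favor of the faster speed tier.
function pickVariant(minQuality: Quality, maxSizeMB: number): string | undefined {
  const candidates = variants
    .filter((v) => qualityRank[v.quality] >= qualityRank[minQuality] && v.sizeMB <= maxSizeMB)
    .sort((a, b) => a.sizeMB - b.sizeMB || speedRank[a.speed] - speedRank[b.speed]);
  return candidates.length > 0 ? candidates[0].id : undefined;
}

console.log(pickVariant('Good', 500));
console.log(pickVariant('High', 500));
```

A static rule like this is only a starting point; the tiers are curated labels, so measuring quality on your own data still decides the final choice.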

Provider-Specific Code Examples

All CLIP & SigLIP Multimodal variants use the same MultimodalEmbeddingModel interface from @localmode/core. Switching between providers requires changing only the import and model ID - no application logic changes.

Transformers.js

Transformers.js runs ONNX-optimized models via ONNX Runtime Web. It uses WebGPU acceleration where available and falls back to WASM otherwise.

import { transformers } from '@localmode/transformers';

const model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch32');
// Use the model with the corresponding @localmode/core function
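How LocalMode chooses between WebGPU and WASM is not shown in this page. As a rough sketch of the underlying feature check, the hypothetical detectBackend helper below looks for the standard navigator.gpu entry point; it only returns a label and makes no assumption about how that label would be passed to LocalMode.

```typescript
type Backend = 'webgpu' | 'wasm';

// Hypothetical helper: report 'webgpu' when the WebGPU entry point exists.
// Note that the presence of navigator.gpu does not guarantee a usable
// adapter; a real check would also await navigator.gpu.requestAdapter().
function detectBackend(nav: unknown): Backend {
  const candidate = nav as { gpu?: unknown } | null | undefined;
  return candidate != null && candidate.gpu != null ? 'webgpu' : 'wasm';
}

// Read navigator through globalThis so the snippet also runs outside browsers.
const currentNavigator = (globalThis as { navigator?: unknown }).navigator;
const backend = detectBackend(currentNavigator);
console.log(`Preferred backend: ${backend}`);
```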

Fallback Pattern

For maximum browser compatibility, wrap model loading in a try/catch: attempt the preferred model first, and fall back to a faster variant if it fails to load.

import { transformers } from '@localmode/transformers';

// Try the preferred model, fall back to the faster patch32 variant on failure
let model;
try {
  model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch16');
} catch (error) {
  console.warn('Primary model failed, using fallback:', error);
  model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch32');
}
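If model loading is asynchronous, failures surface as rejected promises, and a synchronous try/catch around the factory call will not see them. Whether that applies to LocalMode is not confirmed here, so the following is only a sketch: loadWithFallback is a hypothetical helper that takes any async loader, and fakeLoad is a stand-in that simulates a failed download.

```typescript
// Try each model ID in order and return the first one that loads.
async function loadWithFallback<T>(
  modelIds: string[],
  load: (modelId: string) => Promise<T>,
): Promise<{ modelId: string; model: T }> {
  const failures: string[] = [];
  for (const modelId of modelIds) {
    try {
      const model = await load(modelId);
      return { modelId, model };
    } catch (error) {
      // Record the failure and move on to the next candidate.
      failures.push(`${modelId}: ${String(error)}`);
    }
  }
  throw new Error(`All candidate models failed to load:\n${failures.join('\n')}`);
}

// Stand-in loader for illustration only: pretends the patch16 download fails.
async function fakeLoad(modelId: string): Promise<string> {
  if (modelId.endsWith('patch16')) {
    throw new Error('simulated download failure');
  }
  return `loaded:${modelId}`;
}

loadWithFallback(
  ['Xenova/clip-vit-base-patch16', 'Xenova/clip-vit-base-patch32'],
  fakeLoad,
).then(({ modelId }) => console.log('Using', modelId));
```

Keeping the candidate list as data makes it easy to reorder or extend the fallback chain without touching the loading logic.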

When to Use CLIP & SigLIP Multimodal

CLIP & SigLIP Multimodal models are a strong choice when:

  • You need multimodal embeddings - The family is built for the multimodal-embedding task, with variants across multiple size tiers.
  • Browser compatibility matters - The family is available through one provider (Transformers.js), which covers Chrome, Firefox, Safari, and Edge.
  • Size flexibility is important - The 340MB–1.2GB range means you can target everything from mobile devices to high-end desktops with the same model family.

Methodology

Model sizes and embedding dimensions were verified against HuggingFace model repositories for each Xenova ONNX model (files tab, onnx/ folder). Vision encoder sizes listed represent the full-precision vision_model.onnx file as the primary download for image-embedding workloads; Transformers.js loads encoders lazily so text-only users download only the text encoder. Embedding dimensions were confirmed from the Xenova model card code examples showing tensor dims output. LocalMode API patterns were verified against packages/transformers/src/implementations/clip-embedding.ts and packages/core/src/multimodal/. Performance tiers (speed/quality) are LocalMode's curated assessments based on patch size, parameter count, and embedding dimensionality. Always benchmark on your target devices before production deployment.
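For the on-device benchmarking recommended above, a small timing harness is usually enough. The medianLatencyMs helper below is a hypothetical sketch: it accepts any function to time (for example, a closure that embeds one image), runs a few warm-up iterations that are not measured, and reports the median of the timed runs.

```typescript
// Measure the median wall-clock latency of `run` in milliseconds.
async function medianLatencyMs(
  run: () => Promise<void> | void,
  iterations = 10,
  warmup = 2,
): Promise<number> {
  if (iterations < 1) {
    throw new Error('iterations must be at least 1');
  }
  // Warm-up runs absorb one-time costs such as model compilation.
  for (let i = 0; i < warmup; i++) {
    await run();
  }
  const samples: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    await run();
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  const mid = Math.floor(samples.length / 2);
  return samples.length % 2 === 0 ? (samples[mid - 1] + samples[mid]) / 2 : samples[mid];
}

// Example with a placeholder workload; replace the body with a real inference call.
medianLatencyMs(() => {
  let sum = 0;
  for (let i = 0; i < 100_000; i++) sum += i;
}).then((ms) => console.log(`median: ${ms.toFixed(2)} ms`));
```

The median is used rather than the mean so that a single slow run, such as one interrupted by garbage collection, does not skew the result.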
