# Multimodal Embeddings

CLIP and SigLIP models for cross-modal text-image search with Transformers.js.

Use CLIP and SigLIP models to embed text and images into the same vector space, enabling cross-modal similarity search entirely in the browser.
## Core API Reference

For the full API reference, options, result types, and custom providers, see the Core Multimodal Embeddings guide.
## Getting Started

```bash
pnpm install @localmode/core @localmode/transformers
```

```ts
import { embed, embedImage, cosineSimilarity } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch32');

// Text and images share the same vector space
const { embedding: textVec } = await embed({ model, value: 'a golden retriever' });
const { embedding: imgVec } = await embedImage({ model, image: dogPhoto });

const similarity = cosineSimilarity(textVec, imgVec);
console.log(`Match: ${(similarity * 100).toFixed(1)}%`);
```

## Recommended Models
| Model | Dimensions | Size | Notes |
|---|---|---|---|
| `Xenova/clip-vit-base-patch32` | 512 | ~340MB | Fast, good general quality |
| `Xenova/clip-vit-base-patch16` | 512 | ~340MB | Better accuracy, slightly slower |
| `Xenova/siglip-base-patch16-224` | 768 | ~400MB | Improved CLIP variant |
## Provider Configuration

```ts
import { createTransformers } from '@localmode/transformers';

const myTransformers = createTransformers({
  device: 'webgpu', // Use WebGPU for faster inference
  quantized: true, // Use quantized models (default)
  onProgress: (p) => console.log(`Loading: ${p.progress}%`),
});

const model = myTransformers.multimodalEmbedding('Xenova/clip-vit-base-patch32');
```

## Per-Model Settings

```ts
const model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch32', {
  device: 'wasm', // Override device for this model
  quantized: false, // Use full-precision weights
  onProgress: (p) => updateUI(p),
});
```

## Lazy Encoder Loading
The CLIP model has separate text and vision encoders, and each loads lazily:

- The text encoder loads on the first `embed()` / `doEmbed()` call
- The vision encoder loads on the first `embedImage()` / `doEmbedImage()` call

This means that if you only use text embeddings, the vision encoder is never downloaded, saving bandwidth and memory.
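The lazy loading described above can be sketched as a cached-promise pattern. This is illustrative only, with invented class and method names; it is not the library's actual implementation:

```ts
// Sketch of lazy encoder loading -- each encoder is initialized on first use
// and its promise is cached, so an unused modality is never loaded.
class LazyEncoders {
  loaded: string[] = [];
  private textEncoder?: Promise<string>;
  private visionEncoder?: Promise<string>;

  private load(name: string): Promise<string> {
    this.loaded.push(name); // stand-in for the real download/initialization
    return Promise.resolve(name);
  }

  async embedText(text: string): Promise<string> {
    this.textEncoder ??= this.load('text'); // created once, on the first call
    return `${await this.textEncoder}(${text})`;
  }

  async embedImage(src: string): Promise<string> {
    this.visionEncoder ??= this.load('vision'); // independent of the text path
    return `${await this.visionEncoder}(${src})`;
  }
}

(async () => {
  const model = new LazyEncoders();
  await model.embedText('a golden retriever');
  await model.embedText('a tabby cat');
  console.log(model.loaded); // only 'text' -- the vision encoder never loaded
})();
```

Caching the promise (rather than the resolved encoder) also means concurrent first calls share a single initialization.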
## Recipes

### Image Search Engine

Index product images and search with natural language:

```ts
import { embed, embedImage, createVectorDB } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch32');
const db = await createVectorDB({ name: 'products', dimensions: 512 });

// Index: embed each product image
for (const product of catalog) {
  const { embedding } = await embedImage({ model, image: product.imageBlob });
  await db.add({
    id: product.id,
    vector: embedding,
    metadata: { name: product.name, price: product.price },
  });
}

// Search: find products matching text
const { embedding: queryVec } = await embed({ model, value: 'red leather handbag' });
const results = await db.search(queryVec, { k: 10 });
```

### Reverse Image Search
Find similar images using an image query:

```ts
import { embedImage } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch32');

// Embed the query image
const { embedding: queryVec } = await embedImage({ model, image: queryPhoto });

// Compare against indexed images
const results = await db.search(queryVec, { k: 5 });
```

### Batch Image Indexing
```ts
import { embedManyImages } from '@localmode/core';

const { embeddings } = await embedManyImages({
  model,
  images: productImages, // Array of Blobs or URLs
});

// Store all embeddings
for (let i = 0; i < embeddings.length; i++) {
  await db.add({
    id: `product-${i}`,
    vector: embeddings[i],
    metadata: { index: i },
  });
}
```

## Image Input Formats
The `ImageInput` type accepts:

| Format | Example | Notes |
|---|---|---|
| `Blob` | `new Blob([data], { type: 'image/jpeg' })` | Most common from file inputs |
| `string` (URL) | `'https://example.com/photo.jpg'` | Fetched automatically |
| `string` (data URI) | `'data:image/png;base64,...'` | From canvas or file reader |
| `ArrayBuffer` | `await file.arrayBuffer()` | Raw binary data |
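For illustration, the four accepted formats can be distinguished with simple runtime checks. The `describeImageInput` helper below is hypothetical, not part of `@localmode/core`:

```ts
// Hypothetical helper (not a @localmode/core export) showing how the four
// ImageInput formats can be told apart before processing.
type ImageInput = Blob | string | ArrayBuffer;

function describeImageInput(input: ImageInput): string {
  if (input instanceof Blob) return 'blob';
  if (input instanceof ArrayBuffer) return 'array-buffer';
  // After ruling out Blob and ArrayBuffer, the input must be a string
  if (input.startsWith('data:')) return 'data-uri';
  return 'url'; // any other string is treated as a fetchable URL
}

console.log(describeImageInput('https://example.com/photo.jpg')); // url
console.log(describeImageInput(new ArrayBuffer(8)));              // array-buffer
```

Note that data URIs must be checked before plain URLs, since both arrive as strings.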
## Tips

- **L2 normalization:** Both text and image embeddings are L2-normalized by the provider, so cosine similarity gives well-calibrated cross-modal scores.
- **Memory:** CLIP loads two separate encoders. On memory-constrained devices, consider using only one modality at a time.
- **Batch size:** The default `maxEmbeddingsPerCall` is 32. For larger batches, use `embedManyImages()`, which handles batching automatically.
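The L2-normalization tip is why cosine scores are comparable across modalities: for unit-length vectors, cosine similarity reduces to a plain dot product. A minimal sketch with assumed helper names (not the library's implementation):

```ts
// For L2-normalized (unit-length) vectors, cosine similarity is just the dot
// product -- no division by magnitudes needed. Helper names here are
// assumptions for illustration, not @localmode/core exports.
function l2Normalize(v: number[]): number[] {
  const norm = Math.hypot(...v); // Euclidean length of the vector
  return v.map((x) => x / norm);
}

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

const textVec = l2Normalize([3, 4]); // [0.6, 0.8], length 1
const imgVec = l2Normalize([4, 3]);  // [0.8, 0.6], length 1

// cos(theta) = dot(a, b) / (|a| * |b|), and |a| = |b| = 1 after normalization
console.log(dot(textVec, imgVec)); // ~0.96
```

Because the provider normalizes for you, `cosineSimilarity` over its embeddings never needs the magnitude correction.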