# Multimodal Embeddings

Embed text and images into the same vector space for cross-modal similarity search.
Multimodal embeddings map both text and images into the same vector space. This enables cross-modal search: find images using text queries, or find text descriptions that match an image.
> **See it in action:** Try Cross-Modal Search for a working demo of these APIs.
> **Provider required:** Multimodal embedding requires a provider that implements `MultimodalEmbeddingModel`. Currently supported: `@localmode/transformers` with CLIP/SigLIP models.
## embedImage()
Embed a single image:
```ts
import { embedImage } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch32');

const { embedding, usage, response } = await embedImage({
  model,
  image: imageBlob, // Blob, ArrayBuffer, URL string, or data URI
});

console.log('Dimensions:', embedding.length); // 512
```

Because text and images share one vector space, you can compare embeddings across modalities directly:

```ts
import { embed, embedImage, cosineSimilarity } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch32');

// Embed text and image with the SAME model
const { embedding: textVec } = await embed({
  model,
  value: 'a sunset over the ocean',
});

const { embedding: imgVec } = await embedImage({
  model,
  image: sunsetPhoto,
});

const similarity = cosineSimilarity(textVec, imgVec);
console.log(`Text-image similarity: ${(similarity * 100).toFixed(1)}%`);
```

Pass an `abortSignal` to cancel a long-running embedding:

```ts
const controller = new AbortController();
setTimeout(() => controller.abort(), 10000);

const { embedding } = await embedImage({
  model,
  image: largeImage,
  abortSignal: controller.signal,
});
```

### EmbedImageOptions
Options: `model` (a `MultimodalEmbeddingModel`), `image` (a Blob, ArrayBuffer, URL string, or data URI), and an optional `abortSignal`.
### EmbedImageResult

Result: the `embedding` (a `Float32Array`), plus `usage` and the raw provider `response`.
## embedManyImages()
Embed multiple images in a batch:
```ts
import { embedManyImages } from '@localmode/core';

const { embeddings, usage } = await embedManyImages({
  model,
  images: [image1, image2, image3],
});

console.log(embeddings.length); // 3
console.log(embeddings[0].length); // 512
```

### EmbedManyImagesOptions
Options: `model` and `images` (an array of image inputs).
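Providers advertise a `maxEmbeddingsPerCall` limit (32 in the custom implementation below), so a batch helper presumably has to split large inputs into provider-sized chunks. A minimal self-contained sketch of that chunking step, in plain TypeScript with no LocalMode APIs (the `chunk` helper name is hypothetical):

```typescript
// Split an array into chunks of at most `size` items, e.g. a provider's
// maxEmbeddingsPerCall limit of 32.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// 70 mock image handles split into batches of at most 32.
const images = Array.from({ length: 70 }, (_, i) => `img-${i}`);
const batches = chunk(images, 32);
console.log(batches.map((b) => b.length)); // → [ 32, 32, 6 ]
```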
## MultimodalEmbeddingModel Interface
The `MultimodalEmbeddingModel` interface extends `EmbeddingModel<string>`, meaning it works as a standard text embedding model everywhere `embed()` is used. It adds image (and optionally audio) embedding capabilities.
```ts
interface MultimodalEmbeddingModel extends EmbeddingModel<string> {
  readonly supportedModalities: ('text' | 'image' | 'audio')[];
  doEmbedImage(options: DoEmbedImageOptions): Promise<DoEmbedResult>;
  doEmbedAudio?(options: DoEmbedAudioOptions): Promise<DoEmbedResult>;
}
```

## Custom Implementation
You can implement `MultimodalEmbeddingModel` to create your own providers:
```ts
import type { MultimodalEmbeddingModel } from '@localmode/core';

class MyMultimodalEmbedder implements MultimodalEmbeddingModel {
  readonly modelId = 'custom:my-clip';
  readonly provider = 'custom';
  readonly dimensions = 512;
  readonly maxEmbeddingsPerCall = 32;
  readonly supportsParallelCalls = false;
  readonly supportedModalities = ['text', 'image'] as const;

  async doEmbed(options) {
    // Text embedding implementation
  }

  async doEmbedImage(options) {
    // Image embedding implementation
    // Must produce vectors in the same space as doEmbed()
  }
}
```

## Use Cases
- **Image Search:** Store image embeddings in VectorDB and search with text queries.
- **Reverse Image Search:** Find similar images by embedding a query image and searching the vector database.
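At its core, reverse image search is a nearest-neighbor query over stored vectors, which is what the vector database does for you. A minimal self-contained sketch of that ranking step, in plain TypeScript with mock low-dimensional vectors instead of real 512-dim CLIP embeddings and no LocalMode APIs (`cosine` and `topK` are hypothetical helper names):

```typescript
// Cosine similarity between two equal-length, non-zero vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored image vectors by similarity to the query vector (top-k search).
function topK(
  query: number[],
  index: { id: string; vector: number[] }[],
  k: number,
): { id: string; score: number }[] {
  return index
    .map(({ id, vector }) => ({ id, score: cosine(query, vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Mock 3-dimensional "image embeddings" standing in for indexed photos.
const index = [
  { id: 'beach.jpg', vector: [0.9, 0.1, 0.0] },
  { id: 'forest.jpg', vector: [0.0, 0.9, 0.3] },
  { id: 'city.jpg', vector: [0.1, 0.2, 0.9] },
];

// Embedding a query photo that resembles beach.jpg ranks it first.
const results = topK([0.8, 0.2, 0.1], index, 2);
console.log(results[0].id); // → 'beach.jpg'
```

In practice you would replace the mock vectors with `embedImage()` output and `topK` with `db.search()`, as in the VectorDB example below.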
### Image Search with VectorDB
```ts
import { embed, embedImage, createVectorDB } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch32');

// Create a database for image embeddings
const db = await createVectorDB({ name: 'images', dimensions: 512 });

// Index images
for (const { id, blob, title } of photos) {
  const { embedding } = await embedImage({ model, image: blob });
  await db.add({ id, vector: embedding, metadata: { title } });
}

// Search with text
const { embedding: queryVec } = await embed({ model, value: 'sunset at the beach' });
const results = await db.search(queryVec, { k: 5 });
```

## Testing
Use `createMockMultimodalEmbeddingModel()` for testing:
```ts
import { createMockMultimodalEmbeddingModel, embedImage } from '@localmode/core';

const model = createMockMultimodalEmbeddingModel({ dimensions: 512 });

const { embedding } = await embedImage({ model, image: testBlob });
expect(embedding).toBeInstanceOf(Float32Array);
expect(embedding.length).toBe(512);
```

## Provider Guide
For recommended models, provider-specific options, and practical recipes, see the Transformers Multimodal Embeddings guide.