# Multimodal Embeddings

Embed text and images into the same vector space for cross-modal similarity search.
Multimodal embeddings map both text and images into the same vector space. This enables cross-modal search: find images using text queries, or find text descriptions that match an image.
> **See it in action:** Try Cross-Modal Search for a working demo of these APIs.
> **Provider required:** Multimodal embedding requires a provider that implements `MultimodalEmbeddingModel`. Currently supported: `@localmode/transformers` with CLIP/SigLIP models.
## embedImage()
Embed a single image:
```ts
import { embedImage } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch32');

const { embedding, usage, response } = await embedImage({
  model,
  image: imageBlob, // Blob, ArrayBuffer, URL string, or data URI
});

console.log('Dimensions:', embedding.length); // 512
```

Because text and images share one vector space, you can compare embeddings across modalities directly:

```ts
import { embed, embedImage, cosineSimilarity } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch32');

// Embed text and image with the SAME model
const { embedding: textVec } = await embed({
  model,
  value: 'a sunset over the ocean',
});

const { embedding: imgVec } = await embedImage({
  model,
  image: sunsetPhoto,
});

const similarity = cosineSimilarity(textVec, imgVec);
console.log(`Text-image similarity: ${(similarity * 100).toFixed(1)}%`);
```

Pass an `abortSignal` to cancel a long-running embedding:

```ts
const controller = new AbortController();
setTimeout(() => controller.abort(), 10000);

const { embedding } = await embedImage({
  model,
  image: largeImage,
  abortSignal: controller.signal,
});
```

### EmbedImageOptions
Options: `model` (a `MultimodalEmbeddingModel`), `image` (a Blob, ArrayBuffer, URL string, or data URI), and an optional `abortSignal`.
### EmbedImageResult

Result: the `embedding` (a `Float32Array`), plus `usage` and the raw provider `response`.
## embedManyImages()
Embed multiple images in a batch:
```ts
import { embedManyImages } from '@localmode/core';

const { embeddings, usage } = await embedManyImages({
  model,
  images: [image1, image2, image3],
});

console.log(embeddings.length); // 3
console.log(embeddings[0].length); // 512
```

### EmbedManyImagesOptions
Options: `model` and `images` (an array of image inputs).
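Providers advertise a `maxEmbeddingsPerCall` limit (32 in the custom implementation below), so a batch helper presumably has to split large inputs into provider-sized chunks. A minimal self-contained sketch of that chunking step, in plain TypeScript with no LocalMode APIs (the `chunk` helper name is hypothetical):

```typescript
// Split an array into chunks of at most `size` items, e.g. a provider's
// maxEmbeddingsPerCall limit of 32.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// 70 mock image handles split into batches of at most 32.
const images = Array.from({ length: 70 }, (_, i) => `img-${i}`);
const batches = chunk(images, 32);
console.log(batches.map((b) => b.length)); // → [ 32, 32, 6 ]
```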
## MultimodalEmbeddingModel Interface
The `MultimodalEmbeddingModel` interface extends `EmbeddingModel<string>`, meaning it works as a standard text embedding model everywhere `embed()` is used. It adds image (and optionally audio) embedding capabilities.
```ts
interface MultimodalEmbeddingModel extends EmbeddingModel<string> {
  readonly supportedModalities: ('text' | 'image' | 'audio')[];
  doEmbedImage(options: DoEmbedImageOptions): Promise<DoEmbedResult>;
  doEmbedAudio?(options: DoEmbedAudioOptions): Promise<DoEmbedResult>;
}
```

## Custom Implementation
You can implement `MultimodalEmbeddingModel` to create your own providers:
```ts
import type { MultimodalEmbeddingModel } from '@localmode/core';

class MyMultimodalEmbedder implements MultimodalEmbeddingModel {
  readonly modelId = 'custom:my-clip';
  readonly provider = 'custom';
  readonly dimensions = 512;
  readonly maxEmbeddingsPerCall = 32;
  readonly supportsParallelCalls = false;
  readonly supportedModalities = ['text', 'image'] as const;

  async doEmbed(options) {
    // Text embedding implementation
  }

  async doEmbedImage(options) {
    // Image embedding implementation
    // Must produce vectors in the same space as doEmbed()
  }
}
```

## Use Cases
- **Image Search:** Store image embeddings in VectorDB and search with text queries.
- **Reverse Image Search:** Find similar images by embedding a query image and searching the vector database.
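At its core, reverse image search is a nearest-neighbor query over stored vectors, which is what the vector database does for you. A minimal self-contained sketch of that ranking step, in plain TypeScript with mock low-dimensional vectors instead of real 512-dim CLIP embeddings and no LocalMode APIs (`cosine` and `topK` are hypothetical helper names):

```typescript
// Cosine similarity between two equal-length, non-zero vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored image vectors by similarity to the query vector (top-k search).
function topK(
  query: number[],
  index: { id: string; vector: number[] }[],
  k: number,
): { id: string; score: number }[] {
  return index
    .map(({ id, vector }) => ({ id, score: cosine(query, vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Mock 3-dimensional "image embeddings" standing in for indexed photos.
const index = [
  { id: 'beach.jpg', vector: [0.9, 0.1, 0.0] },
  { id: 'forest.jpg', vector: [0.0, 0.9, 0.3] },
  { id: 'city.jpg', vector: [0.1, 0.2, 0.9] },
];

// Embedding a query photo that resembles beach.jpg ranks it first.
const results = topK([0.8, 0.2, 0.1], index, 2);
console.log(results[0].id); // → 'beach.jpg'
```

In practice you would replace the mock vectors with `embedImage()` output and `topK` with `db.search()`, as in the VectorDB example below.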
### Image Search with VectorDB
```ts
import { embed, embedImage, createVectorDB } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch32');

// Create a database for image embeddings
const db = await createVectorDB({ name: 'images', dimensions: 512 });

// Index images
for (const { id, blob, title } of photos) {
  const { embedding } = await embedImage({ model, image: blob });
  await db.add({ id, vector: embedding, metadata: { title } });
}

// Search with text
const { embedding: queryVec } = await embed({ model, value: 'sunset at the beach' });
const results = await db.search(queryVec, { k: 5 });
```

## Testing
Use `createMockMultimodalEmbeddingModel()` for testing:
```ts
import { createMockMultimodalEmbeddingModel, embedImage } from '@localmode/core';

const model = createMockMultimodalEmbeddingModel({ dimensions: 512 });

const { embedding } = await embedImage({ model, image: testBlob });
expect(embedding).toBeInstanceOf(Float32Array);
expect(embedding.length).toBe(512);
```

## Provider Guide
For recommended models, provider-specific options, and practical recipes, see the Transformers Multimodal Embeddings guide.