
Multimodal Embeddings

Embed text and images into the same vector space for cross-modal similarity search.

Multimodal embeddings map both text and images into the same vector space. This enables cross-modal search: find images using text queries, or find text descriptions that match an image.
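Because text and image vectors live in one space, "how well does this caption match this image" reduces to a vector comparison. As a minimal sketch, a cosine-similarity helper in plain TypeScript (illustrative only, not a `@localmode/core` export) looks like:

```typescript
// Illustrative helper: cosine similarity between two embeddings in the
// shared vector space. Works for any pair of same-dimension vectors,
// regardless of whether they came from text or images.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error('Dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

A text embedding and an image embedding from the same model can be passed straight to this function; a higher score means the caption and the image are closer in the shared space.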

See it in action

Try Cross-Modal Search for a working demo of these APIs.

Provider Required

Multimodal embedding requires a provider that implements MultimodalEmbeddingModel. Currently supported: @localmode/transformers with CLIP/SigLIP models.

embedImage()

Embed a single image:

import { embedImage } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch32');

const { embedding, usage, response } = await embedImage({
  model,
  image: imageBlob, // Blob, ArrayBuffer, URL string, or data URI
});

console.log('Dimensions:', embedding.length); // 512
Pass an abortSignal to cancel a long-running call:

const controller = new AbortController();
setTimeout(() => controller.abort(), 10000);

const { embedding } = await embedImage({
  model,
  image: largeImage,
  abortSignal: controller.signal,
});

EmbedImageOptions

| Prop | Type |
| --- | --- |
| model | MultimodalEmbeddingModel |
| image | Blob \| ArrayBuffer \| string (URL or data URI) |
| abortSignal? | AbortSignal |

EmbedImageResult

| Prop | Type |
| --- | --- |
| embedding | Float32Array |
| usage | usage information for the call |
| response | raw provider response |

embedManyImages()

Embed multiple images in a batch:

import { embedManyImages } from '@localmode/core';

const { embeddings, usage } = await embedManyImages({
  model,
  images: [image1, image2, image3],
});

console.log(embeddings.length); // 3
console.log(embeddings[0].length); // 512
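A batch of image embeddings pairs naturally with a ranking step against a single query vector. The following helper is a sketch (it assumes cosine similarity; `rankByQuery` is illustrative, not a `@localmode/core` export):

```typescript
// Illustrative helper: rank a batch of image embeddings against a query
// embedding, best match first. Returns the indices into the batch.
function rankByQuery(query: Float32Array, embeddings: Float32Array[]): number[] {
  const cos = (a: Float32Array, b: Float32Array): number => {
    let dot = 0;
    let na = 0;
    let nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  };
  return embeddings
    .map((e, index) => ({ index, score: cos(query, e) }))
    .sort((x, y) => y.score - x.score)
    .map((r) => r.index);
}
```

With the batch result above, `rankByQuery(queryVec, embeddings)` returns the image indices ordered by relevance to a text query.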

EmbedManyImagesOptions

| Prop | Type |
| --- | --- |
| model | MultimodalEmbeddingModel |
| images | Array of image inputs (Blob, ArrayBuffer, URL string, or data URI) |

MultimodalEmbeddingModel Interface

The MultimodalEmbeddingModel interface extends EmbeddingModel<string>, meaning it works as a standard text embedding model everywhere embed() is used. It adds image (and optionally audio) embedding capabilities.

interface MultimodalEmbeddingModel extends EmbeddingModel<string> {
  readonly supportedModalities: ('text' | 'image' | 'audio')[];
  doEmbedImage(options: DoEmbedImageOptions): Promise<DoEmbedResult>;
  doEmbedAudio?(options: DoEmbedAudioOptions): Promise<DoEmbedResult>;
}
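Because doEmbedAudio is optional, callers should check both supportedModalities and the method itself before relying on audio support. A guard might look like this (the structural type below mirrors only the fields shown above; the helper is illustrative, not part of the library):

```typescript
// Minimal structural slice of the interface, for illustration only.
interface ModalityInfo {
  supportedModalities: ('text' | 'image' | 'audio')[];
  doEmbedAudio?: unknown;
}

// True only when the model both declares audio support and actually
// implements the optional doEmbedAudio method.
function canEmbedAudio(model: ModalityInfo): boolean {
  return (
    model.supportedModalities.includes('audio') &&
    typeof model.doEmbedAudio === 'function'
  );
}
```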

Custom Implementation

You can implement MultimodalEmbeddingModel to create your own providers:

import type { MultimodalEmbeddingModel } from '@localmode/core';

class MyMultimodalEmbedder implements MultimodalEmbeddingModel {
  readonly modelId = 'custom:my-clip';
  readonly provider = 'custom';
  readonly dimensions = 512;
  readonly maxEmbeddingsPerCall = 32;
  readonly supportsParallelCalls = false;
  readonly supportedModalities: ('text' | 'image' | 'audio')[] = ['text', 'image'];

  async doEmbed(options) {
    // Text embedding implementation
  }

  async doEmbedImage(options) {
    // Image embedding implementation
    // Must produce vectors in the same space as doEmbed()
  }
}
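CLIP-style models typically return L2-normalized vectors. If a custom implementation does not, normalizing before storage keeps cosine and dot-product ranking equivalent across both modalities. A small helper (illustrative, not a `@localmode/core` export):

```typescript
// Illustrative: return an L2-normalized copy of an embedding.
function l2Normalize(v: Float32Array): Float32Array {
  let sum = 0;
  for (let i = 0; i < v.length; i++) sum += v[i] * v[i];
  const norm = Math.sqrt(sum);
  if (norm === 0) return v.slice(); // zero vector: avoid division by zero
  const out = new Float32Array(v.length);
  for (let i = 0; i < v.length; i++) out[i] = v[i] / norm;
  return out;
}
```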

Use Cases

Image Search with VectorDB

import { embed, embedImage, createVectorDB } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch32');

// Create a database for image embeddings
const db = await createVectorDB({ name: 'images', dimensions: 512 });

// Index images
for (const { id, blob, title } of photos) {
  const { embedding } = await embedImage({ model, image: blob });
  await db.add({ id, vector: embedding, metadata: { title } });
}

// Search with text
const { embedding: queryVec } = await embed({ model, value: 'sunset at the beach' });
const results = await db.search(queryVec, { k: 5 });

Testing

Use createMockMultimodalEmbeddingModel() for testing:

import { createMockMultimodalEmbeddingModel, embedImage } from '@localmode/core';

const model = createMockMultimodalEmbeddingModel({ dimensions: 512 });

const { embedding } = await embedImage({ model, image: testBlob });
expect(embedding).toBeInstanceOf(Float32Array);
expect(embedding.length).toBe(512);

Provider Guide

For recommended models, provider-specific options, and practical recipes, see the Transformers Multimodal Embeddings guide.

Showcase Apps

| App | Description | Links |
| --- | --- | --- |
| Cross-Modal Search | Text-to-image and image-to-image search with CLIP | Demo · Source |
