What is the best model for multimodal search in the browser?

Xenova/clip-vit-base-patch32 (~340MB, 512 dimensions) is recommended for most applications. For higher quality, Xenova/siglip-base-patch16-224 (~400MB, 768 dimensions) offers improved accuracy at a slightly larger download.

Does browser-based multimodal search work offline?

Yes. After the initial model download (340-400MB depending on the model), multimodal embedding and search work completely offline. All images are processed on-device with no cloud upload required.

What are practical applications of multimodal embeddings in the browser?

Common uses include photo search with natural language queries, e-commerce visual similarity search, content recommendation across media types, image-text matching for accessibility, and reverse image search.

Multimodal Embeddings (CLIP/SigLIP) in the Browser

How do CLIP and SigLIP multimodal embeddings work?

They map both text and images into the same vector space so matching pairs are close together. You can search an image collection using natural language queries like 'sunset over mountains' or find images visually similar to a reference image.

Embed text and images into a shared vector space - search photos with words, find similar images, cross-modal retrieval.

What Are Multimodal Embeddings (CLIP/SigLIP)?

Multimodal embeddings map both text and images into the same vector space, enabling cross-modal similarity. CLIP (OpenAI) and SigLIP (Google) encode images and text descriptions such that matching pairs are close together. You can search an image collection using natural language ("sunset over mountains") or find images similar to a reference image.

Image embedding is exposed through the embedImage() function in @localmode/core; text queries go through embed() with the same model. All processing runs entirely in the browser - no server, no API key, no data leaves the device. After the initial model download, multimodal embedding (CLIP/SigLIP) works completely offline.
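
To make "close together" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The function name and the 3-dimensional toy vectors are illustrative assumptions, not LocalMode APIs or real model output; real CLIP and SigLIP vectors have 512 or 768 dimensions.

```typescript
// Cosine similarity: 1 means same direction, values near 0 mean nearly orthogonal.
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat an all-zero vector as unrelated to everything instead of dividing by zero.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Hypothetical toy vectors standing in for real embeddings.
const sunsetText = [0.9, 0.1, 0.0];  // text: "sunset over mountains"
const sunsetPhoto = [0.8, 0.2, 0.1]; // a matching photo
const catPhoto = [0.0, 0.2, 0.9];    // an unrelated photo

const matchScore = cosineSimilarity(sunsetText, sunsetPhoto);
const mismatchScore = cosineSimilarity(sunsetText, catPhoto);
console.log(matchScore > mismatchScore); // true
```

The matching text-image pair scores close to 1, while the unrelated pair scores close to 0. That gap is what makes text-to-image search possible.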

Real-World Applications

Photo search with natural language queries
E-commerce visual similarity ("find products that look like this")
Content recommendation across media types
Image-text matching for accessibility
Design inspiration tools
Reverse image search

These use cases all benefit from local, on-device processing: user data stays private, there are no per-request API costs, and the application works without internet after initial setup.

Getting Started

Install the required packages:

npm install @localmode/core @localmode/transformers

Import the core function and provider:

import { embedImage } from '@localmode/core';
import { transformers } from '@localmode/transformers';

The recommended starting model is Xenova/clip-vit-base-patch32 - it provides the best balance of quality, speed, and download size for most applications.

Code Example

import { embed, embedImage, createVectorDB } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.multimodalEmbedding('Xenova/clip-vit-base-patch32');

// Embed an image
const { embedding: imageVec } = await embedImage({ model, image: photoFile });

// Embed a text query (same model, same vector space)
const { embedding: textVec } = await embed({ model, value: 'sunset over the ocean' });

// Store image embeddings and search with text
const db = await createVectorDB({ name: 'photos', dimensions: 512 });
await db.add({ id: 'photo1', vector: imageVec, metadata: { file: 'sunset.jpg' } });
const results = await db.search(textVec, { k: 5 });

This example demonstrates the core workflow: create a model instance from the provider, embed an image with embedImage(), embed a text query with embed() using the same model, store the image vector in a vector database, and search it with the text vector. Transformers.js is currently the only available provider.
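
For intuition about what a text-to-image search does with those vectors, the sketch below ranks stored vectors by similarity to a query using a brute-force scan. It is a conceptual illustration only, not how createVectorDB is implemented; every name in it (StoredItem, normalize, searchTopK) and the toy 3-dimensional data are hypothetical.

```typescript
interface StoredItem {
  id: string;
  vector: number[]; // unit-length embedding
}

interface Match {
  id: string;
  score: number; // cosine similarity, since both sides are unit length
}

// Scale a vector to unit length so that a dot product equals cosine similarity.
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

function dotProduct(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Score every stored item against the query and keep the k best.
function searchTopK(query: number[], items: StoredItem[], k: number): Match[] {
  const q = normalize(query);
  return items
    .map((item) => ({ id: item.id, score: dotProduct(q, item.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy data standing in for real image embeddings.
const items: StoredItem[] = [
  { id: 'sunset.jpg', vector: normalize([0.8, 0.2, 0.1]) },
  { id: 'cat.jpg', vector: normalize([0.0, 0.2, 0.9]) },
  { id: 'beach.jpg', vector: normalize([0.6, 0.5, 0.1]) },
];

// Hypothetical text-query embedding.
const topMatches = searchTopK([0.9, 0.1, 0.0], items, 2);
console.log(topMatches.map((m) => m.id)); // [ 'sunset.jpg', 'beach.jpg' ]
```

Normalizing once at insertion time keeps each query to a single dot product per stored vector, which is usually sufficient for collections of a few thousand images.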

Available Models

The following models support multimodal embeddings (CLIP/SigLIP) through LocalMode. Choose based on your target device, acceptable download size, and quality requirements.

Model	Provider	Dimensions	Size	Speed	Quality
Xenova/clip-vit-base-patch32	Transformers.js	512	~340MB	Fast	Good
Xenova/clip-vit-base-patch16	Transformers.js	512	~340MB	Medium	Good
Xenova/siglip-base-patch16-224	Transformers.js	768	~400MB	Medium	High

Choosing a model: For most applications, start with the recommended model (Xenova/clip-vit-base-patch32). If download size is the primary constraint (e.g., mobile PWA, browser extension), pick the smallest model that meets your quality bar. If quality is the priority (e.g., enterprise search, content analysis), use the largest model your target devices can handle. Note that SigLIP produces 768-dimensional embeddings - set your VectorDB dimensions accordingly.
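
One way to avoid a dimension mismatch when switching models is to keep the dimensions next to the model IDs and derive the vector store size from that lookup. The sketch below is a hypothetical helper, not part of LocalMode; the numbers are copied from the table above.

```typescript
// Dimensions copied from the model table in this guide.
const MODEL_DIMENSIONS = {
  'Xenova/clip-vit-base-patch32': 512,
  'Xenova/clip-vit-base-patch16': 512,
  'Xenova/siglip-base-patch16-224': 768,
} as const;

type ModelId = keyof typeof MODEL_DIMENSIONS;

// Look up the embedding size to use when creating the vector store.
function dimensionsFor(modelId: ModelId): number {
  return MODEL_DIMENSIONS[modelId];
}

// Fail early if a vector does not match the size the store was created with.
function assertDimensions(modelId: ModelId, vector: ArrayLike<number>): void {
  const expected = dimensionsFor(modelId);
  if (vector.length !== expected) {
    throw new Error(`${modelId} expects ${expected} dimensions, got ${vector.length}`);
  }
}

const selectedModel: ModelId = 'Xenova/siglip-base-patch16-224';
console.log(dimensionsFor(selectedModel)); // 768
```

The result of dimensionsFor(selectedModel) can then be passed as the dimensions option to createVectorDB, as in the Code Example section, instead of hard-coding 512.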

Cloud vs Local: Cost and Privacy Comparison

Running multimodal embeddings (CLIP/SigLIP) locally eliminates per-request API costs and keeps all data on-device. Here is how the economics compare:

Service	Cost / Notes
OpenAI CLIP API	Not publicly available
Google Multimodal Embedding	Varies by usage
Running CLIP/SigLIP locally via LocalMode	$0 after model download (340-400MB)

With local inference, all images are processed on-device - no cloud upload required.

The break-even point for most applications is low: if you process more than a few hundred requests per day, local inference costs less than any cloud API within the first week. For privacy-sensitive applications (medical records, legal documents, financial data), the cost comparison is secondary - the ability to process data without it ever leaving the device is the primary value.

Available Providers

Transformers.js - ONNX-optimized models via ONNX Runtime Web. Supports both WebGPU and WASM backends. Broadest model catalog for non-LLM tasks.

AbortSignal Support

All embedImage() calls support cancellation through the standard AbortSignal API:

const controller = new AbortController();

const promise = embedImage({
  model,
  image: imageFile,
  abortSignal: controller.signal,
});

// Cancel if needed (e.g., user navigates away)
controller.abort();

This is essential for responsive UIs - cancel in-flight operations when the user navigates away, submits a new query, or closes a dialog. The underlying model inference stops immediately, freeing memory and compute resources.
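
A common pattern for the "new query replaces the old one" case is to keep a single controller per UI surface and abort it whenever a new request starts. The sketch below uses only the standard AbortController API; the LatestRequest class is a hypothetical helper, not something LocalMode provides.

```typescript
// Hypothetical helper: hands out one AbortSignal per request and aborts the previous one.
class LatestRequest {
  private controller: AbortController | null = null;

  // Abort whatever is in flight and return a fresh signal for the new request.
  next(): AbortSignal {
    this.controller?.abort();
    this.controller = new AbortController();
    return this.controller.signal;
  }

  // Abort the in-flight request without starting a new one (e.g., when a dialog closes).
  cancel(): void {
    this.controller?.abort();
    this.controller = null;
  }
}

const latest = new LatestRequest();

// Each new request would pass its signal along, for example:
//   embedImage({ model, image: imageFile, abortSignal: latest.next() })
const firstSignal = latest.next();
const secondSignal = latest.next(); // starting a second request aborts the first

console.log(firstSignal.aborted, secondSignal.aborted); // true false
```

Because only the most recent signal is ever live, results from stale requests cannot overwrite the UI after a newer query has started.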

React Integration

If you are building a React application, @localmode/react provides hooks that manage loading states, error handling, and cancellation automatically:

npm install @localmode/react

import { useEmbedImage } from '@localmode/react';

The hook returns { data, error, isLoading, execute, cancel, reset } - providing everything a UI component needs to display progress, handle errors, offer cancellation, and reset state.

Methodology

This guide is based on LocalMode's source code and curated model catalog (packages/transformers/src/models.ts, MULTIMODAL_EMBEDDING_MODELS), the CLIP implementation in packages/transformers/src/implementations/clip-embedding.ts, and the useEmbedImage hook in packages/react/src/hooks/use-embed-image.ts. Embedding dimensions were verified against HuggingFace model cards for each listed model. Cloud pricing figures are subject to change - verify current pricing with the provider before making cost decisions. Quality and performance comparisons are general guidance; benchmark with your own data for production use.

Sources

LocalMode core multimodal embeddings API reference
LocalMode transformers multimodal embeddings guide
Xenova/clip-vit-base-patch32 - HuggingFace model card - 512-dimensional ONNX CLIP model
Xenova/clip-vit-base-patch16 - HuggingFace model card - 512-dimensional ONNX CLIP model
Xenova/siglip-base-patch16-224 - HuggingFace model card - 768-dimensional ONNX SigLIP model
Google Cloud Vertex AI Multimodal Embedding pricing
