
Semantic Cache

Cache LLM responses using embedding similarity for instant repeat queries.

When a user asks a question that is semantically similar to a previously answered one, the cached response is returned in under 50ms instead of waiting 5-30 seconds for model inference.

See it in action

Try PDF Search and LLM Chat for working demos of these APIs.

createSemanticCache()

Create a semantic cache instance:

import { createSemanticCache } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const cache = await createSemanticCache({
  embeddingModel: transformers.embedding('Xenova/bge-small-en-v1.5'),
  threshold: 0.92,   // Cosine similarity threshold for cache hits
  maxEntries: 100,    // Maximum cached entries (LRU eviction)
  ttlMs: 3600000,     // 1 hour TTL
});

SemanticCacheConfig

| Prop | Type | Description |
| --- | --- | --- |
| embeddingModel | embedding model | Model used to embed prompts for similarity comparison |
| threshold | number | Cosine similarity threshold for cache hits (default 0.92) |
| maxEntries | number | Maximum cached entries before LRU eviction |
| ttlMs | number | Entry time-to-live in milliseconds |
| normalize | boolean | Normalize prompts before embedding (default true) |
lookup()

Look up a cached response for a prompt:

const result = await cache.lookup({
  prompt: 'Explain machine learning',
  modelId: 'llama-3.2',
});

if (result.hit) {
  console.log(result.response);  // Cached response text
  console.log(result.score);     // Similarity score (e.g., 0.95)
  console.log(result.durationMs); // Lookup time in ms
}

CacheLookupResult

| Prop | Type | Description |
| --- | --- | --- |
| hit | boolean | Whether a cached response was found |
| response | string | Cached response text (present on hit) |
| score | number | Similarity score of the matched entry |
| durationMs | number | Lookup time in milliseconds |

Lookup behavior

  1. Exact match fast path -- If the normalized prompt exactly matches a cached entry, the response is returned without computing an embedding.
  2. Embedding similarity -- Otherwise, the prompt is embedded and compared against cached entries via an HNSW index.
  3. Model filtering -- Only entries from the same modelId are considered.
  4. TTL check -- Expired entries are removed lazily and treated as misses.
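The four steps above can be sketched as follows. This is an illustrative model, not the real implementation: a linear scan stands in for the HNSW index, and the Entry shape is assumed:

```typescript
type Entry = {
  id: string;
  normalizedPrompt: string;
  embedding: number[];
  response: string;
  modelId: string;
  expiresAt: number; // epoch ms
};

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function sketchLookup(
  entries: Entry[],
  normalizedPrompt: string,
  embedding: number[],
  modelId: string,
  threshold: number,
  now: number,
): Entry | undefined {
  // Steps 3 + 4: restrict to the same model and drop expired entries (lazy TTL).
  const live = entries.filter(e => e.modelId === modelId && e.expiresAt > now);
  // Step 1: exact-match fast path on the normalized prompt.
  const exact = live.find(e => e.normalizedPrompt === normalizedPrompt);
  if (exact) return exact;
  // Step 2: fall back to embedding similarity; keep the best entry above threshold.
  let best: Entry | undefined;
  let bestScore = threshold;
  for (const e of live) {
    const score = cosine(e.embedding, embedding);
    if (score >= bestScore) { bestScore = score; best = e; }
  }
  return best;
}
```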

store()

Store a prompt-response pair:

const { entryId } = await cache.store({
  prompt: 'What is machine learning?',
  response: 'Machine learning is a subset of AI that...',
  modelId: 'llama-3.2',
});

When the cache reaches maxEntries, the least-recently-accessed entry is evicted before the new entry is stored.
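The LRU policy can be sketched with a Map, whose insertion order doubles as a recency order (re-inserting a key on access moves it to the back). This is an assumption about the mechanism, not the library's actual code:

```typescript
class LruSketch<V> {
  private map = new Map<string, V>();
  constructor(private maxEntries: number) {}

  get(key: string): V | undefined {
    const value = this.map.get(key);
    if (value !== undefined) {
      // Refresh recency by moving the key to the back of the iteration order.
      this.map.delete(key);
      this.map.set(key, value);
    }
    return value;
  }

  // Returns the key of the evicted entry, if any.
  set(key: string, value: V): string | undefined {
    let evicted: string | undefined;
    if (!this.map.has(key) && this.map.size >= this.maxEntries) {
      // The first key in iteration order is the least recently accessed.
      evicted = this.map.keys().next().value;
      this.map.delete(evicted!);
    }
    this.map.delete(key);
    this.map.set(key, value);
    return evicted;
  }
}
```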

clear()

Clear cached entries:

// Clear all entries
await cache.clear();

// Clear entries for a specific model only
await cache.clear({ modelId: 'llama-3.2' });

stats()

Get cache statistics:

const stats = cache.stats();
console.log(stats.entries);   // Number of cached entries
console.log(stats.hits);      // Total cache hits
console.log(stats.misses);    // Total cache misses
console.log(stats.hitRate);   // Hit rate (0-1)

CacheStats

| Prop | Type | Description |
| --- | --- | --- |
| entries | number | Number of cached entries |
| hits | number | Total cache hits |
| misses | number | Total cache misses |
| hitRate | number | Hit rate (0-1) |
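hitRate is presumably derived as hits over total lookups; a tiny sketch of that assumption:

```typescript
// Hit rate in [0, 1]; 0 when no lookups have happened yet.
function hitRate(hits: number, misses: number): number {
  const total = hits + misses;
  return total === 0 ? 0 : hits / total;
}
```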

destroy()

Release all resources:

await cache.destroy();
// All subsequent operations will throw SemanticCacheError

semanticCacheMiddleware()

Transparently cache generateText() and streamText() calls using Language Model Middleware:

import {
  createSemanticCache,
  semanticCacheMiddleware,
  wrapLanguageModel,
  generateText,
} from '@localmode/core';
import { transformers } from '@localmode/transformers';
import { webllm } from '@localmode/webllm';

// Create cache
const cache = await createSemanticCache({
  embeddingModel: transformers.embedding('Xenova/bge-small-en-v1.5'),
});

// Wrap model with cache middleware
const cachedModel = wrapLanguageModel({
  model: webllm.languageModel('Llama-3.2-1B-Instruct-q4f16'),
  middleware: semanticCacheMiddleware(cache),
});

// First call: model generates, result is cached (~5-30s)
const result1 = await generateText({ model: cachedModel, prompt: 'What is AI?' });

// Second call with similar prompt: cached response (~50ms)
const result2 = await generateText({ model: cachedModel, prompt: 'Explain AI' });

How middleware works

| Operation | Cache hit | Cache miss |
| --- | --- | --- |
| doGenerate | Returns cached text with finishReason: 'stop' and zero-token usage | Calls model, stores result, returns normally |
| doStream | Yields a single chunk with the full cached text and done: true | Streams from model, buffers text, stores after completion |
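The hit/miss flow for doGenerate can be sketched as a simple wrapper. The types here are simplified stand-ins for the real middleware API, and an exact-match Map stands in for the semantic lookup:

```typescript
type GenerateFn = (prompt: string) => Promise<string>;

function withCacheSketch(
  lookup: (prompt: string) => string | undefined,
  store: (prompt: string, response: string) => void,
  doGenerate: GenerateFn,
): GenerateFn {
  return async (prompt) => {
    const cached = lookup(prompt);
    if (cached !== undefined) return cached;   // hit: skip the model entirely
    const response = await doGenerate(prompt); // miss: call the model...
    store(prompt, response);                   // ...then cache the result
    return response;
  };
}
```

doStream follows the same shape, except the miss path buffers streamed chunks and stores the concatenated text once the stream completes.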

TTL and LRU Behavior

  • TTL expiration is checked lazily during lookup(). Expired entries are removed when accessed.
  • LRU eviction occurs during store() when maxEntries is reached. The least-recently-accessed entry is evicted.
  • Expired entries continue to occupy memory until they are accessed, but total memory use is still bounded by maxEntries.

Prompt Normalization

When normalize: true (default), prompts are normalized before embedding:

| Input | Normalized |
| --- | --- |
| " What is AI? " | "what is ai?" |
| "EXPLAIN\nMACHINE\tLEARNING" | "explain machine learning" |

This prevents trivial mismatches from whitespace, casing, or formatting differences.
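The examples above suggest normalization amounts to trimming, lowercasing, and collapsing whitespace runs. A sketch of that assumed behavior:

```typescript
// Trim, lowercase, and collapse runs of whitespace (spaces, tabs,
// newlines) to single spaces before embedding.
function normalizePrompt(prompt: string): string {
  return prompt.trim().toLowerCase().replace(/\s+/g, ' ');
}
```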

Events

The cache emits events via globalEventBus:

import { globalEventBus } from '@localmode/core';

globalEventBus.on('cacheHit', ({ prompt, score, modelId, entryId }) => {
  console.log(`Cache hit: ${prompt} (score: ${score})`);
});

globalEventBus.on('cacheMiss', ({ prompt, modelId }) => {
  console.log(`Cache miss: ${prompt}`);
});

globalEventBus.on('cacheStore', ({ prompt, modelId, entryId }) => {
  console.log(`Cached: ${prompt}`);
});

globalEventBus.on('cacheEvict', ({ entryId, reason }) => {
  console.log(`Evicted: ${entryId} (${reason})`);
});

globalEventBus.on('cacheClear', ({ entriesRemoved }) => {
  console.log(`Cleared ${entriesRemoved} entries`);
});

Error Handling

import { SemanticCacheError } from '@localmode/core';

try {
  await cache.lookup({ prompt: 'test', modelId: 'model' });
} catch (error) {
  if (error instanceof SemanticCacheError) {
    console.log(error.code);
    // 'CACHE_DESTROYED' | 'CACHE_LOOKUP_FAILED' | 'CACHE_STORE_FAILED'
    console.log(error.hint);
  }
}

React Hook

Use useSemanticCache() from @localmode/react for automatic lifecycle management:

import { useSemanticCache } from '@localmode/react';
import { transformers } from '@localmode/transformers';

function CachedApp() {
  const { cache, stats, isLoading } = useSemanticCache({
    embeddingModel: transformers.embedding('Xenova/bge-small-en-v1.5'),
  });

  if (isLoading || !cache) return <p>Initializing cache...</p>;

  return (
    <div>
      <p>Entries: {stats.entries} | Hit rate: {(stats.hitRate * 100).toFixed(1)}%</p>
    </div>
  );
}

The hook creates the cache on mount and calls destroy() on unmount.

Recommended Configuration

  • Use the same embedding model you already have loaded for search/RAG to avoid loading a second model
  • Start with the default threshold (0.92) and adjust based on your use case
  • Set maxEntries based on expected unique queries per session (100 is suitable for most apps)
  • Use storage: 'indexeddb' only if you need cache persistence across page reloads
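Putting those recommendations together, a persistent configuration might look like this (a sketch; the exact shape of the storage option is assumed from the note above):

```typescript
import { createSemanticCache } from '@localmode/core';
import { transformers } from '@localmode/transformers';

// Reuse the embedding model already loaded for search/RAG, keep the
// default threshold, and persist entries across page reloads.
const persistentCache = await createSemanticCache({
  embeddingModel: transformers.embedding('Xenova/bge-small-en-v1.5'),
  threshold: 0.92,
  maxEntries: 100,
  storage: 'indexeddb',
});
```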

Showcase Apps

| App | Description | Links |
| --- | --- | --- |
| PDF Search | Cache repeated PDF queries for faster responses | Demo · Source |
| LLM Chat | Semantic caching for LLM conversation responses | Demo · Source |
