
Reranking

Improve RAG accuracy with cross-encoder reranking.

Reranking uses cross-encoder models to improve the relevance of search results. It's particularly useful for RAG pipelines.

For full API reference (rerank(), options, result types, and custom providers), see the Core Reranking guide.

See it in action

Try Semantic Search for a working demo.

Why Reranking?

Bi-encoder (embedding) models are fast but may miss subtle relevance signals. Cross-encoder rerankers consider query-document pairs together for better accuracy.

Query: "How does photosynthesis work?"

Initial Ranking (embeddings):
1. "Photosynthesis is a process used by plants" ✓
2. "The synthesis of proteins requires energy" ✗
3. "Plants convert sunlight into chemical energy" ✓

After Reranking:
1. "Plants convert sunlight into chemical energy" ✓ (more specific)
2. "Photosynthesis is a process used by plants" ✓
3. "The synthesis of proteins requires energy" ✗
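The reordering above can be sketched in plain TypeScript. The scores here are illustrative stand-ins for what a cross-encoder would produce by scoring each query-document pair jointly; they are not real model output:

```typescript
const candidates = [
  'Photosynthesis is a process used by plants',
  'The synthesis of proteins requires energy',
  'Plants convert sunlight into chemical energy',
];

// Illustrative cross-encoder scores, one per query-document pair.
const scores = [0.82, 0.05, 0.91];

// Attach each score, then sort descending — the reranked order.
const reranked = candidates
  .map((text, index) => ({ text, index, score: scores[index] }))
  .sort((a, b) => b.score - a.score);

console.log(reranked[0].text); // the more specific document wins
```

Keeping the original `index` on each result is what lets you map reranked entries back to their source records, as the patterns below do.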

Model                            Size     Speed   Quality
Xenova/ms-marco-MiniLM-L-6-v2    ~23MB    ⚡⚡⚡     Good (recommended for browser)
Xenova/ms-marco-MiniLM-L-12-v2   ~33MB    ⚡⚡      Better
Xenova/bge-reranker-base         ~279MB   ⚡       Best quality (large download)
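Choosing between these models can be reduced to a download-size budget. The helper below is a hypothetical sketch (`pickModel` is not part of the library); it uses the sizes from the table:

```typescript
// Model catalog, taken from the comparison table above.
const models = [
  { name: 'Xenova/ms-marco-MiniLM-L-6-v2', sizeMB: 23 },
  { name: 'Xenova/ms-marco-MiniLM-L-12-v2', sizeMB: 33 },
  { name: 'Xenova/bge-reranker-base', sizeMB: 279 },
] as const;

// Pick the largest (highest-quality) model that fits the budget;
// fall back to the smallest if nothing fits.
function pickModel(budgetMB: number): string {
  const fits = models.filter((m) => m.sizeMB <= budgetMB);
  return (fits.at(-1) ?? models[0]).name;
}

console.log(pickModel(50)); // → 'Xenova/ms-marco-MiniLM-L-12-v2'
```

The catalog is ordered smallest to largest, so "largest that fits" is simply the last match.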

Typical RAG reranking pattern:

import { semanticSearch, rerank } from '@localmode/core';

async function searchWithReranking(query: string) {
  // Step 1: Fast semantic search (retrieve many candidates)
  const candidates = await semanticSearch({
    db,
    model: embeddingModel,
    query,
    k: 20,  // Get more candidates than needed
  });

  // Step 2: Rerank for accuracy (keep top results)
  const { results } = await rerank({
    model: rerankerModel,
    query,
    documents: candidates.map((c) => c.metadata.text as string),
    topK: 5,
  });

  // Step 3: Map back to original results with metadata
  return results.map((r) => ({
    ...candidates[r.index],
    rerankerScore: r.score,
  }));
}

Complete RAG Example

import {
  createVectorDB,
  semanticSearch,
  rerank,
  streamText,
} from '@localmode/core';
import { transformers } from '@localmode/transformers';
import { webllm } from '@localmode/webllm';

// Setup models
const embeddingModel = transformers.embedding('Xenova/bge-small-en-v1.5');
const rerankerModel = transformers.reranker('Xenova/ms-marco-MiniLM-L-6-v2');
const llm = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');

// Setup database
const db = await createVectorDB({ name: 'docs', dimensions: 384 });

// RAG query function
async function ragQuery(question: string) {
  // 1. Retrieve (fast, approximate)
  const candidates = await semanticSearch({
    db,
    model: embeddingModel,
    query: question,
    k: 15,
  });

  // 2. Rerank (slower, accurate)
  const { results: reranked } = await rerank({
    model: rerankerModel,
    query: question,
    documents: candidates.map((c) => c.metadata.text as string),
    topK: 3,
  });

  // 3. Generate answer
  const context = reranked
    .map((r) => candidates[r.index].metadata.text as string)
    .join('\n\n');

  const stream = await streamText({
    model: llm,
    prompt: `Answer based on the context:

Context:
${context}

Question: ${question}

Answer:`,
  });

  return stream;
}

When to Use Reranking

Use Reranking When

  • Building Q&A or chatbot applications
  • Initial search returns many similar results
  • Accuracy matters more than latency
  • Documents have subtle relevance differences

Skip Reranking When

  • Latency is critical (real-time applications)
  • Results are clearly distinct
  • Simple keyword matching is sufficient
  • Processing very large result sets
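One way to act on these trade-offs at runtime is to rerank only when the initial retrieval is ambiguous. The check below is a hypothetical heuristic, not a library feature; it assumes candidates arrive sorted by descending embedding score, and the 0.15 margin is illustrative:

```typescript
interface Candidate {
  text: string;
  score: number; // embedding similarity, sorted descending
}

// Rerank only when the top two candidates are too close to call.
function needsReranking(candidates: Candidate[], margin = 0.15): boolean {
  if (candidates.length < 2) return false;
  return candidates[0].score - candidates[1].score < margin;
}

needsReranking([
  { text: 'a', score: 0.92 },
  { text: 'b', score: 0.55 },
]); // false — clear winner, skip the reranker

needsReranking([
  { text: 'a', score: 0.80 },
  { text: 'b', score: 0.78 },
]); // true — too close, rerank
```

Tune the margin against your own data; the right value depends on the embedding model's score distribution.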

Performance Optimization

// Balance between accuracy and speed
const reranked = await rerank({
  model: rerankerModel,
  query,
  documents: candidates.slice(0, 10),  // Limit candidates
  topK: 3,
});

// For large result sets, rerank in batches
async function rerankLargeResultSet(query: string, documents: string[], topK: number) {
  const batchSize = 50;
  const batches = [];

  for (let i = 0; i < documents.length; i += batchSize) {
    const batch = documents.slice(i, i + batchSize);
    const { results: batchResults } = await rerank({
      model: rerankerModel,
      query,
      documents: batch,
      topK: Math.min(topK, batch.length),
    });
    batches.push(batchResults.map((r) => ({
      ...r,
      index: r.index + i,
    })));
  }

  // Merge and re-sort
  return batches
    .flat()
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

Best Practices

Reranking Tips

  1. Retrieve more, rerank less — Get 3-5x more candidates than needed
  2. Use appropriate topK — 3-5 is usually enough for RAG context
  3. Cache reranker model — Load once, reuse for all queries
  4. Consider latency budget — Reranking adds 50-200ms per query
  5. Test with/without — Measure accuracy improvement for your use case
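Tip 3 (cache the reranker model) can be implemented as a lazy singleton. In the sketch below, `createModel` stands in for a call like `transformers.reranker(...)`; the counter only exists to make the single load observable:

```typescript
// Create the model on first use, then reuse it for every query.
function lazy<T>(createModel: () => T): () => T {
  let instance: T | undefined;
  return () => (instance ??= createModel());
}

let loads = 0;
const getReranker = lazy(() => {
  loads += 1; // in real code: transformers.reranker('Xenova/ms-marco-MiniLM-L-6-v2')
  return { name: 'reranker' };
});

getReranker();
getReranker();
// loads === 1: the model was constructed only once
```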

Showcase Apps

App              Description                                    Links
Semantic Search  Rerank search results for improved relevance   Demo · Source
