RAG

Build retrieval-augmented generation pipelines with chunking, ingestion, and hybrid search.

RAG (Retrieval-Augmented Generation) combines vector search with language models to answer questions from your documents. LocalMode provides all the building blocks: chunking, ingestion, semantic search, reranking, and hybrid search.

See it in action

Try Semantic Search and PDF Search for working demos of these APIs.

RAG Pipeline Overview

Chunk Documents

Split documents into smaller, semantically meaningful pieces.

import { chunk } from '@localmode/core';

const chunks = chunk(documentText, {
  strategy: 'recursive',
  size: 512,
  overlap: 50,
});

Generate Embeddings & Store

Create embeddings and store in a vector database.

import { ingest, createVectorDB } from '@localmode/core';

const db = await createVectorDB({ name: 'docs', dimensions: 384 });
await ingest({ db, model: embeddingModel, documents: chunks });

Search & Retrieve

Find relevant chunks using semantic search.

import { semanticSearch } from '@localmode/core';

const results = await semanticSearch({
  db,
  model: embeddingModel,
  query: userQuestion,
  k: 10,
});

Rerank for Precision

Optionally rerank results for better accuracy.

import { rerank } from '@localmode/core';

const { results: reranked } = await rerank({
  model: rerankerModel,
  query: userQuestion,
  documents: results.map((r) => r.metadata.text),
  topK: 5,
});

Generate Answer

Use an LLM to generate an answer from the context.

import { streamText } from '@localmode/core';

const stream = await streamText({
  model: llm,
  prompt: `Context:\n${context}\n\nQuestion: ${userQuestion}`,
});

Chunking

Split documents into smaller pieces for better retrieval:

ChunkOptions

The separators and language options are strategy-specific. separators only applies to the 'recursive' strategy. language only applies to the 'code' strategy. Both are ignored when using other strategies.
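To see why separator order matters for the 'recursive' strategy, here's a standalone sketch of the general technique — splitting on coarse separators first and recursing to finer ones when a piece is still too large. This is illustrative only, not the library's implementation; real usage goes through chunk(text, { strategy: 'recursive', separators: [...] }).

```typescript
// Minimal sketch of recursive splitting with a separator hierarchy.
// Illustrative only — not the library's implementation.
function recursiveSplit(
  text: string,
  size: number,
  separators: string[] = ['\n\n', '\n', '. ', ' '],
): string[] {
  if (text.length <= size) return [text];
  const [sep, ...rest] = separators;
  if (sep === undefined) {
    // No separators left: hard-split at the size limit.
    const parts: string[] = [];
    for (let i = 0; i < text.length; i += size) parts.push(text.slice(i, i + size));
    return parts;
  }
  const pieces = text.split(sep);
  const chunks: string[] = [];
  let current = '';
  for (const piece of pieces) {
    const candidate = current ? current + sep + piece : piece;
    if (candidate.length <= size) {
      current = candidate;
    } else {
      if (current) chunks.push(current);
      current = '';
      // Recurse with finer separators if a single piece is still too large.
      if (piece.length > size) chunks.push(...recursiveSplit(piece, size, rest));
      else current = piece;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```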

Semantic Chunking

Semantic chunking uses an embedding model to split text at actual topic boundaries, rather than relying on syntactic markers like paragraphs or sentences. This produces chunks that are internally coherent and separated at natural meaning shifts.

import { semanticChunk } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.embedding('Xenova/bge-small-en-v1.5');

const chunks = await semanticChunk({
  text: longDocument,
  model,
  size: 1000,       // Max chunk size (default: 2000)
  minSize: 100,     // Min chunk size (default: 100)
  threshold: 0.4,   // Optional: explicit similarity threshold
});

chunks.forEach((c) => {
  console.log(`Chunk ${c.index}: ${c.text.substring(0, 60)}...`);
  console.log(`  Boundaries:`, c.metadata?.semanticBoundaries);
});

How It Works

  1. Pre-split — Text is split into sentence-level segments using the recursive chunker
  2. Embed — Each segment is embedded using the provided model (single embedMany() call)
  3. Compare — Cosine similarity is computed between adjacent segment embeddings
  4. Detect — Breakpoints are identified where similarity drops below the threshold
  5. Merge — Segments between breakpoints are merged into final chunks with size constraints
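The core of steps 3–4 can be sketched in a few lines. This is an illustration of the technique, not the library's internals:

```typescript
// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Breakpoints are segment indices where similarity to the previous
// segment drops below the threshold — a new chunk starts there.
function findBreakpoints(embeddings: number[][], threshold: number): number[] {
  const breakpoints: number[] = [];
  for (let i = 0; i < embeddings.length - 1; i++) {
    if (cosine(embeddings[i], embeddings[i + 1]) < threshold) {
      breakpoints.push(i + 1);
    }
  }
  return breakpoints;
}
```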

Auto-Threshold Detection

When no threshold is specified, the algorithm computes it automatically from the similarity distribution using mean - standard deviation. This adapts to any embedding model's similarity range.

// Auto-threshold (recommended for most use cases)
const chunks = await semanticChunk({ text, model });

// Explicit threshold (when you know your model's similarity range)
const chunks = await semanticChunk({ text, model, threshold: 0.5 });

  • threshold: 0 produces a single chunk (no breakpoints)
  • threshold: 1.0 produces maximum splitting (nearly every boundary is a break)
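The auto-threshold rule described above is simple enough to sketch directly, assuming a plain mean-minus-one-standard-deviation over the adjacent-segment similarities:

```typescript
// Auto-threshold sketch: mean of the similarity distribution minus
// one (population) standard deviation. Segments whose similarity to
// their neighbor falls below this become breakpoints.
function autoThreshold(similarities: number[]): number {
  const mean = similarities.reduce((s, x) => s + x, 0) / similarities.length;
  const variance =
    similarities.reduce((s, x) => s + (x - mean) ** 2, 0) / similarities.length;
  return mean - Math.sqrt(variance);
}
```

Because the threshold is derived from the observed distribution, it works whether a model's similarities cluster around 0.9 or around 0.3.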

SemanticChunkOptions


Boundary Metadata

Each chunk includes semantic boundary scores in metadata.semanticBoundaries:

interface SemanticBoundaries {
  leftSimilarity: number | null;  // null for first chunk
  rightSimilarity: number | null; // null for last chunk
}

Lower similarity at a boundary means a stronger topic shift. Use this for debugging chunk quality or implementing adaptive retrieval strategies.
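As a sketch of one such strategy, the hypothetical helper below flags chunks whose boundaries are "soft" (high similarity to a neighbor) — candidates for merging, or for retrieving together with their neighbor. Both the helper name and the 0.6 cutoff are illustrative assumptions:

```typescript
interface SemanticBoundaries {
  leftSimilarity: number | null;  // null for first chunk
  rightSimilarity: number | null; // null for last chunk
}

// Hypothetical helper: return indices of chunks with at least one
// "soft" boundary, i.e. a neighbor similarity above the cutoff.
function softBoundaryChunks<
  T extends { metadata?: { semanticBoundaries?: SemanticBoundaries } },
>(chunks: T[], cutoff = 0.6): number[] {
  const soft: number[] = [];
  chunks.forEach((c, i) => {
    const b = c.metadata?.semanticBoundaries;
    if (!b) return;
    const sims = [b.leftSimilarity, b.rightSimilarity].filter(
      (s): s is number => s !== null,
    );
    if (sims.some((s) => s > cutoff)) soft.push(i);
  });
  return soft;
}
```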

Reusable Chunker Factory

import { createSemanticChunker } from '@localmode/core';

const chunker = createSemanticChunker({
  model,
  threshold: 0.4,
  size: 1000,
});

const chunks1 = await chunker('First document...');
const chunks2 = await chunker('Second document...', { threshold: 0.6 });

Pipeline Integration

Use pipelineSemanticChunkStep() in pipeline workflows:

import { createPipeline, pipelineSemanticChunkStep, pipelineEmbedManyStep } from '@localmode/core';

const pipeline = createPipeline('semantic-rag')
  .step('chunk', pipelineSemanticChunkStep(embeddingModel, { size: 1000 }))
  .step('embed', pipelineEmbedManyStep(embeddingModel))
  .build();

const { result } = await pipeline.run(documentText);

React Hook

Use useSemanticChunk from @localmode/react for component-level semantic chunking with built-in loading, error, and cancellation state:

import { useSemanticChunk } from '@localmode/react';
import { transformers } from '@localmode/transformers';

const model = transformers.embedding('Xenova/bge-small-en-v1.5');

function SemanticChunker() {
  const { data: chunks, isLoading, error, execute, cancel } = useSemanticChunk({
    model,
    threshold: 0.4,
    size: 1000,
  });

  return (
    <div>
      <button onClick={() => execute('Long document text...')} disabled={isLoading}>
        Chunk
      </button>
      {isLoading && <button onClick={cancel}>Cancel</button>}
      {chunks?.map((c) => <p key={c.index}>{c.text.substring(0, 80)}...</p>)}
    </div>
  );
}

Semantic vs. Recursive Chunking

| Aspect | Recursive | Semantic |
| --- | --- | --- |
| Speed | Fast (synchronous) | Slower (requires embedding) |
| Topic awareness | None (syntactic only) | High (embedding-based) |
| Dependencies | None | Requires EmbeddingModel |
| Best for | General documents, code, markdown | Long documents with topic shifts |
| API | chunk(text, options) | semanticChunk(options) |

Semantic chunking is async because it requires embedding each segment. For a 5,000-word document (~200 sentences at 384 dimensions), this involves a single embedMany() call. Use abortSignal for cancellation on long documents.

Ingestion

Ingest documents into a vector database:

import { createVectorDB, ingest } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.embedding('Xenova/bge-small-en-v1.5');
const db = await createVectorDB({ name: 'docs', dimensions: 384 });

await ingest({
  db,
  model,
  documents: [
    { text: 'First document...', metadata: { source: 'doc1.txt' } },
    { text: 'Second document...', metadata: { source: 'doc2.txt' } },
  ],
});

With Automatic Chunking

await ingest({
  db,
  model,
  documents: [{ text: longDocument, metadata: { source: 'book.txt' } }],
  chunkOptions: {
    strategy: 'recursive',
    size: 512,
    overlap: 50,
  },
});

With Progress Tracking

await ingest({
  db,
  model,
  documents: largeDocumentArray,
  onProgress: (progress) => {
    console.log(`Ingested ${progress.completed}/${progress.total} documents`);
  },
});

Semantic Search

Search for relevant chunks:

import { semanticSearch } from '@localmode/core';

const results = await semanticSearch({
  db,
  model,
  query: 'What are the benefits of machine learning?',
  k: 5,
});

results.forEach((r) => {
  console.log(`Score: ${r.score.toFixed(3)}`);
  console.log(`Text: ${r.metadata.text}`);
});

Reranking

Improve results with cross-encoder reranking:

import { rerank } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const rerankerModel = transformers.reranker('Xenova/ms-marco-MiniLM-L-6-v2');

// Get initial results
const results = await semanticSearch({ db, model, query, k: 20 });

// Rerank for better accuracy
const { results: reranked } = await rerank({
  model: rerankerModel,
  query,
  documents: results.map((r) => r.metadata.text as string),
  topK: 5,
});

reranked.forEach((r) => {
  console.log(`Score: ${r.score.toFixed(3)}`);
  console.log(`Text: ${r.text.substring(0, 100)}...`);
});

When to Use Reranking

Reranking improves accuracy but adds latency. Use it when:

  • Accuracy is more important than speed
  • You're building a Q&A system
  • Initial results may have false positives

Keyword Search (BM25)

For exact keyword matching:

import { createBM25 } from '@localmode/core';

const bm25 = createBM25(documents.map((d) => d.text));

const keywordResults = bm25.search('machine learning');

keywordResults.forEach((r) => {
  console.log(`Score: ${r.score.toFixed(3)}, Index: ${r.index}`);
});

Hybrid Search

Combine semantic and keyword search:

import { semanticSearch, createBM25, hybridFuse } from '@localmode/core';

// Semantic search
const semanticResults = await semanticSearch({ db, model, query, k: 20 });

// BM25 keyword search
const bm25 = createBM25(documents.map((d) => d.text));
const keywordResults = bm25.search(query);

// Combine with fusion
const hybridResults = hybridFuse({
  semantic: semanticResults.map((r) => ({
    id: r.id,
    score: r.score,
  })),
  keyword: keywordResults.map((r) => ({
    id: documents[r.index].id,
    score: r.score,
  })),
  k: 10,
  alpha: 0.7, // Weight for semantic (0.7 = 70% semantic, 30% keyword)
});
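A common way this kind of alpha-weighted fusion works is min–max normalizing each score list so the two modalities are comparable, then blending. The sketch below illustrates the idea; hybridFuse's exact formula may differ:

```typescript
type Scored = { id: string; score: number };

// Min–max normalize scores to [0, 1] so semantic and keyword
// scores live on the same scale before blending.
function normalize(results: Scored[]): Map<string, number> {
  const scores = results.map((r) => r.score);
  const min = Math.min(...scores);
  const range = Math.max(...scores) - min || 1;
  return new Map(
    results.map((r) => [r.id, (r.score - min) / range] as [string, number]),
  );
}

// Blend: alpha * semantic + (1 - alpha) * keyword, sorted descending.
function weightedFuse(
  semantic: Scored[],
  keyword: Scored[],
  alpha: number,
  k: number,
): Scored[] {
  const s = normalize(semantic);
  const kw = normalize(keyword);
  const ids = new Set([...s.keys(), ...kw.keys()]);
  return [...ids]
    .map((id) => ({
      id,
      score: alpha * (s.get(id) ?? 0) + (1 - alpha) * (kw.get(id) ?? 0),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```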

Reciprocal Rank Fusion

Alternative fusion method:

import { reciprocalRankFusion } from '@localmode/core';

const fused = reciprocalRankFusion({
  rankings: [semanticResults.map((r) => r.id), keywordResults.map((r) => documents[r.index].id)],
  k: 10,
  constant: 60, // RRF constant (default: 60)
});
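RRF itself is a short algorithm: each ranked list contributes 1 / (constant + rank) for every id it contains, and contributions are summed across lists. A sketch, assuming 1-based ranks (the library's convention may differ):

```typescript
// Reciprocal Rank Fusion: score(id) = Σ over lists of 1 / (constant + rank).
// Items ranked highly by multiple lists accumulate the largest scores.
function rrf(
  rankings: string[][],
  k: number,
  constant = 60,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      const rank = i + 1; // 1-based rank
      scores.set(id, (scores.get(id) ?? 0) + 1 / (constant + rank));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

Because RRF uses only ranks, not raw scores, it needs no normalization — a useful property when fusing BM25 scores with cosine similarities.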

Complete RAG Pipeline

Here's a complete example:

import { createVectorDB, chunk, ingest, semanticSearch, rerank, streamText } from '@localmode/core';
import { transformers } from '@localmode/transformers';
import { webllm } from '@localmode/webllm';

// 1. Setup models
const embeddingModel = transformers.embedding('Xenova/bge-small-en-v1.5');
const rerankerModel = transformers.reranker('Xenova/ms-marco-MiniLM-L-6-v2');
const llm = webllm.languageModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');

// 2. Create database
const db = await createVectorDB({ name: 'knowledge-base', dimensions: 384 });

// 3. Ingest documents
async function ingestDocuments(documents: Array<{ text: string; source: string }>) {
  for (const doc of documents) {
    const chunks = chunk(doc.text, {
      strategy: 'recursive',
      size: 512,
      overlap: 50,
    });

    await ingest({
      db,
      model: embeddingModel,
      documents: chunks.map((c) => ({
        text: c.text,
        metadata: {
          source: doc.source,
          start: c.start,
          end: c.end,
        },
      })),
    });
  }
}

// 4. Query function
async function query(question: string) {
  // Retrieve
  const results = await semanticSearch({
    db,
    model: embeddingModel,
    query: question,
    k: 10,
  });

  // Rerank
  const { results: reranked } = await rerank({
    model: rerankerModel,
    query: question,
    documents: results.map((r) => r.metadata.text as string),
    topK: 3,
  });

  // Generate
  const context = reranked.map((r) => r.text).join('\n\n---\n\n');

  const stream = await streamText({
    model: llm,
    prompt: `You are a helpful assistant. Answer based only on the context provided.
If the answer is not in the context, say "I don't have that information."

Context:
${context}

Question: ${question}

Answer:`,
  });

  return stream;
}

// Usage
const stream = await query('What is machine learning?');
for await (const chunk of stream) {
  process.stdout.write(chunk.text);
}

Composable Pipelines

For a builder API that chains these steps with progress tracking and cancellation, see the Pipelines guide. Pre-built step factories like pipelineEmbedStep, pipelineSearchStep, and pipelineRerankStep let you assemble RAG workflows declaratively.

Document Loaders

Load documents from various formats:

import { TextLoader, JSONLoader, CSVLoader, HTMLLoader } from '@localmode/core';
import { PDFLoader } from '@localmode/pdfjs';

// Text files
const textLoader = new TextLoader();
const { documents: textDocs } = await textLoader.load(textBlob);

// JSON
const jsonLoader = new JSONLoader({ textField: 'content' });
const { documents: jsonDocs } = await jsonLoader.load(jsonBlob);

// CSV
const csvLoader = new CSVLoader({ textColumn: 'description' });
const { documents: csvDocs } = await csvLoader.load(csvBlob);

// HTML
const htmlLoader = new HTMLLoader({ selector: 'article' });
const { documents: htmlDocs } = await htmlLoader.load(htmlBlob);

// PDF
const pdfLoader = new PDFLoader({ splitByPage: true });
const { documents: pdfDocs } = await pdfLoader.load(pdfBlob);

Streaming Search

Stream results progressively for real-time UI updates:

import { streamSemanticSearch } from '@localmode/core';

for await (const result of streamSemanticSearch({
  db,
  model,
  query: 'authentication',
  k: 100,
})) {
  // Render each result as it arrives
  console.log(result.id, result.score);
}

streamSemanticSearch() accepts the same options as semanticSearch() but returns an AsyncGenerator that yields results one at a time.

Chunker Factories

Create reusable, pre-configured chunkers:

import {
  createChunker,
  createRecursiveChunker,
  createMarkdownChunker,
  createCodeChunker,
} from '@localmode/core';

| Factory | Description |
| --- | --- |
| createChunker(options) | Generic chunker with configurable strategy |
| createRecursiveChunker(options) | Recursive text splitting with custom separators |
| createMarkdownChunker(options) | Markdown-aware chunking that respects headings and structure |
| createCodeChunker(options) | Code-aware chunking that respects functions and blocks |

// Create a reusable markdown chunker
const chunker = createMarkdownChunker({ size: 500, overlap: 50 });

// Use it on multiple documents
const chunks1 = chunker('# Document 1\n\nContent here...');
const chunks2 = chunker('# Document 2\n\nMore content...');

Chunk Statistics

import { getChunkStats, estimateChunkCount } from '@localmode/core';

// Estimate how many chunks a text will produce
const count = estimateChunkCount(longText, { strategy: 'recursive', size: 500 });
console.log(`Will produce ~${count} chunks`);

// Get statistics from existing chunks
const stats = getChunkStats(chunks);
// { count, totalLength, avgLength, minLength, maxLength }

Batch Processing & Ingestion Pipeline

For large-scale document processing, use the ingestion utilities:

chunkDocuments

Chunk multiple source documents at once:

import { chunkDocuments } from '@localmode/core';

const chunks = chunkDocuments(documents, {
  chunking: { strategy: 'recursive', size: 500, overlap: 50 },
  idPrefix: 'doc',
});
// Returns array of { id, text, sourceDocId, chunkIndex, start, end, metadata }

ingest

Full ingestion pipeline — chunk, embed, and store in one call:

import { ingest } from '@localmode/core';

const result = await ingest(db, documents, {
  chunking: { strategy: 'recursive', size: 500 },
  generateEmbeddings: true,
  embedder: async (texts) => {
    const { embeddings } = await embedMany({ model, values: texts });
    return embeddings;
  },
  batchSize: 50,
  onProgress: (progress) => {
    console.log(`${progress.phase}: ${progress.chunksProcessed}/${progress.totalChunks}`);
  },
});

createIngestPipeline

Create a reusable pipeline with default options:

import { createIngestPipeline } from '@localmode/core';

const pipeline = createIngestPipeline(db, {
  chunking: { strategy: 'recursive', size: 500 },
  generateEmbeddings: true,
  embedder: myEmbedder,
});

// Ingest multiple batches with same config
await pipeline(batch1);
await pipeline(batch2);

estimateIngestion

Estimate ingestion stats before processing:

import { estimateIngestion } from '@localmode/core';

const estimate = estimateIngestion(documents, {
  chunking: { strategy: 'recursive', size: 500 },
});

console.log(`Documents: ${estimate.totalDocuments}`);
console.log(`Estimated chunks: ${estimate.estimatedChunks}`);
console.log(`Avg chunk size: ${estimate.avgChunkSize} chars`);

Best Practices

RAG Best Practices

  1. Chunk size - 256-512 chars works well for most cases
  2. Overlap - 10-20% overlap helps maintain context
  3. Reranking - Always rerank for Q&A applications
  4. Hybrid search - Combine semantic + keyword for robust results
  5. Context window - Don't exceed LLM's context limit
  6. Use estimateIngestion - Check chunk counts before large ingestions
  7. Streaming search - Use streamSemanticSearch for progressive UI rendering

Next Steps

Showcase Apps

| App | Description | Links |
| --- | --- | --- |
| Semantic Search | Recursive and semantic chunking for document search | Demo · Source |
| PDF Search | Semantic chunking of PDF content for Q&A | Demo · Source |
| LangChain RAG | End-to-end RAG with chunking and retrieval | Demo · Source |
