Build a Private RAG Chat Over Your Documents - No Backend Required
A complete tutorial for building a retrieval-augmented generation pipeline that runs entirely in the browser. Load PDFs, chunk text, generate embeddings with BGE-small, store vectors in IndexedDB, and answer questions with a local LLM - all without a server, API key, or any data leaving the device.
RAG - retrieval-augmented generation - is one of the most useful patterns in applied AI. Feed a language model the right context from your documents, and it can answer domain-specific questions without hallucinating. The catch is that traditional RAG requires a backend: a vector database service, an embedding API, an LLM endpoint, and network calls that send your private documents to third-party servers.
What if the entire pipeline ran in the browser tab?
This post walks through building a complete, private RAG chat - from PDF ingestion to streaming LLM answers - using only client-side JavaScript. No server, no API keys, no data ever leaving the device. Every code example uses real LocalMode APIs pulled directly from the codebase.
Working demo
The PDF Search showcase app implements everything in this tutorial. Open it, drop a PDF, and start asking questions - entirely offline after the initial model download.
What You'll Build
```
┌─────────────────────────────────────────────────────────────────┐
│                           Browser Tab                           │
│                                                                 │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐  │
│  │  PDF.js   │──▶│  Chunker  │──▶│ Embedding │──▶│ VectorDB  │  │
│  │ (extract) │   │(recursive)│   │(BGE-small)│   │(IndexedDB)│  │
│  └───────────┘   └───────────┘   └───────────┘   └─────┬─────┘  │
│                                                        │        │
│                                        ┌───────────────┘        │
│                                        ▼                        │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐                  │
│  │ Streaming │◀──│  Prompt   │◀──│ Semantic  │                  │
│  │    LLM    │   │ Assembly  │   │  Search   │                  │
│  └───────────┘   └───────────┘   └───────────┘                  │
│                                                                 │
│  Models: BGE-small-en-v1.5 (33M params, ~33MB quantized)        │
│          Llama 3.2 1B or Qwen 3 1.7B (712MB-1.1GB)              │
└─────────────────────────────────────────────────────────────────┘
```

The pipeline has six stages:
- Extract - Pull text from a PDF using PDF.js
- Chunk - Split text into overlapping segments with `chunk()`
- Embed - Generate 384-dimensional vectors with `embedMany()`
- Store - Index vectors in an HNSW graph backed by IndexedDB
- Search - Find relevant chunks with `semanticSearch()`
- Generate - Stream an answer from a local LLM with `streamText()`
Every stage runs in the same browser process. Let's build each one.
Step 1: Extract Text from a PDF
LocalMode's @localmode/pdfjs package wraps Mozilla's PDF.js to extract text from any PDF file - password-protected documents included.
```ts
import { extractPDFText } from '@localmode/pdfjs';

// Accept a File from an <input type="file">
async function loadPDF(file: File) {
  const arrayBuffer = await file.arrayBuffer();
  const result = await extractPDFText(arrayBuffer, {
    includePageNumbers: true,
  });
  console.log(`Extracted ${result.pageCount} pages, ${result.text.length} characters`);
  return result;
}
```

`extractPDFText` returns the full text, an array of per-page content, and PDF metadata (title, author, creation date). For large documents you can limit extraction with `maxPages` and cancel with an `AbortSignal`.
Step 2: Chunk the Text
Raw PDF text is too long to embed in one pass. Chunking splits it into overlapping segments that each fit within the embedding model's 512-token context window.
```ts
import { chunk } from '@localmode/core';

const chunks = chunk(result.text, {
  strategy: 'recursive',
  size: 512,
  overlap: 50,
});

console.log(`Created ${chunks.length} chunks`);
// Each chunk: { text, start, end, index }
```

The recursive strategy splits on paragraph breaks first, then sentences, then words - preserving natural boundaries. LocalMode also provides `markdown`, `code`, `sentence`, and `semantic` strategies. The semantic chunker uses embedding similarity to detect topic boundaries, producing higher-quality chunks at the cost of an extra embedding pass.
Choosing a chunk size
For BGE-small-en-v1.5 (512-token context), a chunk size of 400-512 characters works well. Overlap of 50 characters ensures context is not lost at boundaries. The PDF Search demo defaults to 512 with 50 overlap.
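To make the size/overlap mechanics concrete, here is a minimal sliding-window chunker. This is an illustrative sketch only - LocalMode's `chunk()` with the recursive strategy additionally prefers paragraph, sentence, and word boundaries over hard character cuts:

```typescript
// Illustrative sliding-window chunker: fixed size, fixed overlap.
// Each chunk records its character offsets in the source text, mirroring
// the { text, start, end, index } shape used above.
interface SimpleChunk {
  text: string;
  start: number; // character offset in the source text
  end: number;
  index: number;
}

function slidingWindowChunk(text: string, size = 512, overlap = 50): SimpleChunk[] {
  if (overlap >= size) throw new Error('overlap must be smaller than size');
  const chunks: SimpleChunk[] = [];
  const step = size - overlap; // advance by size minus overlap each iteration
  for (let start = 0, index = 0; start < text.length; start += step, index++) {
    const end = Math.min(start + size, text.length);
    chunks.push({ text: text.slice(start, end), start, end, index });
    if (end === text.length) break;
  }
  return chunks;
}
```

With `size: 512` and `overlap: 50`, adjacent chunks share 50 characters, so a sentence that straddles a boundary still appears whole in at least one chunk.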
Step 3: Generate Embeddings
Each chunk needs a vector representation. BGE-small-en-v1.5 is the recommended embedding model for browser RAG: 33.4 million parameters, 384 dimensions, and a 62.17 average score across 56 MTEB tasks - strong retrieval quality in a model that downloads in seconds.
```ts
import { embedMany } from '@localmode/core';
import { transformers } from '@localmode/transformers';

// Create the embedding model (downloaded and cached on first use)
const embeddingModel = transformers.embedding('Xenova/bge-small-en-v1.5', {
  quantized: true, // ~33MB download instead of ~127MB
});

// Embed all chunks in one call (auto-batched internally)
const { embeddings, usage } = await embedMany({
  model: embeddingModel,
  values: chunks.map((c) => c.text),
});

console.log(`Generated ${embeddings.length} embeddings, ${usage.tokens} tokens`);
// embeddings[0] is a Float32Array of length 384
```

`embedMany` automatically batches values up to the model's `maxEmbeddingsPerCall` limit (128 for Transformers.js) and retries on transient failures. For very large documents, `streamEmbedMany` yields results incrementally with a progress callback:
```ts
import { streamEmbedMany } from '@localmode/core';

for await (const { embedding, index } of streamEmbedMany({
  model: embeddingModel,
  values: chunks.map((c) => c.text),
  batchSize: 32,
  onBatch: ({ index, count, total }) => {
    console.log(`Embedded ${index + count}/${total}`);
  },
})) {
  // Process each embedding as it arrives
}
```

Step 4: Store Vectors in IndexedDB
LocalMode's createVectorDB builds an HNSW (Hierarchical Navigable Small World) index on top of IndexedDB. Vectors are persisted across page reloads - no re-embedding needed when the user returns.
```ts
import { createVectorDB } from '@localmode/core';

const db = await createVectorDB({
  name: 'rag-docs',
  dimensions: 384,
  compression: { type: 'sq8' }, // 4x storage reduction via scalar quantization
});

// Store all chunks with their embeddings and metadata
const documents = chunks.map((c, i) => ({
  id: `chunk-${i}`,
  vector: embeddings[i],
  metadata: {
    text: c.text,
    start: c.start,
    end: c.end,
    chunkIndex: c.index,
  },
}));

await db.addMany(documents);
console.log(`Indexed ${documents.length} vectors`);
```

Enabling `compression: { type: 'sq8' }` applies scalar quantization to stored vectors, reducing IndexedDB disk usage by approximately 4x with negligible impact on recall. For WebGPU-capable browsers, you can also pass `enableGPU: true` to accelerate distance computations during search.
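The idea behind sq8 can be sketched in a few lines. This is a simplified illustration of scalar quantization in general, not LocalMode's actual codec: each float32 component is mapped onto an 8-bit integer within the vector's min-max range, cutting storage from 4 bytes to 1 byte per dimension.

```typescript
// Simplified per-vector scalar quantization (sq8): store min and scale
// alongside a Uint8Array, then reconstruct approximate floats on read.
function quantizeSQ8(v: Float32Array): { min: number; scale: number; data: Uint8Array } {
  let min = Infinity;
  let max = -Infinity;
  for (const x of v) {
    if (x < min) min = x;
    if (x > max) max = x;
  }
  const scale = (max - min) / 255 || 1; // avoid division by zero for constant vectors
  const data = new Uint8Array(v.length);
  for (let i = 0; i < v.length; i++) data[i] = Math.round((v[i] - min) / scale);
  return { min, scale, data };
}

function dequantizeSQ8(q: { min: number; scale: number; data: Uint8Array }): Float32Array {
  const out = new Float32Array(q.data.length);
  for (let i = 0; i < q.data.length; i++) out[i] = q.min + q.data[i] * q.scale;
  return out;
}
```

The reconstruction error per component is at most half a quantization step, which is why recall barely moves for retrieval workloads.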
Step 5: Search for Relevant Chunks
When the user asks a question, semanticSearch embeds the query and searches the vector database in a single call:
```ts
import { semanticSearch } from '@localmode/core';

const { results, usage } = await semanticSearch({
  db,
  model: embeddingModel,
  query: 'What are the key findings of the report?',
  k: 5,
});

for (const result of results) {
  console.log(`[${result.score.toFixed(3)}] ${result.text?.substring(0, 80)}...`);
}

console.log(
  `Query embedded in ${usage.embedDurationMs.toFixed(0)}ms, ` +
  `searched in ${usage.searchDurationMs.toFixed(0)}ms`
);
```

`semanticSearch` returns results sorted by cosine similarity, each with the document's text extracted from metadata. For production use, you can add a `threshold` to filter low-confidence matches, or a `filter` object for metadata-based filtering (e.g., only search chunks from a specific document).
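For intuition about what the index computes: each result's `score` is the cosine similarity between the query vector and a stored vector, and exact search is just a scored linear scan. This is a textbook sketch, independent of LocalMode's HNSW implementation, which returns near-identical results without visiting every vector:

```typescript
// Cosine similarity: dot product divided by the product of magnitudes.
// Ranges from -1 (opposite direction) to 1 (identical direction).
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Exact top-k retrieval: score every stored vector and keep the k best.
// HNSW approximates this result in roughly logarithmic rather than linear time.
function bruteForceTopK(query: Float32Array, vectors: Float32Array[], k: number) {
  return vectors
    .map((vector, index) => ({ index, score: cosineSimilarity(query, vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

BGE embeddings are typically L2-normalized, in which case cosine similarity reduces to a plain dot product.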
For even better results, add a reranking pass. A cross-encoder reranker re-scores the top candidates with full query-document attention, catching nuances that embedding similarity misses:
```ts
import { rerank } from '@localmode/core';

const rerankerModel = transformers.reranker('Xenova/bge-reranker-base');

const reranked = await rerank({
  model: rerankerModel,
  query: 'What are the key findings?',
  documents: results.map((r) => r.text ?? ''),
  topK: 3,
});
// reranked.results: [{ index, score, document }]
```

Step 6: Generate an Answer with a Local LLM
Now feed the retrieved context to a language model running entirely in the browser. LocalMode supports three LLM providers - all driven by the same `streamText` function:
| Provider | Engine | Best For | Example Model |
|---|---|---|---|
| `@localmode/webllm` | WebGPU | Fastest inference, modern browsers | Llama 3.2 1B (712MB) |
| `@localmode/wllama` | WASM (llama.cpp) | Universal browser support, 135K+ GGUF models | Llama 3.2 1B Q4_K_M |
| `@localmode/transformers` | ONNX | Lightweight tasks, 15 curated models | Qwen 2.5 0.5B |
Here is the complete RAG answer generation with streaming:
```ts
import { streamText } from '@localmode/core';
import { webllm } from '@localmode/webllm';

// Create the LLM (downloaded and cached on first use)
const llm = webllm.languageModel('Qwen3-1.7B-q4f16_1-MLC');

// Build the prompt with retrieved context
const context = results
  .map((r, i) => `[${i + 1}] ${r.text}`)
  .join('\n\n');

const result = await streamText({
  model: llm,
  systemPrompt:
    "You are a helpful assistant. Answer the user's question based only on the " +
    'provided context. If the context does not contain the answer, say so.',
  prompt: `Context:\n${context}\n\nQuestion: What are the key findings of the report?`,
  maxTokens: 500,
  temperature: 0.3,
});

// Stream the response token by token into the page
// (e.g. a <div id="answer"> element - there is no process.stdout in a browser)
const answerEl = document.getElementById('answer')!;
for await (const chunk of result.stream) {
  answerEl.textContent += chunk.text;
}

const fullText = await result.text;
const usage = await result.usage;
console.log(`Generated ${usage.outputTokens} tokens in ${usage.durationMs.toFixed(0)}ms`);
```

For browsers without WebGPU (Firefox, older Safari), swap in the `wllama` provider for universal WASM support - no code changes beyond the import:
```ts
import { wllama } from '@localmode/wllama';

const llm = wllama.languageModel(
  'bartowski/Llama-3.2-1B-Instruct-GGUF:Llama-3.2-1B-Instruct-Q4_K_M.gguf'
);
// Same streamText() call works identically
```

The Pipeline Builder Approach
The step-by-step approach above gives you full control. For a more declarative style, LocalMode's pipeline builder chains steps into a single executable unit:
```ts
import {
  createPipeline,
  pipelineEmbedStep,
  pipelineSearchStep,
} from '@localmode/core';

const ragPipeline = createPipeline('rag-query')
  .step('embed-query', pipelineEmbedStep(embeddingModel))
  .step('search', pipelineSearchStep(db, { k: 5 }))
  .build();

const controller = new AbortController();
const userQuery = 'What are the key findings of the report?';

const { result } = await ragPipeline.run(userQuery, {
  abortSignal: controller.signal,
  onProgress: (p) => console.log(`${p.currentStep}: ${p.completed}/${p.total}`),
});
```

For the ingestion side, the `ingest` function handles chunking, embedding, and storage in a single call with progress tracking:
```ts
import { ingest } from '@localmode/core';

const result = await ingest(db, [{ text: pdfText, id: 'report.pdf' }], {
  chunking: { strategy: 'recursive', size: 512, overlap: 50 },
  generateEmbeddings: true,
  embedder: async (texts) => {
    const { embeddings } = await embedMany({ model: embeddingModel, values: texts });
    return embeddings;
  },
  onProgress: (p) => console.log(`${p.phase}: ${p.chunksProcessed}/${p.totalChunks}`),
});

console.log(`Ingested ${result.documentsProcessed} docs, ${result.chunksCreated} chunks`);
```

Performance Notes
All measurements from Chrome 125 on an M2 MacBook Air (8GB RAM).
| Stage | Model | Download Size | Cold Start | Per-Operation |
|---|---|---|---|---|
| Embedding | BGE-small-en-v1.5 (quantized) | ~33MB | ~2s | ~15ms per chunk |
| Vector search | HNSW (IndexedDB) | - | - | <5ms for 10K vectors |
| LLM (WebGPU) | Qwen 3 1.7B q4f16 | ~1.1GB | ~8-12s | ~30 tokens/s |
| LLM (WebGPU) | Llama 3.2 1B q4f16 | ~712MB | ~5-8s | ~40 tokens/s |
| LLM (WASM) | Llama 3.2 1B Q4_K_M | ~712MB | ~6-10s | ~8-12 tokens/s |
Models are cached in the browser after the first download. Subsequent visits load from cache with near-instant startup for the embedding model and a few seconds for LLMs.
Memory usage: BGE-small-en-v1.5 uses approximately 80-120MB of RAM. A 1B-parameter LLM adds 700MB-1.2GB. Total peak memory for a full RAG pipeline is typically 1-2GB, well within the limits of modern laptops and recent mobile devices.
Privacy Guarantees
This architecture provides privacy properties that no cloud RAG solution can match:
- Zero network requests after model download - verified by checking the Network tab
- No telemetry - the `@localmode/core` package has zero dependencies and makes no outbound calls
- Data stays in IndexedDB - vectors and metadata never leave the browser's origin-scoped storage
- Optional encryption - `createVectorDB` supports AES-GCM encryption via the Web Crypto API
- PII redaction - built-in `piiRedactionMiddleware` can scrub sensitive text before embedding
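As an illustration of what scrubbing before embedding looks like, here is a hypothetical regex-based redactor. The pattern list and `redactPII` name are this sketch's own, not the API of `piiRedactionMiddleware`:

```typescript
// Hypothetical pre-embedding scrubber: replace common PII patterns
// with placeholder tokens before text is chunked and embedded.
const PII_PATTERNS: Array<[RegExp, string]> = [
  [/\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, '[EMAIL]'],                       // email addresses
  [/\b\d{3}-\d{2}-\d{4}\b/g, '[SSN]'],                               // US social security numbers
  [/\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b/g, '[PHONE]'], // US phone numbers
];

function redactPII(text: string): string {
  return PII_PATTERNS.reduce((t, [pattern, token]) => t.replace(pattern, token), text);
}
```

Because redaction happens before embedding, the placeholder tokens - not the raw PII - are what end up stored in IndexedDB and sent to the LLM prompt.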
For regulated industries (healthcare, legal, finance), browser-local RAG eliminates the compliance burden of sending documents to third-party APIs entirely.
Methodology
Model specifications and benchmark scores referenced in this post:
- BAAI/bge-small-en-v1.5 - HuggingFace model card: 33.4M parameters, 384 dimensions, 62.17 MTEB average (56 tasks), 51.68 retrieval average (15 tasks)
- Qwen3-1.7B - HuggingFace model card: 1.7B parameters (1.4B non-embedding), 32K native context, dual thinking/non-thinking modes
- Llama 3.2 1B - Meta AI blog: 1.24B parameters, 128K context, pruned and distilled from larger Llama models
- MTEB benchmark: Massive Text Embedding Benchmark across retrieval, classification, clustering, STS, and reranking tasks
- Xenova/bge-small-en-v1.5 - ONNX quantized: Transformers.js-compatible ONNX export used by `@localmode/transformers`
- Qwen3 announcement - Qwen blog: Architecture details and benchmark results for the Qwen 3 model family
Performance measurements (cold start, tokens/s, memory) are from internal testing on Chrome 125 / macOS with an Apple M2 chip. Results vary by device, browser, and concurrent workload.
Try it yourself
Visit localmode.ai to try 30+ AI demo apps running entirely in your browser. No sign-up, no API keys, no data leaves your device.
Read the Getting Started guide to add local AI to your application in under 5 minutes.