From CSV to Semantic Search in 60 Seconds: Loading Any Document Into a Local RAG Pipeline
Load CSV, JSON, HTML, or PDF files into a fully local semantic search pipeline with just a few lines of code. LocalMode's built-in document loaders, the ingest() shortcut, and semanticSearch() get you from raw data to AI-powered search in under a minute - no server, no API key, no data leaving the device.
You have data. Maybe it is a CSV export from your CRM, a JSON dump from an API, a folder of saved HTML pages, or a stack of PDF invoices. You want to search it by meaning, not by keywords. You want a user to type "budget concerns" and find the row about "Q3 financial projections."
The gap between "I have a file" and "I have semantic search" feels large. It is not. With LocalMode, the entire path - load, chunk, embed, search - fits in a single code block and runs entirely in the browser. No server. No API key. No data leaves the device.
This post shows the shortest path for four common formats, then introduces the ingest() shortcut that collapses the whole pipeline into one function call.
The Pipeline at a Glance
Every format follows the same four steps:
File --> Load --> Chunk + Embed --> VectorDB --> Search
       (loader)     (ingest)       (IndexedDB)   (semanticSearch)

LocalMode provides a built-in loader for each format, an ingest() function that handles chunking and embedding in one call, and a semanticSearch() function that embeds a query and searches the database in a single step.
Here is the shared setup that every example below uses:
import {
  createVectorDB,
  ingest,
  semanticSearch,
  embedMany,
} from '@localmode/core';
import { transformers } from '@localmode/transformers';
// 1. Create a vector database (stored in IndexedDB)
const db = await createVectorDB({ name: 'my-docs', dimensions: 384 });
// 2. Create an embedding model (33MB, downloads once, cached forever)
const model = transformers.embedding('Xenova/bge-small-en-v1.5');
// 3. Build an embedder function for ingest()
const embedder = async (texts: string[]) => {
const { embeddings } = await embedMany({ model, values: texts });
return embeddings;
};

That is the foundation. Now drop in any file.
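If you want to exercise the pipeline in tests without downloading a model, anything that satisfies the same (texts: string[]) => Promise<number[][]> contract will do. Here is a deterministic stub embedder - purely illustrative, not part of LocalMode:

```typescript
// Hypothetical stand-in for a real embedding model: maps each text to a
// fixed-length vector derived from character codes, then L2-normalizes it.
// Deterministic, so the same text always maps to the same vector.
function stubEmbed(text: string, dimensions = 384): number[] {
  const vec = new Array(dimensions).fill(0);
  for (let i = 0; i < text.length; i++) {
    vec[i % dimensions] += text.charCodeAt(i);
  }
  const norm = Math.sqrt(vec.reduce((s, v) => s + v * v, 0)) || 1;
  return vec.map((v) => v / norm);
}

// Same shape as the embedder built above: string[] in, number[][] out.
const stubEmbedder = async (texts: string[]): Promise<number[][]> =>
  texts.map((t) => stubEmbed(t));
```

Swap it in for the real embedder during unit tests and the rest of the pipeline code stays unchanged.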
1. CSV: Rows to Search Results
A CSV with a description column - product catalog, support tickets, survey responses, anything tabular.
import { CSVLoader } from '@localmode/core';
// Load the CSV (auto-detects 'description' column, or specify one)
const loader = new CSVLoader();
const docs = await loader.load(csvString, { textColumn: 'description' });
// Ingest: chunk + embed + store in one call
await ingest(db, docs, {
chunking: { strategy: 'recursive', size: 500 },
generateEmbeddings: true,
embedder,
});
// Search
const { results } = await semanticSearch({
db, model, query: 'budget concerns', k: 5,
});

The CSVLoader turns each row into a document. Every column value is preserved as metadata, so you can filter results later (filter: { category: 'finance' }). If you do not specify textColumn, it looks for columns named text, content, body, or description automatically. You can also combine multiple columns:
const docs = await loader.load(csvFile, {
textColumns: ['title', 'description'],
columnSeparator: ' - ',
});

2. JSON: Extract the Fields That Matter
API responses, exported records, structured datasets - JSON comes in many shapes. The JSONLoader handles objects, arrays, and nested paths.
import { JSONLoader } from '@localmode/core';
const loader = new JSONLoader();
const docs = await loader.load(jsonString, {
recordsPath: 'data.articles', // navigate to the array
textFields: ['title', 'body'], // combine these fields
fieldSeparator: '\n',
});
await ingest(db, docs, {
chunking: { strategy: 'recursive', size: 500 },
generateEmbeddings: true,
embedder,
});
const { results } = await semanticSearch({
db, model, query: 'machine learning trends', k: 5,
});

recordsPath uses dot notation to reach nested arrays - data.items, response.results, or just omit it for top-level arrays. If you skip textFields, the loader searches for common field names (text, content, body, description) automatically. Need everything? Set extractAllStrings: true and it recursively pulls every string value from each record.
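Conceptually, both options are a few lines of recursion. This sketch shows how dot-notation path resolution and recursive string extraction might work - illustrative only, not JSONLoader's actual implementation:

```typescript
// Walk a dot-separated path like 'data.articles' into a nested object.
function resolvePath(obj: unknown, path: string): unknown {
  return path.split('.').reduce<unknown>(
    (cur, key) =>
      cur && typeof cur === 'object' ? (cur as Record<string, unknown>)[key] : undefined,
    obj,
  );
}

// Recursively collect every string value from a record, depth-first.
function extractAllStrings(value: unknown, out: string[] = []): string[] {
  if (typeof value === 'string') out.push(value);
  else if (Array.isArray(value)) value.forEach((v) => extractAllStrings(v, out));
  else if (value && typeof value === 'object') {
    Object.values(value).forEach((v) => extractAllStrings(v, out));
  }
  return out;
}
```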
3. HTML: Strip the Tags, Keep the Meaning
Saved web pages, documentation exports, scraped content - HTML is noisy. The HTMLLoader strips scripts, styles, navigation, and tags, leaving clean text.
import { HTMLLoader } from '@localmode/core';
const loader = new HTMLLoader();
const docs = await loader.load(htmlString, {
selector: 'article.content', // target specific content
preserveFormatting: true, // keep paragraph breaks
extractMetadata: true, // pull title, description, author
});
await ingest(db, docs, {
chunking: { strategy: 'recursive', size: 500 },
generateEmbeddings: true,
embedder,
});
const { results } = await semanticSearch({
db, model, query: 'authentication setup', k: 5,
});

The selector option accepts CSS selectors - #main, .post-body, article - so you can skip headers, footers, and sidebars. Without a selector, it extracts text from the entire <body>. In browser environments it uses DOMParser for accurate extraction; in other environments it falls back to a regex-based parser that handles the common cases.
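A regex fallback of that kind can be approximated in a handful of replacements - a rough sketch under the same assumptions, not the library's actual parser, and no substitute for DOMParser on adversarial markup:

```typescript
// Rough HTML-to-text fallback: drop script/style blocks, replace tags with
// whitespace, decode a few common entities, then collapse runs of spaces.
function htmlToText(html: string): string {
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, ' ') // remove code/style blocks
    .replace(/<[^>]+>/g, ' ')                        // strip remaining tags
    .replace(/&amp;/g, '&')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&nbsp;/g, ' ')
    .replace(/\s+/g, ' ')                            // collapse whitespace
    .trim();
}
```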
4. PDF: Pages to Searchable Chunks
PDFs require the @localmode/pdfjs package, which wraps PDF.js for browser-native extraction.
import { extractPDFText } from '@localmode/pdfjs';
// Extract text from a PDF file (File, Blob, ArrayBuffer, or URL)
const { text, pageCount } = await extractPDFText(pdfFile);
// Turn extracted text into a source document
const docs = [{ id: 'report-q3', text, metadata: { source: pdfFile.name } }];
// Ingest with recursive chunking plus overlap (overlap preserves context across page breaks)
await ingest(db, docs, {
chunking: { strategy: 'recursive', size: 500, overlap: 50 },
generateEmbeddings: true,
embedder,
});
const { results } = await semanticSearch({
db, model, query: 'revenue growth Q3', k: 5,
});

extractPDFText returns page-level content too, so you can ingest each page as a separate document if you prefer finer-grained results:
const { pages } = await extractPDFText(pdfFile, { includePageNumbers: true });
const docs = pages.map((page) => ({
id: `page-${page.pageNumber}`,
text: page.text,
metadata: { page: page.pageNumber, source: pdfFile.name },
}));

The Auto-Detect Shortcut
Do not know the file type in advance? loadDocument() auto-detects the format and routes to the right loader:
import { loadDocument } from '@localmode/core';
// Auto-detect: CSV, JSON, HTML, or plain text
const docs = await loadDocument(file);
// Then ingest as usual
await ingest(db, docs, {
chunking: { strategy: 'recursive', size: 500 },
generateEmbeddings: true,
embedder,
});

It checks the File object's name and MIME type, or inspects the string content for CSV commas, JSON braces, or HTML tags. For multiple files at once, use loadDocuments():
import { loadDocuments } from '@localmode/core';
const docs = await loadDocuments([file1, file2, file3]);

Drag-and-Drop File Handling
Most real applications let users drop files into the browser. Here is the pattern that connects a drop event to the full pipeline:
async function handleDrop(event: React.DragEvent) {
event.preventDefault();
const files = Array.from(event.dataTransfer.files);
for (const file of files) {
let docs;
if (file.name.endsWith('.pdf')) {
const { text } = await extractPDFText(file);
docs = [{ id: file.name, text, metadata: { source: file.name } }];
} else {
docs = await loadDocument(file);
}
await ingest(db, docs, {
chunking: { strategy: 'recursive', size: 500 },
generateEmbeddings: true,
embedder,
onProgress: (p) => {
console.log(`${p.phase}: ${p.chunksProcessed}/${p.totalChunks}`);
},
});
}
}

The onProgress callback reports four phases - chunking, embedding, indexing, complete - so you can drive a progress bar. Each phase reports chunksProcessed and totalChunks.
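The format detection that loadDocument() performs can be pictured with a sketch like the following - a simplified stand-in for the library's heuristics, not its real code:

```typescript
type DocFormat = 'csv' | 'json' | 'html' | 'text';

// Guess a format from the filename extension first, then sniff the content:
// JSON braces/brackets, HTML tags, CSV-style delimited lines, else plain text.
function detectFormat(content: string, filename?: string): DocFormat {
  const ext = filename?.split('.').pop()?.toLowerCase();
  if (ext === 'csv') return 'csv';
  if (ext === 'json') return 'json';
  if (ext === 'html' || ext === 'htm') return 'html';

  const trimmed = content.trim();
  if (/^[\[{]/.test(trimmed)) {
    try { JSON.parse(trimmed); return 'json'; } catch { /* not valid JSON */ }
  }
  if (/^<!doctype html|^<html|<\/[a-z]+>/i.test(trimmed)) return 'html';
  const lines = trimmed.split('\n').slice(0, 3);
  if (lines.length > 1 && lines.every((l) => l.includes(','))) return 'csv';
  return 'text';
}
```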
The ingest() Function in Detail
ingest() is the workhorse that makes everything above concise. Here is what it does under the hood:
- Chunk - Splits every document using your chosen strategy (recursive, markdown, code, sentence, paragraph). Each chunk gets metadata linking it back to its source document.
- Embed - Calls your embedder function in batches. You control the batch size, or set adaptiveBatching: true to let LocalMode pick an optimal size based on the device's capabilities.
- Index - Stores vectors and metadata in the VectorDB via addMany().
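The chunking step is easiest to picture with a simplified sliding-window splitter - fixed size with overlap, each chunk carrying source metadata. This is an illustration of the idea, not ingest()'s actual recursive strategy, which splits on separators first:

```typescript
interface Chunk {
  text: string;
  metadata: { docId: string; index: number };
}

// Simplified sliding-window chunker: fixed chunk size with character overlap,
// tagging each chunk with its source document id (as ingest() does).
function chunkText(docId: string, text: string, size: number, overlap = 0): Chunk[] {
  const step = Math.max(1, size - overlap);
  const chunks: Chunk[] = [];
  for (let start = 0, i = 0; start < text.length; start += step, i++) {
    chunks.push({ text: text.slice(start, start + size), metadata: { docId, index: i } });
  }
  return chunks;
}
```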
const result = await ingest(db, docs, {
chunking: { strategy: 'markdown', size: 1000, overlap: 100 },
generateEmbeddings: true,
embedder,
adaptiveBatching: true,
onProgress: (p) => updateUI(p),
});
console.log(result.documentsProcessed); // 42
console.log(result.chunksCreated); // 318
console.log(result.duration); // 2450 (ms)

If you want to run the same configuration on multiple batches, createIngestPipeline() gives you a reusable function:
import { createIngestPipeline } from '@localmode/core';
const ingestDocs = createIngestPipeline(db, {
chunking: { strategy: 'recursive', size: 500 },
generateEmbeddings: true,
embedder,
});
// Now ingest anything with one call
await ingestDocs(csvDocs);
await ingestDocs(jsonDocs);
await ingestDocs(pdfDocs);

Searching With semanticSearch()
Once data is ingested, semanticSearch() wraps embed() + db.search() into a single call:
const { results, usage } = await semanticSearch({
db,
model,
query: 'quarterly revenue growth',
k: 10,
filter: { source: 'report-q3.pdf' },
});
for (const r of results) {
console.log(`${r.score.toFixed(3)} - ${r.text?.substring(0, 80)}...`);
}
console.log(`Embed: ${usage.embedDurationMs}ms, Search: ${usage.searchDurationMs}ms`);

The filter option narrows results by metadata - any field your loader or ingest step attached. The threshold option sets a minimum similarity score to exclude weak matches.
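Scores of this kind are typically cosine similarities. A brute-force top-k search with a threshold cut can be sketched in a few lines - illustrative of the ranking logic, not LocalMode's indexed implementation:

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Brute-force top-k with a minimum-score cutoff, mirroring k + threshold.
function topK(
  query: number[],
  items: { id: string; vector: number[] }[],
  k: number,
  threshold = 0,
): { id: string; score: number }[] {
  return items
    .map((it) => ({ id: it.id, score: cosine(query, it.vector) }))
    .filter((r) => r.score >= threshold)
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```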
Putting It All Together
Here is the complete, minimal path from "I have a CSV" to "I have semantic search" - about a dozen lines of working code:
import { CSVLoader, createVectorDB, ingest, semanticSearch, embedMany } from '@localmode/core';
import { transformers } from '@localmode/transformers';
const db = await createVectorDB({ name: 'products', dimensions: 384 });
const model = transformers.embedding('Xenova/bge-small-en-v1.5');
const embedder = async (texts: string[]) =>
(await embedMany({ model, values: texts })).embeddings;
const docs = await new CSVLoader().load(csvFile, { textColumn: 'description' });
await ingest(db, docs, {
chunking: { strategy: 'recursive', size: 500 },
generateEmbeddings: true,
embedder,
});
const { results } = await semanticSearch({ db, model, query: 'wireless noise canceling', k: 5 });

Swap CSVLoader for JSONLoader, HTMLLoader, or extractPDFText. The rest stays the same. That is the point - one pipeline shape handles any data you throw at it.
Try it live
The Semantic Search and PDF Search showcase apps implement everything in this post. Drop your own files in and search them - entirely in the browser, entirely offline after the first model download.
What Comes Next
Once you have data in a VectorDB, the entire LocalMode toolkit opens up:
- RAG chat - Pipe search results into a local LLM with generateText() for grounded answers
- Hybrid search - Combine semantic results with BM25 keyword search via buildBM25Index: true in ingest()
- Metadata filters - Narrow results by any field: date ranges, categories, sources
- Import/export - Move vectors between browsers or migrate from Pinecone/ChromaDB with importFrom() and exportToCSV()
- React hooks - useSemanticSearch(), useIngest(), and usePipeline() for reactive UI integration
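On hybrid search: combining a semantic ranking with a BM25 ranking is often done with reciprocal rank fusion. This is a sketch of that general technique, not necessarily the fusion LocalMode applies internally:

```typescript
// Reciprocal Rank Fusion: merge ranked id lists into a single score per id.
// An id near the top of either list gets a large contribution; k dampens
// the influence of rank differences (k = 60 is a conventional default).
function rrf(rankings: string[][], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```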
Every piece runs locally. Your users' data never touches a server. And it all starts with loading a file.