Does browser-based text-to-speech work offline?

Yes. After the initial 86MB model download, text-to-speech works completely offline. No server, no API key, and text data never leaves the device. Audio is generated at 24kHz sample rate.

How many voices are available for browser text-to-speech?

Kokoro provides 29 English voices covering both American English and British English accents. Speed can be adjusted from 0.5x to 2.0x, and streaming playback allows audio to begin playing before the full utterance is synthesized.

How does browser TTS cost compare to cloud services?

Google Cloud TTS costs $4-160 per million characters, Amazon Polly costs $4-100 per million characters, and ElevenLabs plans range from $6-990 per month. LocalMode with Kokoro costs $0 after the 86MB download with unlimited characters.

Text-to-Speech in the Browser

Q: What is the best model for text-to-speech in the browser?

onnx-community/Kokoro-82M-v1.0-ONNX (86MB) is recommended. It offers phonemizer-backed synthesis for accurate pronunciation, 29 English voices (American and British), speed control from 0.5x to 2.0x, and streaming playback support.

Generate natural-sounding speech from text using Kokoro - phonemizer-backed synthesis with 29 English voices, speed control, and streaming playback in the browser.

What Is Text-to-Speech?

Text-to-speech (TTS) converts written text into spoken audio. Kokoro 82M uses phonemizer-backed synthesis for accurate pronunciation - converting text to phonemes before generating 24kHz audio with natural prosody, appropriate pauses, emphasis, and intonation. The model ships with 29 English voices (American English and British English), supports speed control from 0.5x to 2.0x, and enables streaming playback so audio begins playing before the full utterance is synthesized. The model processes text and returns an audio Blob that can be played via the Web Audio API or downloaded as a WAV/MP3 file.

This capability is exposed through the synthesizeSpeech() function in @localmode/core. All processing runs entirely in the browser - no server, no API key, no data leaves the device. After the initial model download, text-to-speech works completely offline.

Real-World Applications

Voice studio applications for exploring voices, accents, and speed settings.
Audiobook generation from text content.
Accessibility features that read page content aloud.
Language learning tools with pronunciation examples.
Content creation, such as turning blog posts into podcasts with voice and speed selection.
Notification systems with audio alerts.
Screen readers with a natural-sounding voice.

These use cases all benefit from local, on-device processing: user data stays private, there are no per-request API costs, and the application works without internet after initial setup.

Getting Started

Install the required packages:

npm install @localmode/core @localmode/transformers

Import the core function and provider:

import { synthesizeSpeech } from '@localmode/core';
import { transformers } from '@localmode/transformers';

The recommended starting model is onnx-community/Kokoro-82M-v1.0-ONNX - it provides the best balance of quality, speed, and download size for most applications. It includes phonemizer-backed synthesis for better pronunciation, 29 English voices, speed control (0.5x-2.0x), and streaming playback support.

Code Example

import { synthesizeSpeech } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.textToSpeech('onnx-community/Kokoro-82M-v1.0-ONNX');

const { audio } = await synthesizeSpeech({
  model,
  text: 'Welcome to LocalMode. All AI runs in your browser.',
  voice: 'af_heart',   // 29 English voices available
  speed: 1.0,          // Speed control: 0.5x to 2.0x
});

// Play the audio
const url = URL.createObjectURL(audio);
const audioEl = new Audio(url);
audioEl.play();

This example demonstrates the core workflow: create a model instance from the provider, call synthesizeSpeech() with your input text, voice, and speed, and receive the generated audio as a Blob. Kokoro's phonemizer-backed pipeline converts text to phonemes before synthesis, producing more accurate pronunciation than raw character-level approaches. Streaming playback allows audio to begin playing before the full utterance is synthesized, reducing perceived latency. Transformers.js is currently the only provider for this task.
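Long inputs, such as a chapter for the audiobook use case, are often easier to handle as a sequence of shorter requests. The sketch below is one possible approach and is not LocalMode's streaming mechanism: a hypothetical chunkText() helper (not part of @localmode/core) groups sentences into chunks of at most maxChars characters so that each chunk can be passed to synthesizeSpeech() in turn. The 300-character default and the naive split on ".", "!", and "?" are assumptions chosen for illustration.

```typescript
// Hypothetical helper, not part of @localmode/core.
// Groups sentences into chunks no longer than `maxChars` characters.
// A single sentence longer than `maxChars` is kept whole as its own chunk.
function chunkText(text: string, maxChars: number = 300): string[] {
  // Naive sentence split: a run of non-terminators followed by any terminators.
  // Abbreviations and decimals (e.g. "3.5") will be split too early by this rule.
  const sentences = (text.match(/[^.!?]+[.!?]*/g) || [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (current && candidate.length > maxChars) {
      // Adding this sentence would overflow the chunk: flush and start a new one.
      chunks.push(current);
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Possible usage with the call shown earlier (model, voice, and speed as above):
//   for (const chunk of chunkText(chapterText)) {
//     const { audio } = await synthesizeSpeech({ model, text: chunk, voice: 'af_heart', speed: 1.0 });
//     // play or store `audio` before synthesizing the next chunk
//   }
```

Synthesizing chunk by chunk means the first audio is available after one short request rather than after the whole text, at the cost of possible prosody breaks at chunk boundaries.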

Available Models

The following models support text-to-speech through LocalMode. Choose based on your target device, acceptable download size, and quality requirements.

Model	Provider	Size	Speed	Quality	Voices	Languages
onnx-community/Kokoro-82M-v1.0-ONNX	Transformers.js	86MB	Medium	High	29	English (US & GB)
Xenova/speecht5_tts	Transformers.js	100MB	Medium	Basic	1	English

Choosing a model: For most applications, start with the recommended model (onnx-community/Kokoro-82M-v1.0-ONNX). It offers phonemizer-backed synthesis for accurate pronunciation, 29 English voices, adjustable speed (0.5x-2.0x), and streaming playback. If download size is the primary constraint (e.g., mobile PWA, browser extension), pick the smallest model that meets your quality bar. If quality and English voice variety are the priority, Kokoro is the clear choice.
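The selection rule above ("the smallest model that meets your quality bar") can be sketched as a small filter over the table. The pickModel() helper and the numeric quality ranking are hypothetical and only illustrate the rule; the model IDs, sizes, quality labels, and voice counts are copied from the table.

```typescript
type Quality = "Basic" | "High";

interface TtsModel {
  id: string;
  sizeMB: number;
  quality: Quality;
  voices: number;
}

// Values copied from the Available Models table.
const MODELS: TtsModel[] = [
  { id: "onnx-community/Kokoro-82M-v1.0-ONNX", sizeMB: 86, quality: "High", voices: 29 },
  { id: "Xenova/speecht5_tts", sizeMB: 100, quality: "Basic", voices: 1 },
];

// Assumed ordering of the table's quality labels.
const QUALITY_RANK: Record<Quality, number> = { Basic: 0, High: 1 };

// Hypothetical helper: the smallest model within the size budget that
// meets the minimum quality, or undefined if none qualifies.
function pickModel(maxSizeMB: number, minQuality: Quality): TtsModel | undefined {
  return MODELS
    .filter((m) => m.sizeMB <= maxSizeMB && QUALITY_RANK[m.quality] >= QUALITY_RANK[minQuality])
    .sort((a, b) => a.sizeMB - b.sizeMB)[0];
}
```

With the two models currently listed, both the size criterion and the quality criterion point to Kokoro, since it is the smaller download as well as the higher-quality option.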

Cloud vs Local: Cost and Privacy Comparison

Running text-to-speech locally eliminates per-request API costs and keeps all data on-device. Here is how the economics compare:

Service	Cost / Notes
Google Cloud TTS	$4-16 per million characters (Standard/WaveNet $4, Neural2 $16; Studio voices $160/M, Chirp 3 HD $30/M)
Amazon Polly	$4-30 per million characters (Standard $4, Neural $16, Generative $30; Long-Form voices $100/M)
ElevenLabs	$6-99/month (Starter to Pro plans; Scale and Business plans range up to $990/month)
LocalMode TTS with Kokoro	$0 after 86MB download, with unlimited characters, 29 English voices, and no per-usage fees

Google Cloud TTS costs $4/M for Standard and WaveNet voices, $16/M for Neural2, $160/M for Studio, and $30/M for Chirp 3 HD. Amazon Polly costs $4/M for Standard, $16/M for Neural, $30/M for Generative, and $100/M for Long-Form voices. ElevenLabs plans range from $6/month (Starter) to $990/month (Business). LocalMode TTS with Kokoro costs $0 after the initial ~86MB model download, with unlimited characters, 29 English voices, speed control, and no per-usage fees.

The break-even point for most applications is low: if you process more than a few hundred requests per day, local inference costs less than any cloud API within the first week. For privacy-sensitive applications (medical records, legal documents, financial data), the cost comparison is secondary - the ability to process data without it ever leaving the device is the primary value.
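To make the comparison concrete, the following sketch estimates a monthly cloud bill from request volume. The rates are the per-million-character prices from the table above; the workload figures (500 requests per day, 1,000 characters per request, 30 days) are assumptions chosen for illustration, and provider free tiers are ignored.

```typescript
// USD per million characters, copied from the comparison table.
const RATE_STANDARD = 4;  // Google Standard/WaveNet, Amazon Polly Standard
const RATE_NEURAL = 16;   // Google Neural2, Amazon Polly Neural

// Estimated cloud cost in USD for a steady workload over `days` days.
function cloudCostUsd(
  requestsPerDay: number,
  charsPerRequest: number,
  ratePerMillionChars: number,
  days: number = 30,
): number {
  const totalChars = requestsPerDay * charsPerRequest * days;
  return (totalChars / 1_000_000) * ratePerMillionChars;
}

// Assumed workload: 500 requests/day x 1,000 characters = 15M characters in 30 days.
const standardCost = cloudCostUsd(500, 1000, RATE_STANDARD); // 60
const neuralCost = cloudCostUsd(500, 1000, RATE_NEURAL);     // 240
```

Under these assumptions the cloud bill is $60 to $240 per month, while the local path has no per-character charge after the one-time model download.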

Available Providers

Transformers.js - ONNX-optimized models via ONNX Runtime Web. Supports both WebGPU and WASM backends. Broadest model catalog for non-LLM tasks.

AbortSignal Support

All synthesizeSpeech() calls support cancellation through the standard AbortSignal API:

const controller = new AbortController();

const promise = synthesizeSpeech({
  model,
  text: 'input text',
  abortSignal: controller.signal,
});

// Cancel if needed (e.g., user navigates away)
controller.abort();

This is essential for responsive UIs - cancel in-flight operations when the user navigates away, submits a new query, or closes a dialog. The underlying model inference stops immediately, freeing memory and compute resources.
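One common use of this pattern is a time limit. The sketch below wraps any signal-aware call in a hypothetical withTimeout() helper (not part of @localmode/core) that aborts the call after a given number of milliseconds. It assumes the wrapped call rejects once its signal is aborted; the exact error that synthesizeSpeech() throws on cancellation is not specified here.

```typescript
// Hypothetical helper, not part of @localmode/core.
// Runs `run` with a fresh AbortSignal and aborts it after `timeoutMs`.
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await run(controller.signal);
  } finally {
    // Always clear the timer so it cannot fire after the call has settled.
    clearTimeout(timer);
  }
}

// Possible usage with the call shown above (10-second limit is an assumption):
//   try {
//     const { audio } = await withTimeout(
//       (signal) => synthesizeSpeech({ model, text: 'input text', abortSignal: signal }),
//       10_000,
//     );
//   } catch (err) {
//     // Reached on timeout (assuming the call rejects when aborted) or on any other failure.
//   }
```

Passing the call as a function keeps the helper independent of any particular API, so the same wrapper works for other signal-aware operations.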

React Integration

If you are building a React application, @localmode/react provides hooks that manage loading states, error handling, and cancellation automatically:

npm install @localmode/react

import { useSynthesizeSpeech } from '@localmode/react';

The hook returns { data, error, isLoading, execute, cancel, reset } - providing everything a UI component needs to display progress, handle errors, and offer cancellation.

Showcase Apps

Voice Studio - Explore all 29 English voices and speed settings with live preview and streaming playback.
Audiobook Creator - Generate long-form audio from text content with voice and speed selection.

Kokoro TTS - model guide
Text Generation - task guide
Text Embeddings - task guide

Methodology

All function signatures, hook return shapes, model IDs, voice counts, and sample rates were verified directly against the LocalMode monorepo source: packages/core/src/audio/, packages/react/src/hooks/use-synthesize-speech.ts, packages/transformers/src/kokoro-voices.ts, and packages/transformers/src/models.ts. Cloud pricing figures were fetched from the official provider pricing pages in May 2026; they are subject to change - verify with the provider before making cost decisions. Model file sizes reflect the quantized (q8) ONNX build.

Sources

LocalMode documentation
onnx-community/Kokoro-82M-v1.0-ONNX model card - 82M params, 24 kHz, ~86MB q8
Xenova/speecht5_tts model card - 16 kHz output, requires HiFiGAN vocoder
phonemizer npm package - eSpeak-NG WASM for text-to-phoneme conversion used by Kokoro
Google Cloud Text-to-Speech pricing - Standard/WaveNet $4/M, Neural2 $16/M, Studio $160/M, Chirp 3 HD $30/M (May 2026)
Amazon Polly pricing - Standard $4/M, Neural $16/M, Generative $30/M, Long-Form $100/M (May 2026)
ElevenLabs pricing - Starter $6/mo, Creator $22/mo, Pro $99/mo, Scale $299/mo, Business $990/mo (May 2026)

Frequently Asked Questions