Audio
Speech-to-text transcription and text-to-speech synthesis.
Transcribe speech to text and synthesize speech from text, entirely in the browser. No servers, no API keys — audio never leaves the device.
See it in action: try Voice Notes and Meeting Assistant for working demos of these APIs.
transcribe()
Transcribe audio to text using a speech-to-text model:
```ts
import { transcribe } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.speechToText('onnx-community/moonshine-tiny-ONNX');

const { text, usage, response } = await transcribe({
  model,
  audio: audioBlob,
});

console.log(text); // "Hello, world!"
console.log(usage.audioDurationSec); // 3.5
console.log(response.modelId); // 'onnx-community/moonshine-tiny-ONNX'
```

Pass `returnTimestamps: true` to receive per-segment timestamps alongside the text:

```ts
import { transcribe } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.speechToText('onnx-community/moonshine-tiny-ONNX');

const { text, segments } = await transcribe({
  model,
  audio: audioBlob,
  returnTimestamps: true,
});

segments?.forEach((seg) => {
  console.log(`[${seg.start}s - ${seg.end}s] ${seg.text}`);
});
```

Pass an `abortSignal` to cancel a long-running transcription:

```ts
const controller = new AbortController();
setTimeout(() => controller.abort(), 30000); // Cancel after 30s

try {
  const { text } = await transcribe({
    model,
    audio: longAudioBlob,
    abortSignal: controller.signal,
  });
} catch (error) {
  if (error.name === 'AbortError') {
    console.log('Transcription was cancelled');
  }
}
```

TranscribeOptions
| Prop | Type |
| --- | --- |
| model | SpeechToTextModel |
| audio | AudioInput |
| language | string (optional) |
| task | 'transcribe' \| 'translate' (optional) |
| returnTimestamps | boolean (optional) |
| abortSignal | AbortSignal (optional) |

TranscribeResult

| Prop | Type |
| --- | --- |
| text | string |
| segments | TranscriptionSegment[] \| undefined |
| language | string |
| usage | { audioDurationSec: number; durationMs: number } |
| response | response metadata, including modelId |

TranscriptionSegment

| Prop | Type |
| --- | --- |
| text | string |
| start | number (seconds) |
| end | number (seconds) |
| confidence | number |
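Segment timestamps map directly onto caption formats. As an illustration, here is a minimal SubRip (SRT) formatter for segment values; `srtTime` and `toSrt` are hypothetical helpers, not part of @localmode:

```typescript
interface TranscriptionSegment {
  text: string;
  start: number; // seconds
  end: number; // seconds
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  const pad = (n: number, w = 2) => String(n).padStart(w, '0');
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Build an SRT document from transcription segments.
function toSrt(segments: TranscriptionSegment[]): string {
  return segments
    .map((seg, i) => `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}

console.log(toSrt([{ text: 'Hello, world!', start: 0, end: 2.5 }]));
// 1
// 00:00:00,000 --> 00:00:02,500
// Hello, world!
```

The resulting string can be saved as a `.srt` file or converted to WebVTT by swapping the comma for a dot in the timestamps.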
Language & Translation
Specify a language hint or translate non-English audio to English:
```ts
// Transcribe German audio with a language hint
const { text, language } = await transcribe({
  model,
  audio: germanAudioBlob,
  language: 'de',
});

// Translate French audio to English
const { text: englishText } = await transcribe({
  model,
  audio: frenchAudioBlob,
  task: 'translate',
});
```

synthesizeSpeech()
Synthesize speech from text using a text-to-speech model:
```ts
import { synthesizeSpeech } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.textToSpeech('onnx-community/Kokoro-82M-v1.0-ONNX');

const { audio, sampleRate, usage } = await synthesizeSpeech({
  model,
  text: 'Hello, how are you today?',
});

// Play the audio
const audioUrl = URL.createObjectURL(audio);
const audioElement = new Audio(audioUrl);
audioElement.play();

console.log(`Sample rate: ${sampleRate}Hz`);
console.log(`Generated ${usage.characterCount} characters in ${usage.durationMs}ms`);
```

Adjust the voice, speaking speed, and pitch:

```ts
import { synthesizeSpeech } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const model = transformers.textToSpeech('onnx-community/Kokoro-82M-v1.0-ONNX');

const { audio } = await synthesizeSpeech({
  model,
  text: 'Welcome to the application.',
  voice: 'af_heart',
  speed: 1.2,
  pitch: 1.0,
});

const audioUrl = URL.createObjectURL(audio);
const audioElement = new Audio(audioUrl);
audioElement.play();
```

Pass an `abortSignal` to cancel long-running synthesis:

```ts
const controller = new AbortController();
setTimeout(() => controller.abort(), 10000); // Cancel after 10s

try {
  const { audio } = await synthesizeSpeech({
    model,
    text: longArticleText,
    abortSignal: controller.signal,
  });
} catch (error) {
  if (error.name === 'AbortError') {
    console.log('Speech synthesis was cancelled');
  }
}
```

SynthesizeSpeechOptions
| Prop | Type |
| --- | --- |
| model | TextToSpeechModel |
| text | string |
| voice | string (optional) |
| speed | number (optional) |
| pitch | number (optional) |
| abortSignal | AbortSignal (optional) |

SynthesizeSpeechResult

| Prop | Type |
| --- | --- |
| audio | Blob |
| sampleRate | number |
| usage | { characterCount: number; durationMs: number } |
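For long inputs like the `longArticleText` abort example above, an alternative to cancelling a single large request is to split the text into sentence-sized chunks and synthesize them one at a time, so cancellation takes effect between chunks. A sketch, where `chunkText` is an illustrative helper and not part of @localmode:

```typescript
// Split text into chunks of at most maxLen characters, breaking on
// sentence boundaries where possible.
function chunkText(text: string, maxLen = 300): string[] {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [text];
  const chunks: string[] = [];
  let current = '';
  for (const sentence of sentences) {
    if (current && current.length + sentence.length > maxLen) {
      chunks.push(current.trim());
      current = '';
    }
    current += sentence;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

// Usage sketch with synthesizeSpeech:
// for (const chunk of chunkText(longArticleText)) {
//   if (controller.signal.aborted) break; // stop between chunks
//   const { audio } = await synthesizeSpeech({ model, text: chunk });
//   // queue audio for playback
// }

console.log(chunkText('First sentence. Second sentence.', 20));
// [ 'First sentence.', 'Second sentence.' ]
```

Chunking also lets playback begin as soon as the first chunk is ready instead of waiting for the whole article to be synthesized.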
Audio Input Types
The AudioInput type accepts three formats: Blob | ArrayBuffer | Float32Array. Use whichever is most convenient for your source — file inputs produce Blob, fetch responses give ArrayBuffer, and the Web Audio API provides Float32Array.
```ts
// From a file input
const blob = fileInput.files[0]; // Blob
await transcribe({ model, audio: blob });

// From a fetch response
const buffer = await fetch('/audio.wav').then((r) => r.arrayBuffer()); // ArrayBuffer
await transcribe({ model, audio: buffer });

// From the Web Audio API
const audioContext = new AudioContext();
const audioBuffer = await audioContext.decodeAudioData(buffer.slice(0));
const samples = audioBuffer.getChannelData(0); // Float32Array
await transcribe({ model, audio: samples });
```

Custom Provider
Implement the SpeechToTextModel and TextToSpeechModel interfaces to create custom audio providers:
SpeechToTextModel
```ts
import type { SpeechToTextModel, DoTranscribeOptions, DoTranscribeResult } from '@localmode/core';

class MySpeechToText implements SpeechToTextModel {
  readonly modelId = 'custom:my-stt';
  readonly provider = 'custom';
  readonly languages = ['en', 'de', 'fr'];

  async doTranscribe(options: DoTranscribeOptions): Promise<DoTranscribeResult> {
    const { audio, language, returnTimestamps, abortSignal } = options;

    // Your transcription logic here
    const startTime = performance.now();

    return {
      text: 'Transcribed text...',
      segments: returnTimestamps
        ? [{ text: 'Transcribed text...', start: 0, end: 2.5, confidence: 0.95 }]
        : undefined,
      language: language ?? 'en',
      usage: {
        audioDurationSec: 2.5,
        durationMs: performance.now() - startTime,
      },
    };
  }
}
```

TextToSpeechModel
```ts
import type { TextToSpeechModel, DoSynthesizeOptions, DoSynthesizeResult } from '@localmode/core';

class MyTextToSpeech implements TextToSpeechModel {
  readonly modelId = 'custom:my-tts';
  readonly provider = 'custom';
  readonly voices = ['default', 'narrator'];

  async doSynthesize(options: DoSynthesizeOptions): Promise<DoSynthesizeResult> {
    const { text, voice, speed, abortSignal } = options;

    // Your synthesis logic here
    const startTime = performance.now();
    const audioBlob = new Blob([], { type: 'audio/wav' });

    return {
      audio: audioBlob,
      sampleRate: 24000,
      usage: {
        characterCount: text.length,
        durationMs: performance.now() - startTime,
      },
    };
  }
}
```

For recommended models, provider-specific options, and practical recipes, see the Speech-to-Text and Text-to-Speech Transformers provider guides.
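The `doSynthesize` sketch above returns an empty `audio/wav` Blob. A custom provider that produces raw Float32Array samples needs to wrap them in a WAV container before returning them; a minimal 16-bit PCM mono encoder might look like the following (`encodeWav` is an illustrative helper, not part of @localmode):

```typescript
// Encode mono Float32Array samples in [-1, 1] as a 16-bit PCM WAV file.
function encodeWav(samples: Float32Array, sampleRate: number): ArrayBuffer {
  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  const writeString = (offset: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };

  writeString(0, 'RIFF');
  view.setUint32(4, 36 + samples.length * 2, true); // RIFF chunk size
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // audio format: PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeString(36, 'data');
  view.setUint32(40, samples.length * 2, true);

  // Clamp and scale each sample to a signed 16-bit integer.
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}

// In doSynthesize, wrap the encoded buffer for DoSynthesizeResult:
// const audio = new Blob([encodeWav(samples, 24000)], { type: 'audio/wav' });
```

The resulting Blob plays directly in an Audio element or via URL.createObjectURL, matching what the transformers TTS provider returns.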
Next Steps

- Speech-to-Text (Transformers): Transcribe audio with Whisper and Moonshine models.
- Text-to-Speech (Transformers): Synthesize speech with Kokoro and other TTS models.
- Text Generation: Generate and stream text with language models.
- Middleware: Add retry, caching, and logging to any function.