Live Transcribe

Build voice-first interfaces (push-to-talk, open-mic, and full conversational loops) entirely in the browser. Live transcription layers on top of any SpeechToTextModel (onnx-community/moonshine-tiny-ONNX, Whisper, or a custom model) and adds built-in VAD and barge-in.

Live transcription is additive — the existing transcribe() function and SpeechToTextModel interface are unchanged. createLiveTranscriber() calls model.doTranscribe() per chunk, so any existing custom model class works.

Overview

createLiveTranscriber() returns a controller that:

Acquires the microphone via getUserMedia (with echoCancellation and noiseSuppression).
Captures audio as 16 kHz Float32Array PCM through an AudioWorkletNode (or a ScriptProcessorNode fallback).
Runs a Voice Activity Detector (energy or silero) to auto-segment utterances.
Calls model.doTranscribe() at a configurable cadence to emit partial chunks while the user is still speaking.
Detects barge-in (user speaks during external TTS playback) and aborts in-flight work.

Two modes: 'push-to-talk' (caller drives start() / stop()) and 'open-mic' (VAD drives utterance boundaries).
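
The cadence-driven partial chunks described in the overview can be pictured as a small counter over captured samples. The following is a minimal sketch under stated assumptions, not the library's implementation: `PartialCadence`, `cadenceMs`, and `push` are hypothetical names, and the 16 kHz default simply mirrors the capture rate mentioned above.

```typescript
// Hypothetical helper (not part of @localmode/core): decides when enough new
// audio has arrived to justify another partial transcription pass.
class PartialCadence {
  private samplesSinceLastEmit = 0;
  private readonly samplesPerEmit: number;

  constructor(cadenceMs: number, sampleRate: number = 16000) {
    this.samplesPerEmit = Math.round((cadenceMs / 1000) * sampleRate);
  }

  /** Record one captured frame; returns true when a partial pass is due. */
  push(frame: Float32Array): boolean {
    this.samplesSinceLastEmit += frame.length;
    if (this.samplesSinceLastEmit < this.samplesPerEmit) return false;
    this.samplesSinceLastEmit = 0;
    return true;
  }
}
```

With a 500 ms cadence and 512-sample frames, a partial pass would become due on the 16th frame, once 8,000 samples have accumulated.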

createLiveTranscriber()

import { createLiveTranscriber } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const transcriber = await createLiveTranscriber({
  model: transformers.speechToText('onnx-community/moonshine-tiny-ONNX'),
  mode: 'push-to-talk',
  vad: 'energy',
});

transcriber.onChunk((chunk) => {
  console.log(chunk.text, chunk.isFinal);
});
transcriber.onUtteranceEnd((u) => {
  console.log('Final:', u.text);
});

button.addEventListener('mousedown', () => transcriber.start());
button.addEventListener('mouseup', () => transcriber.stop());

import { createLiveTranscriber } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const transcriber = await createLiveTranscriber({
  model: transformers.speechToText('onnx-community/moonshine-tiny-ONNX'),
  mode: 'open-mic',
  vad: transformers.vad('onnx-community/silero-vad-ONNX'),
});

transcriber.onUtteranceEnd((u) => {
  // Triggered automatically when the user stops speaking.
  console.log('User said:', u.text);
});

await transcriber.start();

import type { VADProvider, VADStartOptions } from '@localmode/core';

class MyVAD implements VADProvider {
  readonly provider = 'custom';
  readonly frameSize = 512;
  readonly sampleRate = 16000;

  async start(options: VADStartOptions) { /* ... */ }
  processFrame(samples: Float32Array) { /* ... */ }
  async stop() { /* ... */ }
  async dispose() { /* ... */ }
}

const transcriber = await createLiveTranscriber({
  model,
  mode: 'open-mic',
  vad: new MyVAD(),
});

Options

Prop

Type

Events

The controller exposes .onChunk, .onUtteranceEnd, .onBargeIn, .onError, and .onStateChange. Each returns an unsubscribe function.

const off = transcriber.onChunk((chunk) => { /* ... */ });
// Later:
off();
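
For readers unfamiliar with the "subscribe returns unsubscribe" contract, here is a minimal sketch of the pattern. `Emitter` is a hypothetical class written for illustration; it is an assumption about the general shape, not the controller's actual internals.

```typescript
// Hypothetical minimal emitter illustrating the unsubscribe contract.
type Listener<T> = (payload: T) => void;

class Emitter<T> {
  private listeners = new Set<Listener<T>>();

  /** Register a listener and return a function that removes it again. */
  on(listener: Listener<T>): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  emit(payload: T): void {
    // Iterate over a copy so a listener may unsubscribe during dispatch.
    for (const listener of Array.from(this.listeners)) listener(payload);
  }
}

// Usage: the second emit is not observed because off() ran first.
const chunks = new Emitter<{ text: string; isFinal: boolean }>();
const seen: string[] = [];
const off = chunks.on((chunk) => {
  seen.push(chunk.text);
});
chunks.emit({ text: 'hello', isFinal: false });
off();
chunks.emit({ text: 'ignored', isFinal: true });
```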

VAD strategies

Strategy	Pros	Cons
`'energy'` (built-in)	Zero downloads, ~150 lines, AudioWorklet-based	RMS thresholding can be fooled by background noise
`transformers.vad(...)` (silero)	De-facto standard, works in noise	~1.8 MB ONNX model download

The energy VAD is the default. For production-quality open-mic, prefer silero:

import { transformers } from '@localmode/transformers';

const vad = transformers.vad('onnx-community/silero-vad-ONNX');
await vad.warmUp(); // Pre-load before the user clicks "start"
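
For contrast with silero, the RMS thresholding listed in the table can be sketched in a few lines. This is an illustrative approximation rather than the built-in energy VAD: `rmsDbfs`, `EnergyGate`, the -40 dBFS threshold, and the hangover count are all assumptions chosen for the example.

```typescript
// Hypothetical sketch of RMS-based speech detection; thresholds are illustrative.
function rmsDbfs(frame: Float32Array): number {
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) sumSquares += frame[i] * frame[i];
  const rms = Math.sqrt(sumSquares / Math.max(frame.length, 1));
  // Floor at -100 dBFS so silent frames do not produce -Infinity.
  return rms > 0 ? Math.max(20 * Math.log10(rms), -100) : -100;
}

class EnergyGate {
  private speaking = false;
  private quietFrames = 0;
  private readonly thresholdDb: number;
  private readonly hangoverFrames: number;

  constructor(thresholdDb: number = -40, hangoverFrames: number = 10) {
    this.thresholdDb = thresholdDb;
    this.hangoverFrames = hangoverFrames;
  }

  /** Returns 'start' or 'end' on an utterance boundary, otherwise null. */
  process(frame: Float32Array): 'start' | 'end' | null {
    const loud = rmsDbfs(frame) >= this.thresholdDb;
    if (loud) {
      this.quietFrames = 0;
      if (!this.speaking) {
        this.speaking = true;
        return 'start';
      }
      return null;
    }
    // Require several consecutive quiet frames before ending the utterance.
    if (this.speaking && ++this.quietFrames >= this.hangoverFrames) {
      this.speaking = false;
      this.quietFrames = 0;
      return 'end';
    }
    return null;
  }
}
```

Because the gate only looks at frame energy, steady background noise above the threshold would keep it open, which is the weakness the table attributes to this strategy.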

Barge-in

When the agent is speaking (TTS playback) and the user starts speaking again, bargeInWhilePlaying lets the transcriber stop the audio and capture the new utterance immediately.

const audioElement = new Audio(ttsBlobUrl);
const handle = {
  isPlaying: () => !audioElement.paused && !audioElement.ended,
  stop: () => audioElement.pause(),
};

const transcriber = await createLiveTranscriber({
  model,
  mode: 'open-mic',
  bargeInWhilePlaying: handle,
});

transcriber.onBargeIn((event) => {
  console.log('User interrupted at', event.audioLevelDb, 'dBFS');
});

The end-to-end latency from VAD speechStart to handle.stop() is typically under 50 ms on the AudioWorklet path.
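
The decision made at speech start can be sketched as follows. The `PlaybackHandle` shape comes from the example above; `onSpeechStart`, its parameters, and the ordering of the steps are assumptions for illustration, not the library's code.

```typescript
// Shape taken from the example above: anything with isPlaying() and stop().
interface PlaybackHandle {
  isPlaying(): boolean;
  stop(): void;
}

// Hypothetical handler invoked when the VAD reports the start of speech.
// Returns true when the speech counted as a barge-in.
function onSpeechStart(
  playback: PlaybackHandle | undefined,
  inFlight: AbortController | undefined,
  audioLevelDb: number,
  notify: (event: { audioLevelDb: number }) => void,
): boolean {
  // No playback in progress: this is an ordinary utterance.
  if (!playback || !playback.isPlaying()) return false;
  playback.stop(); // Silence the agent first to keep perceived latency low.
  if (inFlight) inFlight.abort(); // Cancel work tied to the interrupted turn.
  notify({ audioLevelDb });
  return true;
}
```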

AbortSignal

createLiveTranscriber({ abortSignal }) honors the signal at construction and during the session. Aborting cancels in-flight doTranscribe() calls, releases the MediaStream, and closes the AudioContext.

const ac = new AbortController();
const transcriber = await createLiveTranscriber({ model, abortSignal: ac.signal });
await transcriber.start();
// ...
ac.abort(); // Releases everything.
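
To illustrate what "releases everything" could involve, the sketch below ties resource cleanup to an AbortSignal. `bindCleanupToSignal`, `releaseAudio`, and the `AudioResources` interface are hypothetical; only `getTracks()`, `stop()`, and `close()` correspond to real MediaStream and AudioContext methods.

```typescript
// Minimal structural types; a real MediaStream and AudioContext satisfy them.
interface AudioResources {
  stream: { getTracks(): Array<{ stop(): void }> };
  context: { close(): Promise<void> };
}

// Hypothetical: stop every microphone track and close the audio context.
function releaseAudio(resources: AudioResources): void {
  resources.stream.getTracks().forEach((track) => track.stop());
  resources.context.close().catch(() => {
    // Closing an already-closed context can reject; nothing left to release.
  });
}

// Hypothetical: run cleanup once when the signal aborts (or immediately if it
// already has). Returns a function that detaches the listener.
function bindCleanupToSignal(signal: AbortSignal, cleanup: () => void): () => void {
  if (signal.aborted) {
    cleanup();
    return () => {};
  }
  const onAbort = () => cleanup();
  signal.addEventListener('abort', onAbort, { once: true });
  return () => signal.removeEventListener('abort', onAbort);
}
```

Checking `signal.aborted` up front matters because a signal that is already aborted never fires its `abort` event again.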

createTurnTaker()

Higher-level orchestrator for full voice loops:

import { createLiveTranscriber, createTurnTaker } from '@localmode/core';
import { transformers } from '@localmode/transformers';

const transcriber = await createLiveTranscriber({
  model: transformers.speechToText('onnx-community/moonshine-tiny-ONNX'),
  mode: 'open-mic',
  vad: transformers.vad('onnx-community/silero-vad-ONNX'),
});

const turn = await createTurnTaker({
  transcriber,
  planner: transformers.languageModel('onnx-community/Qwen3-0.6B-ONNX'),
  voice: transformers.textToSpeech('onnx-community/Kokoro-82M-v1.0-ONNX'),
  systemPrompt: 'Be concise.',
});

turn.onUserUtterance((t) => console.log('user:', t));
turn.onAgentResponse((t) => console.log('agent:', t));
turn.onStateTransition((e) => console.log(e.from, '→', e.to));

await turn.start();

State machine: idle → listening → planning → speaking → listening. Barge-in aborts the planner and TTS and returns to 'listening'. turn.interrupt() is the programmatic equivalent.
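
The state machine above can be written as a transition table. This is a sketch for reasoning about the loop, assuming hypothetical event names (`start`, `utteranceEnd`, `planReady`, `speechDone`, `bargeIn`); the library's internal events may differ.

```typescript
type TurnState = 'idle' | 'listening' | 'planning' | 'speaking';
// Event names are hypothetical labels for the arrows in the state machine.
type TurnEvent = 'start' | 'utteranceEnd' | 'planReady' | 'speechDone' | 'bargeIn';

const transitions: Record<TurnState, Partial<Record<TurnEvent, TurnState>>> = {
  idle: { start: 'listening' },
  listening: { utteranceEnd: 'planning' },
  // Barge-in aborts the planner or TTS and returns to listening.
  planning: { planReady: 'speaking', bargeIn: 'listening' },
  speaking: { speechDone: 'listening', bargeIn: 'listening' },
};

function nextState(state: TurnState, event: TurnEvent): TurnState {
  // Events that do not apply in the current state leave it unchanged.
  return transitions[state][event] ?? state;
}
```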

React hooks

import { useLiveTranscribe } from '@localmode/react';
import { transformers } from '@localmode/transformers';

function VoiceInput() {
  const { state, currentUtterance, lastUtterance, start, stop } = useLiveTranscribe({
    model: transformers.speechToText('onnx-community/moonshine-tiny-ONNX'),
    mode: 'push-to-talk',
  });

  return (
    <>
      <button onMouseDown={start} onMouseUp={stop} disabled={state === 'disposed'}>
        {state === 'listening' ? 'Listening…' : 'Hold to talk'}
      </button>
      {currentUtterance && <p className="opacity-60">{currentUtterance}</p>}
      {lastUtterance && <p>{lastUtterance.text}</p>}
    </>
  );
}

useTurnTaker follows the same pattern. Both hooks construct the underlying controller lazily on the first start() call, so the getUserMedia permission prompt happens during a user gesture.
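
Lazy construction of this kind is commonly implemented by memoizing the factory's promise. The helper below is a generic sketch, assuming a hypothetical `lazyOnce` name; it does not reflect how the hooks are actually written.

```typescript
// Hypothetical helper: defer an async factory until first use and share the
// same promise across later calls, so construction happens at most once.
function lazyOnce<T>(factory: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (!pending) pending = factory();
    return pending;
  };
}
```

A start() handler built this way would run the factory only inside the first click or key press, which keeps the permission prompt inside a user gesture.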

Concurrency

Browsers expose a single microphone per tab. Constructing two LiveTranscriber instances in the same tab will not throw, but both will suffer degraded audio quality as they compete for the same hardware. Recommended: one transcriber per tab. The library deliberately does not enforce this (for example via Web Locks); treat it as a documented constraint that the application is responsible for.
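
Since the constraint is left to the application, one option is a small in-tab guard around transcriber creation. The sketch below is an assumption about how an app might do this; `claimMicrophone` is a hypothetical helper, not a library export.

```typescript
// Hypothetical app-level guard: allows one active claim per tab (per module).
let micInUse = false;

/** Claim the microphone; returns a release function. Throws if already claimed. */
function claimMicrophone(): () => void {
  if (micInUse) {
    throw new Error('A live transcriber is already active in this tab.');
  }
  micInUse = true;
  let released = false;
  return () => {
    if (released) return; // Releasing twice is a harmless no-op.
    released = true;
    micInUse = false;
  };
}
```

An application would call `claimMicrophone()` before constructing a transcriber and invoke the returned function after disposing it.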

Browser compatibility

Path	Requirements
AudioWorklet (preferred)	Chrome 66+, Firefox 76+, Safari 14.1+
ScriptProcessorNode (fallback)	All modern browsers (deprecated; logs a warning)
`getUserMedia`	Secure context (HTTPS or `localhost`)

createLiveTranscriber() automatically falls back to ScriptProcessorNode when AudioWorklet is unavailable. In strict-CSP environments (Chrome MV3 service workers, sites blocking blob: URLs), pass workletUrl: '/path/to/energy-vad.worklet.js' and serve the worklet source yourself.

The energy VAD worklet ships as an inline string (registered via URL.createObjectURL) so no separate .worklet.js asset is required from your bundler. Source is available via the exported ENERGY_VAD_WORKLET_SOURCE constant if you need to inspect or self-host it.
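
Registering a worklet from an inline string generally follows the Blob URL pattern sketched below. `sourceToBlobUrl`, `registerInlineWorklet`, and the `WorkletHost` interface are hypothetical names; `Blob`, `URL.createObjectURL`, `URL.revokeObjectURL`, and `audioWorklet.addModule` are standard web APIs. Whether the library revokes the URL afterwards is not stated, so that step is an assumption.

```typescript
// Minimal structural type; a real AudioContext satisfies it.
interface WorkletHost {
  audioWorklet: { addModule(url: string): Promise<void> };
}

// Wrap worklet source code in a Blob and expose it as a blob: URL.
function sourceToBlobUrl(source: string): string {
  const blob = new Blob([source], { type: 'application/javascript' });
  return URL.createObjectURL(blob);
}

// Hypothetical registration helper: load the module, then free the URL.
async function registerInlineWorklet(context: WorkletHost, source: string): Promise<void> {
  const url = sourceToBlobUrl(source);
  try {
    await context.audioWorklet.addModule(url);
  } finally {
    URL.revokeObjectURL(url); // The module is loaded; the URL is no longer needed.
  }
}
```

This is also why a policy that blocks blob: URLs breaks the inline path and requires a self-hosted worklet file instead.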

Capability detection

import { isLiveTranscribeSupported, isAudioWorkletSupported, isMediaCaptureSupported } from '@localmode/core';

if (!isLiveTranscribeSupported()) {
  // Show a fallback UI.
}

createCapabilityReport() includes a liveTranscribe section with { getUserMedia, audioWorklet, scriptProcessor, crossOriginIsolated }.
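
For intuition, checks like these usually reduce to feature detection on browser globals. The function below is a rough sketch that mirrors the field names of the liveTranscribe section; `detectLiveTranscribeSupport` is hypothetical and the library's actual checks may be stricter.

```typescript
// Hypothetical feature detection; safe to call outside a browser, where every
// field simply reports false.
function detectLiveTranscribeSupport(): {
  getUserMedia: boolean;
  audioWorklet: boolean;
  scriptProcessor: boolean;
  crossOriginIsolated: boolean;
} {
  const g = globalThis as any;
  const AudioContextCtor = g.AudioContext ?? g.webkitAudioContext;
  return {
    // Microphone capture requires navigator.mediaDevices in a secure context.
    getUserMedia: typeof g.navigator?.mediaDevices?.getUserMedia === 'function',
    audioWorklet: typeof g.AudioWorkletNode === 'function',
    // Deprecated fallback path.
    scriptProcessor: typeof AudioContextCtor?.prototype?.createScriptProcessor === 'function',
    crossOriginIsolated: g.crossOriginIsolated === true,
  };
}
```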
