Overview

MediaPipe Tasks provider for LocalMode — hand, pose, and face landmark detection, gesture recognition, audio classification, language detection, and more via Google's on-device WASM runtime. Works in all modern browsers.

@localmode/mediapipe

Run Google's MediaPipe Tasks directly in the browser. @localmode/mediapipe wraps @mediapipe/tasks-vision, @mediapipe/tasks-audio, and @mediapipe/tasks-text as a single unified LocalMode provider — hand, pose, and face landmarks, gesture recognition, image and audio classification, image embeddings, language detection, and text embeddings, plus real-time streaming trackers for video.

Privacy-first by design

Every task runs entirely on-device via WebAssembly. Camera frames, microphone audio, and text input never leave the browser. The only network requests are the one-time model and WASM downloads.

Features

  • 13-model catalog — landmarks, gestures, classification, detection, segmentation, embeddings, and language detection, all curated and verified against Google's CDN
  • Real-time streaming — createHandTracker / createPoseTracker / createFaceTracker / createGestureTracker run over a <video> element at 30–60fps
  • Universal browser support — pure WebAssembly + WebGL; no WebGPU required
  • Tiny models — most are under 10MB; face detection and selfie segmentation are ~250KB
  • Unified LocalMode interface — landmark tasks call detectHands(), detectPose(), etc.; classification/detection/embedding tasks reuse the existing classifyImage(), detectObjects(), embed() functions
  • AbortSignal cancellation — every single-frame function supports cancellation
  • GPU or CPU delegate — choose the WebGL GPU delegate (default) or the CPU delegate per provider or per model

What is not included

Two MediaPipe capabilities are intentionally excluded. First, audio embeddings: the @mediapipe/tasks-audio package ships only an AudioClassifier, so there is no audio embedder in the JS SDK. Second, on-device LLM inference: the @mediapipe/tasks-genai package is not wrapped; use @localmode/litert or @localmode/wllama for local language model inference instead.

Installation

Install with any one of the following package managers:

pnpm install @localmode/mediapipe @localmode/core
npm install @localmode/mediapipe @localmode/core
yarn add @localmode/mediapipe @localmode/core
bun add @localmode/mediapipe @localmode/core

The package depends on @mediapipe/tasks-vision, @mediapipe/tasks-audio, and @mediapipe/tasks-text — all installed automatically. The WASM runtime loads from the jsDelivr CDN by default; see WASM Runtime to self-host it.

Quick Start

Single-frame detection

Detect hand landmarks in a still image:

import { detectHands } from '@localmode/core';
import { mediapipe } from '@localmode/mediapipe';

const { hands } = await detectHands({
  model: mediapipe.handLandmarker(),
  image: imageBlob,
  numHands: 2,
});

for (const hand of hands) {
  console.log(`${hand.handedness} hand — ${hand.landmarks.length} landmarks`);
}
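
To draw the detected landmarks on a canvas, the coordinates usually need to be scaled to pixels first. The sketch below assumes each landmark carries normalized x and y values in the range [0, 1], as MediaPipe's own NormalizedLandmark type does; the Landmark interface and the toPixels helper are illustrative names, not part of @localmode/core.

```typescript
// Assumed landmark shape: x and y normalized to [0, 1] relative to the image size.
interface Landmark {
  x: number;
  y: number;
  z?: number;
}

interface PixelPoint {
  px: number;
  py: number;
}

// Scale normalized landmarks to pixel coordinates for a frame of the given size.
function toPixels(landmarks: readonly Landmark[], width: number, height: number): PixelPoint[] {
  return landmarks.map(({ x, y }) => ({
    px: Math.round(x * width),
    py: Math.round(y * height),
  }));
}

// Example: two landmarks on a 640x480 frame.
const points = toPixels([{ x: 0.5, y: 0.5 }, { x: 0.25, y: 1 }], 640, 480);
console.log(points); // [ { px: 320, py: 240 }, { px: 160, py: 480 } ]
```

With real results, a call such as toPixels(hand.landmarks, canvas.width, canvas.height) would be the natural entry point, provided the landmark shape matches the assumption above.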

Real-time streaming

Track hands live from a webcam <video> element:

import { mediapipe } from '@localmode/mediapipe';

const tracker = mediapipe.createHandTracker({
  video: videoElement,
  numHands: 2,
  onResults: (hands, timestampMs) => {
    // Called once per processed frame (up to ~60fps)
    drawHands(hands);
  },
});

await tracker.start();

// later
tracker.stop();
await tracker.close();
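
Because onResults receives a timestamp with every processed frame, the actual tracking rate can be measured directly. The following is a minimal sketch of a rolling FPS meter; createFpsMeter is a hypothetical helper, and it assumes timestampMs is a monotonically increasing value in milliseconds.

```typescript
// Rolling frames-per-second estimate built from per-frame timestamps in milliseconds.
function createFpsMeter(windowSize = 30) {
  const stamps: number[] = [];
  return {
    // Record one frame; returns the FPS over the current window (0 until two frames exist).
    tick(timestampMs: number): number {
      stamps.push(timestampMs);
      if (stamps.length > windowSize) stamps.shift();
      if (stamps.length < 2) return 0;
      // The newest stamp is timestampMs itself, so the window spans oldest..newest.
      const elapsedMs = timestampMs - (stamps[0] ?? timestampMs);
      return elapsedMs > 0 ? ((stamps.length - 1) * 1000) / elapsedMs : 0;
    },
  };
}

// Simulated timestamps 20 ms apart, i.e. 50 frames per second.
const meter = createFpsMeter();
let fps = 0;
for (let t = 0; t <= 200; t += 20) fps = meter.tick(t);
console.log(fps); // 50
```

Inside a tracker, the meter would be fed from the callback, for example by calling meter.tick(timestampMs) at the top of onResults.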

Tasks

The provider exposes a factory method per task. Landmark and gesture tasks return MediaPipe-specific model interfaces used with new core functions; the remaining tasks return standard LocalMode interfaces and reuse existing core functions.

Landmarks & Gestures

| Method | Interface | Core function | Docs |
| --- | --- | --- | --- |
| mediapipe.handLandmarker() | HandLandmarkModel | detectHands() | Guide |
| mediapipe.poseLandmarker() | PoseLandmarkModel | detectPose() | Guide |
| mediapipe.faceLandmarker() | FaceLandmarkModel | detectFaceLandmarks() | Guide |
| mediapipe.faceDetector() | FaceDetectionModel | detectFace() | Guide |
| mediapipe.gestureRecognizer() | GestureRecognitionModel | recognizeGesture() | Guide |

Vision (standard core interfaces)

| Method | Interface | Core function |
| --- | --- | --- |
| mediapipe.imageClassifier() | ImageClassificationModel | classifyImage() |
| mediapipe.objectDetector() | ObjectDetectionModel | detectObjects() |
| mediapipe.imageSegmenter() | SegmentationModel | segmentImage() |
| mediapipe.imageEmbedder() | ImageFeatureModel | extractImageFeatures() |

Audio & Text

| Method | Interface | Core function | Docs |
| --- | --- | --- | --- |
| mediapipe.audioClassifier() | AudioClassificationModel | classifyAudio() | Guide |
| mediapipe.textEmbedder() | EmbeddingModel | embed() / embedMany() | Guide |
| mediapipe.languageDetector() | LanguageDetectionModel | detectLanguage() | Guide |
| mediapipe.textClassifier(modelPath) | ClassificationModel | classify() | Guide |
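
Vectors produced through embed() and embedMany() are typically compared with cosine similarity. The sketch below is a generic helper, assuming each embedding is available as a plain array or typed array of numbers; the exact result shape returned by the core functions is not shown on this page.

```typescript
// Cosine similarity between two equal-length vectors; returns 0 if either vector is all zeros.
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1
console.log(cosineSimilarity([1, 0], [0, 1])); // 0
```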

textClassifier requires a custom model

MediaPipe ships no default text classifier — mediapipe.textClassifier() requires an explicit custom-trained .tflite model URL (built with MediaPipe Model Maker). Calling it without a path throws a ValidationError. See the Text guide.

Streaming trackers

| Factory | Per-frame callback | Docs |
| --- | --- | --- |
| mediapipe.createHandTracker() | (hands: HandLandmarkResultItem[], timestampMs) | Guide |
| mediapipe.createPoseTracker() | (poses: PoseLandmarkResultItem[], timestampMs) | Guide |
| mediapipe.createFaceTracker() | (faces: FaceLandmarkResultItem[], timestampMs) | Guide |
| mediapipe.createGestureTracker() | (gestures: GestureResultItem[], timestampMs) | Guide |

Model Catalog

MEDIAPIPE_MODELS ships 13 curated models. Every model is verified against Google's public CDN (storage.googleapis.com). Vision models use the .task bundle format; audio and text models are raw .tflite files.

import { MEDIAPIPE_MODELS } from '@localmode/mediapipe';

| Catalog ID | Model | Domain | Size |
| --- | --- | --- | --- |
| hand_landmarker | Hand Landmarker | vision | 7.8MB |
| pose_landmarker | Pose Landmarker (Lite) | vision | 5.8MB |
| pose_landmarker_full | Pose Landmarker (Full) | vision | 9.4MB |
| face_landmarker | Face Landmarker | vision | 3.8MB |
| face_detector | Face Detector (BlazeFace) | vision | 230KB |
| gesture_recognizer | Gesture Recognizer | vision | 8.4MB |
| image_classifier | Image Classifier (EfficientNet-Lite0) | vision | 18.6MB |
| object_detector | Object Detector (EfficientDet-Lite0) | vision | 7.3MB |
| image_segmenter | Image Segmenter (Selfie) | vision | 250KB |
| image_embedder | Image Embedder (MobileNet-V3 Small) | vision | 4.1MB |
| audio_classifier | Audio Classifier (YAMNet) | audio | 4.1MB |
| language_detector | Language Detector | text | 315KB |
| text_embedder | Text Embedder (Universal Sentence Encoder) | text | 6.1MB |
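
The sizes in the table make it straightforward to estimate the first-load download for a given feature set. The sketch below copies the sizes by hand into a local map rather than reading MEDIAPIPE_MODELS, whose shape is not documented here; it treats 1MB as 1000KB and ignores the WASM runtime itself, so the result is only a rough budget.

```typescript
// Approximate sizes from the catalog table, in kilobytes (1 MB taken as 1000 KB).
const CATALOG_SIZE_KB = new Map<string, number>([
  ['hand_landmarker', 7800],
  ['pose_landmarker', 5800],
  ['pose_landmarker_full', 9400],
  ['face_landmarker', 3800],
  ['face_detector', 230],
  ['gesture_recognizer', 8400],
  ['image_classifier', 18600],
  ['object_detector', 7300],
  ['image_segmenter', 250],
  ['image_embedder', 4100],
  ['audio_classifier', 4100],
  ['language_detector', 315],
  ['text_embedder', 6100],
]);

// Sum the model download size for the catalog IDs an app actually uses.
function totalDownloadKb(ids: readonly string[]): number {
  let total = 0;
  for (const id of ids) {
    const size = CATALOG_SIZE_KB.get(id);
    if (size === undefined) throw new Error(`Unknown catalog ID: ${id}`);
    total += size;
  }
  return total;
}

// A gesture-controlled UI that also needs face detection:
console.log(totalDownloadKb(['hand_landmarker', 'gesture_recognizer', 'face_detector'])); // 16430
```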

Each factory method uses its catalog default. Pass a catalog ID or a direct model URL to override:

// Catalog default
const lite = mediapipe.poseLandmarker();

// Catalog ID — higher-accuracy pose model
const full = mediapipe.poseLandmarker('pose_landmarker_full');

// Direct URL to any compatible MediaPipe model file
const custom = mediapipe.handLandmarker('https://your-cdn.com/hand_landmarker.task');

You can also set an explicit modelPath in the per-model settings, which always takes precedence:

const model = mediapipe.handLandmarker('hand_landmarker', {
  modelPath: 'https://your-cdn.com/hand_landmarker.task',
});
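
The precedence described above can be summarized as: explicit modelPath first, then the first argument (catalog ID or URL), then the factory's catalog default. The sketch below illustrates that reading only; it is not the library's implementation, resolveModelPath is a hypothetical helper, and the catalog URLs are placeholders rather than the real CDN paths.

```typescript
interface ModelSettings {
  modelPath?: string;
}

// Hypothetical catalog with placeholder URLs (not the real Google CDN locations).
const CATALOG = new Map<string, string>([
  ['hand_landmarker', 'https://example.invalid/hand_landmarker.task'],
  ['pose_landmarker', 'https://example.invalid/pose_landmarker.task'],
  ['pose_landmarker_full', 'https://example.invalid/pose_landmarker_full.task'],
]);

function resolveModelPath(defaultId: string, idOrUrl?: string, settings: ModelSettings = {}): string {
  // 1. An explicit settings.modelPath always wins.
  if (settings.modelPath) return settings.modelPath;
  // 2. Otherwise use the first argument, falling back to the factory's catalog default.
  const key = idOrUrl ?? defaultId;
  // 3. A known catalog ID maps to its URL; anything else is treated as a direct URL.
  return CATALOG.get(key) ?? key;
}

console.log(resolveModelPath('pose_landmarker'));
// https://example.invalid/pose_landmarker.task
console.log(resolveModelPath('pose_landmarker', 'pose_landmarker_full'));
// https://example.invalid/pose_landmarker_full.task
console.log(resolveModelPath('hand_landmarker', 'hand_landmarker', {
  modelPath: 'https://your-cdn.com/hand_landmarker.task',
}));
// https://your-cdn.com/hand_landmarker.task
```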

Provider Configuration

Custom provider

Use createMediaPipe() for a provider instance with custom settings, or import the default mediapipe singleton.

import { createMediaPipe } from '@localmode/mediapipe';

const myMediaPipe = createMediaPipe({
  delegate: 'CPU',
  wasmBasePath: '/wasm/mediapipe',
});

const model = myMediaPipe.handLandmarker();

Per-model settings

Every factory method accepts a second settings argument that overrides the provider defaults for that model only:

const model = mediapipe.handLandmarker('hand_landmarker', {
  modelPath: 'https://your-cdn.com/hand_landmarker.task',
  delegate: 'CPU',
  wasmBasePath: '/wasm/mediapipe/vision',
});

WASM Runtime

MediaPipe Tasks ships a WebAssembly runtime — one binary set per domain (vision, audio, text). By default, @localmode/mediapipe loads it from the jsDelivr CDN, so no extra setup is needed.

To run fully offline, self-host the runtime. Copy the wasm directory from each @mediapipe/tasks-* package into your public assets and point wasmBasePath at it:

const myMediaPipe = createMediaPipe({
  wasmBasePath: {
    vision: '/wasm/tasks-vision',
    audio: '/wasm/tasks-audio',
    text: '/wasm/tasks-text',
  },
});
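
The copy step can be planned in a small build script. The sketch below only computes the source and destination paths; it assumes the three @mediapipe/tasks-* packages are reachable under the project's top-level node_modules (with strict package managers they may need to be installed directly), and the public directory name and planWasmCopies helper are illustrative.

```typescript
type Domain = 'vision' | 'audio' | 'text';

interface CopyStep {
  domain: Domain;
  from: string; // source directory inside node_modules
  to: string; // destination directory on disk
  urlPath: string; // value to use for this domain in wasmBasePath
}

// Build the copy plan for all three runtimes. publicDir and urlPrefix are illustrative defaults.
function planWasmCopies(publicDir = 'public', urlPrefix = '/wasm'): CopyStep[] {
  const domains: Domain[] = ['vision', 'audio', 'text'];
  return domains.map((domain) => {
    const urlPath = `${urlPrefix}/tasks-${domain}`;
    return {
      domain,
      from: `node_modules/@mediapipe/tasks-${domain}/wasm`,
      to: `${publicDir}${urlPath}`,
      urlPath,
    };
  });
}

for (const step of planWasmCopies()) {
  console.log(`${step.from} -> ${step.to}`);
}
// node_modules/@mediapipe/tasks-vision/wasm -> public/wasm/tasks-vision
// node_modules/@mediapipe/tasks-audio/wasm -> public/wasm/tasks-audio
// node_modules/@mediapipe/tasks-text/wasm -> public/wasm/tasks-text
```

In a Node build script, each step could then be executed with fs.cpSync(step.from, step.to, { recursive: true }), and the urlPath values match the wasmBasePath entries shown above.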

Isolate concurrent audio and vision tasks

The MediaPipe audio and vision WASM runtimes can conflict if run concurrently in the same thread (mediapipe#4737). If your app uses audio classification and a vision task at the same time, run one of them in a Web Worker so each runtime has its own thread. Sequential use (one finishing before the next starts) is unaffected.
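
Where a Web Worker is not practical and the calls are single-frame rather than streaming, one option is to force sequential use with a small promise queue. This is a sketch under the assumption that avoiding overlapping calls is enough to sidestep the conflict; createSerialQueue is a hypothetical helper, and the two tasks below are stand-ins for real vision and audio calls.

```typescript
// Runs async tasks strictly one at a time, in the order they were enqueued.
function createSerialQueue() {
  let tail: Promise<unknown> = Promise.resolve();
  return function enqueue<T>(task: () => Promise<T>): Promise<T> {
    // Start this task only after the previous one has settled.
    const result = tail.then(() => task());
    // Swallow failures on the internal chain so one failed task does not block later ones.
    tail = result.catch(() => undefined);
    return result;
  };
}

const order: string[] = [];
const enqueue = createSerialQueue();
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Stand-ins for a vision call and an audio call issued at the same moment.
const done = Promise.all([
  enqueue(async () => { order.push('vision:start'); await sleep(20); order.push('vision:end'); }),
  enqueue(async () => { order.push('audio:start'); await sleep(5); order.push('audio:end'); }),
]).then(() => console.log(order.join(' ')));
// vision:start vision:end audio:start audio:end
```

The caller still receives each task's own result or rejection through the promise returned by enqueue. A Worker remains the approach recommended above for tasks that must run at the same time, such as a streaming tracker alongside audio classification.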

Browser Compatibility

MediaPipe Tasks runs on pure WebAssembly with a WebGL GPU delegate — it does not require WebGPU.

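| Browser | WASM | WebGL (GPU delegate) | Notes |
| --- | --- | --- | --- |
| Chrome 80+ | Yes | Yes | Full support |
| Edge 80+ | Yes | Yes | Full support |
| Firefox 75+ | Yes | Yes | Full support |
| Safari 14+ | Yes | Yes | Full support; module workers 16.4+ |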

If the WebGL GPU delegate is unavailable, set delegate: 'CPU' to fall back to CPU inference.
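
The delegate can also be chosen at startup by probing for WebGL. The sketch below is illustrative only: it treats a successful WebGL2 context as a proxy for GPU delegate availability, which is an assumption, and it assumes 'GPU' is the counterpart of the documented 'CPU' value. Both helper names are hypothetical.

```typescript
type Delegate = 'GPU' | 'CPU';

// Returns true when a WebGL2 context can be created; returns false outside a browser.
function canCreateWebGLContext(): boolean {
  const doc = (globalThis as unknown as {
    document?: { createElement(tag: string): unknown };
  }).document;
  if (!doc) return false;
  try {
    const canvas = doc.createElement('canvas') as { getContext(kind: string): unknown };
    return canvas.getContext('webgl2') != null;
  } catch {
    return false;
  }
}

// The probe is injectable so the decision logic can be exercised without a browser.
function chooseDelegate(probe: () => boolean = canCreateWebGLContext): Delegate {
  return probe() ? 'GPU' : 'CPU';
}

console.log(chooseDelegate(() => true)); // GPU
console.log(chooseDelegate(() => false)); // CPU
```

The result could then be passed as the delegate option, for example createMediaPipe({ delegate: chooseDelegate() }).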

Error Handling

import { detectHands, ModelLoadError, VisionError, ValidationError } from '@localmode/core';
import { mediapipe } from '@localmode/mediapipe';

try {
  const { hands } = await detectHands({
    model: mediapipe.handLandmarker(),
    image: imageBlob,
  });
} catch (error) {
  if (error instanceof ModelLoadError) {
    console.error('Failed to load model:', error.hint);
  } else if (error instanceof VisionError) {
    console.error('Detection failed:', error.hint);
  } else if (error instanceof ValidationError) {
    console.error('Invalid input:', error.hint);
  } else {
    // Not one of the handled error types: rethrow instead of silently swallowing it
    throw error;
  }
}

Single-frame functions retry transient failures up to maxRetries (default 2) before throwing. Streaming trackers report per-frame errors through the onError callback instead of throwing.

Next Steps

Showcase App

| App | Description | Links |
| --- | --- | --- |
| MediaPipe Studio | Live hand, pose, face, and gesture tracking from your webcam | Demo · Source |
