Overview
MediaPipe Tasks provider for LocalMode — hand, pose, and face landmark detection, gesture recognition, audio classification, language detection, and more via Google's on-device WASM runtime. Works in every browser.
@localmode/mediapipe
Run Google's MediaPipe Tasks directly in the browser. @localmode/mediapipe wraps @mediapipe/tasks-vision, @mediapipe/tasks-audio, and @mediapipe/tasks-text as a single unified LocalMode provider — hand, pose, and face landmarks, gesture recognition, image and audio classification, image embeddings, language detection, and text embeddings, plus real-time streaming trackers for video.
Privacy-first by design
Every task runs entirely on-device via WebAssembly. The camera, microphone, and text never leave the browser. The only network requests are the one-time model and WASM downloads.
Features
- 13-model catalog: landmarks, gestures, classification, detection, segmentation, embeddings, and language detection, all curated and verified against Google's CDN
- Real-time streaming: createHandTracker / createPoseTracker / createFaceTracker / createGestureTracker run over a <video> element at 30–60fps
- Universal browser support: pure WebAssembly + WebGL; no WebGPU required
- Tiny models: most are under 10MB; face detection and selfie segmentation are ~250KB
- Unified LocalMode interface: landmark tasks call detectHands(), detectPose(), etc.; classification/detection/embedding tasks reuse the existing classifyImage(), detectObjects(), embed() functions
- AbortSignal cancellation: every single-frame function supports cancellation
- GPU or CPU delegate: choose the WebGL GPU delegate (default) or the CPU delegate per provider or per model
What is not included
Two MediaPipe capabilities are not covered. Audio embeddings are unavailable because the @mediapipe/tasks-audio package ships only an AudioClassifier; the JS SDK has no audio embedder. The on-device LLM package, @mediapipe/tasks-genai, is intentionally not wrapped; use @localmode/litert or @localmode/wllama for local language model inference instead.
Installation
```bash
# Pick the command for your package manager
pnpm install @localmode/mediapipe @localmode/core
npm install @localmode/mediapipe @localmode/core
yarn add @localmode/mediapipe @localmode/core
bun add @localmode/mediapipe @localmode/core
```

The package depends on @mediapipe/tasks-vision, @mediapipe/tasks-audio, and @mediapipe/tasks-text, all installed automatically. The WASM runtime loads from the jsDelivr CDN by default; see WASM Runtime to self-host it.
Quick Start
Single-frame detection
Detect hand landmarks in a still image:
```typescript
import { detectHands } from '@localmode/core';
import { mediapipe } from '@localmode/mediapipe';

const { hands } = await detectHands({
  model: mediapipe.handLandmarker(),
  image: imageBlob,
  numHands: 2,
});

for (const hand of hands) {
  console.log(`${hand.handedness} hand — ${hand.landmarks.length} landmarks`);
}
```

Real-time streaming
Track hands live from a webcam <video> element:
```typescript
import { mediapipe } from '@localmode/mediapipe';

const tracker = mediapipe.createHandTracker({
  video: videoElement,
  numHands: 2,
  onResults: (hands, timestampMs) => {
    // Called once per processed frame (up to ~60fps)
    drawHands(hands);
  },
});

await tracker.start();

// later
tracker.stop();
await tracker.close();
```

Tasks
The provider exposes a factory method per task. Landmark and gesture tasks return MediaPipe-specific model interfaces used with new core functions; the remaining tasks return standard LocalMode interfaces and reuse existing core functions.
Landmarks & Gestures
| Method | Interface | Core function | Docs |
|---|---|---|---|
| mediapipe.handLandmarker() | HandLandmarkModel | detectHands() | Guide |
| mediapipe.poseLandmarker() | PoseLandmarkModel | detectPose() | Guide |
| mediapipe.faceLandmarker() | FaceLandmarkModel | detectFaceLandmarks() | Guide |
| mediapipe.faceDetector() | FaceDetectionModel | detectFace() | Guide |
| mediapipe.gestureRecognizer() | GestureRecognitionModel | recognizeGesture() | Guide |
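
Landmark results are usually easier to draw once converted to pixel coordinates. The sketch below assumes each landmark is an object with x and y normalized to the 0 to 1 range relative to the image, which is how MediaPipe Tasks reports image-space landmarks. The `Landmark` type and the `toPixelCoords` helper are hypothetical names for illustration, not part of @localmode/core.

```typescript
// Hypothetical shape: assumes normalized image-space landmarks (x, y in 0..1).
interface Landmark {
  x: number;
  y: number;
  z?: number;
}

interface PixelPoint {
  x: number;
  y: number;
}

// Scale normalized landmarks to pixel coordinates for a canvas of the given size.
function toPixelCoords(landmarks: Landmark[], width: number, height: number): PixelPoint[] {
  return landmarks.map((lm) => ({
    x: Math.round(lm.x * width),
    y: Math.round(lm.y * height),
  }));
}

// Example: a landmark at the horizontal center, one quarter down a 640x480 frame.
const points = toPixelCoords([{ x: 0.5, y: 0.25, z: 0 }], 640, 480);
console.log(points); // x: 320, y: 120
```

If the wrapper reports landmarks in a different coordinate space (for example, world coordinates in meters), this scaling does not apply.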
Vision (standard core interfaces)
| Method | Interface | Core function |
|---|---|---|
| mediapipe.imageClassifier() | ImageClassificationModel | classifyImage() |
| mediapipe.objectDetector() | ObjectDetectionModel | detectObjects() |
| mediapipe.imageSegmenter() | SegmentationModel | segmentImage() |
| mediapipe.imageEmbedder() | ImageFeatureModel | extractImageFeatures() |
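
Feature vectors from an image embedder are typically compared with cosine similarity. The helper below is a generic sketch over plain numeric arrays. It assumes you have already pulled the vectors out of the result (the exact result shape of extractImageFeatures() is not shown on this page), and `cosineSimilarity` is a hypothetical name.

```typescript
// Cosine similarity between two equal-length vectors; returns a value in [-1, 1].
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) {
    throw new Error(`Vector length mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as having no similarity to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1
console.log(cosineSimilarity([1, 0], [0, 1])); // 0
```

Because the parameters are typed as ArrayLike<number>, the same function accepts both number[] and Float32Array inputs.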
Audio & Text
| Method | Interface | Core function | Docs |
|---|---|---|---|
| mediapipe.audioClassifier() | AudioClassificationModel | classifyAudio() | Guide |
| mediapipe.textEmbedder() | EmbeddingModel | embed() / embedMany() | Guide |
| mediapipe.languageDetector() | LanguageDetectionModel | detectLanguage() | Guide |
| mediapipe.textClassifier(modelPath) | ClassificationModel | classify() | Guide |
textClassifier requires a custom model
MediaPipe ships no default text classifier — mediapipe.textClassifier() requires an explicit custom-trained .tflite model URL (built with MediaPipe Model Maker). Calling it without a path throws a ValidationError. See the Text guide.
Streaming trackers
| Factory | Per-frame callback | Docs |
|---|---|---|
| mediapipe.createHandTracker() | (hands: HandLandmarkResultItem[], timestampMs) | Guide |
| mediapipe.createPoseTracker() | (poses: PoseLandmarkResultItem[], timestampMs) | Guide |
| mediapipe.createFaceTracker() | (faces: FaceLandmarkResultItem[], timestampMs) | Guide |
| mediapipe.createGestureTracker() | (gestures: GestureResultItem[], timestampMs) | Guide |
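
Per-frame callbacks can fire up to roughly 60 times per second, which may be more than downstream work (heavy drawing, state updates, network messages) needs. One option is to throttle using the timestampMs argument. The wrapper below is a sketch that only assumes the (items, timestampMs) callback shape listed in the table; `throttleFrames` is a hypothetical helper, not part of @localmode/mediapipe.

```typescript
type FrameCallback<T> = (items: T[], timestampMs: number) => void;

// Wrap a per-frame callback so it runs at most once every `minIntervalMs`,
// measured with the frame timestamps rather than wall-clock time.
function throttleFrames<T>(callback: FrameCallback<T>, minIntervalMs: number): FrameCallback<T> {
  let lastRun = Number.NEGATIVE_INFINITY;
  return (items, timestampMs) => {
    if (timestampMs - lastRun < minIntervalMs) return; // drop this frame
    lastRun = timestampMs;
    callback(items, timestampMs);
  };
}

// Example: feed 60 simulated frames (16ms apart) into a callback limited to one call per 100ms.
const seen: number[] = [];
const onResults = throttleFrames<string>((_items, t) => {
  seen.push(t);
}, 100);
for (let frame = 0; frame < 60; frame++) {
  onResults(['hand'], frame * 16);
}
console.log(seen.length); // 9
```

With a real tracker, the wrapped function would be passed as the onResults option.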
Model Catalog
MEDIAPIPE_MODELS ships 13 curated models. Every model is verified against Google's public CDN (storage.googleapis.com). Vision models use the .task bundle format; audio and text models are raw .tflite files.
```typescript
import { MEDIAPIPE_MODELS } from '@localmode/mediapipe';
```

| Catalog ID | Model | Domain | Size |
|---|---|---|---|
| hand_landmarker | Hand Landmarker | vision | 7.8MB |
| pose_landmarker | Pose Landmarker (Lite) | vision | 5.8MB |
| pose_landmarker_full | Pose Landmarker (Full) | vision | 9.4MB |
| face_landmarker | Face Landmarker | vision | 3.8MB |
| face_detector | Face Detector (BlazeFace) | vision | 230KB |
| gesture_recognizer | Gesture Recognizer | vision | 8.4MB |
| image_classifier | Image Classifier (EfficientNet-Lite0) | vision | 18.6MB |
| object_detector | Object Detector (EfficientDet-Lite0) | vision | 7.3MB |
| image_segmenter | Image Segmenter (Selfie) | vision | 250KB |
| image_embedder | Image Embedder (MobileNet-V3 Small) | vision | 4.1MB |
| audio_classifier | Audio Classifier (YAMNet) | audio | 4.1MB |
| language_detector | Language Detector | text | 315KB |
| text_embedder | Text Embedder (Universal Sentence Encoder) | text | 6.1MB |
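
The sizes in the table can be used to estimate how much a feature set will download on first use. The sketch below copies a few sizes from the table into a local map (in KB, treating 1MB as 1000KB). It does not read MEDIAPIPE_MODELS, whose entry shape is not shown on this page, and `estimateDownloadKB` is a hypothetical helper.

```typescript
// Sizes copied from the catalog table, in KB (1MB treated as 1000KB).
const MODEL_SIZES_KB: Record<string, number> = {
  hand_landmarker: 7800,
  pose_landmarker: 5800,
  face_detector: 230,
  gesture_recognizer: 8400,
  image_segmenter: 250,
  language_detector: 315,
};

// Sum the one-time model download size for a set of catalog IDs.
function estimateDownloadKB(ids: string[]): number {
  return ids.reduce((total, id) => {
    const size = MODEL_SIZES_KB[id];
    if (size === undefined) throw new Error(`Unknown catalog ID: ${id}`);
    return total + size;
  }, 0);
}

console.log(estimateDownloadKB(['hand_landmarker', 'face_detector'])); // 8030
```

The estimate covers model files only; the WASM runtime download is separate.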
Each factory method uses its catalog default. Pass a catalog ID or a direct model URL to override:
```typescript
// Catalog default
const lite = mediapipe.poseLandmarker();

// Catalog ID — higher-accuracy pose model
const full = mediapipe.poseLandmarker('pose_landmarker_full');

// Direct URL to any compatible MediaPipe model file
const custom = mediapipe.handLandmarker('https://your-cdn.com/hand_landmarker.task');
```

You can also set an explicit modelPath in the per-model settings, which always takes precedence:

```typescript
const model = mediapipe.handLandmarker('hand_landmarker', {
  modelPath: 'https://your-cdn.com/hand_landmarker.task',
});
```

Provider Configuration
Custom provider
Use createMediaPipe() for a provider instance with custom settings, or import the default mediapipe singleton.
```typescript
import { createMediaPipe } from '@localmode/mediapipe';

const myMediaPipe = createMediaPipe({
  delegate: 'CPU',
  wasmBasePath: '/wasm/mediapipe',
});

const model = myMediaPipe.handLandmarker();
```
Per-model settings
Every factory method accepts a second settings argument that overrides the provider defaults for that model only:
```typescript
const model = mediapipe.handLandmarker('hand_landmarker', {
  modelPath: 'https://your-cdn.com/hand_landmarker.task',
  delegate: 'CPU',
  wasmBasePath: '/wasm/mediapipe/vision',
});
```
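
The precedence described here and in the Model Catalog section can be summarized as: per-model settings override provider defaults, and an explicit modelPath overrides the first argument. The sketch below illustrates that ordering with plain objects. It is not the library's implementation: `resolveSettings`, the `Settings` types, and the `CATALOG_URLS` placeholder URLs are hypothetical, and treating any string that starts with http(s):// or / as a direct URL is an assumption.

```typescript
type Delegate = 'GPU' | 'CPU';

interface ProviderSettings {
  delegate?: Delegate;
  wasmBasePath?: string;
}

interface ModelSettings extends ProviderSettings {
  modelPath?: string;
}

interface ResolvedSettings {
  modelPath: string;
  delegate: Delegate;
  wasmBasePath?: string;
}

// Placeholder URLs for illustration only; the real catalog URLs are not shown here.
const CATALOG_URLS: Record<string, string> = {
  pose_landmarker: 'https://example.com/models/pose_landmarker_lite.task',
  pose_landmarker_full: 'https://example.com/models/pose_landmarker_full.task',
};

function resolveSettings(
  defaultId: string,
  providerDefaults: ProviderSettings,
  modelIdOrUrl?: string,
  modelSettings: ModelSettings = {},
): ResolvedSettings {
  // Per-model settings win over provider defaults, key by key.
  const merged: ModelSettings = { ...providerDefaults, ...modelSettings };

  // Model source: explicit modelPath, then the first argument, then the catalog default.
  const source = modelIdOrUrl ?? defaultId;
  const isUrl = /^(https?:)?\//.test(source);
  const fromArgument: string | undefined = isUrl ? source : CATALOG_URLS[source];
  const modelPath = merged.modelPath ?? fromArgument;
  if (!modelPath) throw new Error(`Unknown catalog ID: ${source}`);

  // The page describes the GPU delegate as the default.
  return { ...merged, modelPath, delegate: merged.delegate ?? 'GPU' };
}

const resolved = resolveSettings(
  'pose_landmarker',
  { delegate: 'GPU', wasmBasePath: '/wasm/mediapipe' },
  'pose_landmarker_full',
  { delegate: 'CPU' },
);
console.log(resolved.modelPath, resolved.delegate);
```

In this example the per-model delegate ('CPU') replaces the provider default, while wasmBasePath is inherited from the provider because the model settings do not set it.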
WASM Runtime
MediaPipe Tasks ships a WebAssembly runtime — one binary set per domain (vision, audio, text). By default, @localmode/mediapipe loads it from the jsDelivr CDN, so no extra setup is needed.
To run fully offline, self-host the runtime. Copy the wasm directory from each @mediapipe/tasks-* package into your public assets and point wasmBasePath at it:
```typescript
const myMediaPipe = createMediaPipe({
  wasmBasePath: {
    vision: '/wasm/tasks-vision',
    audio: '/wasm/tasks-audio',
    text: '/wasm/tasks-text',
  },
});
```

Isolate concurrent audio and vision tasks
The MediaPipe audio and vision WASM runtimes can conflict if run concurrently in the same thread (mediapipe#4737). If your app uses audio classification and a vision task at the same time, run one of them in a Web Worker so each runtime has its own thread. Sequential use (one finishing before the next starts) is unaffected.
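
When moving one runtime into a Web Worker is not practical, another option is to make sure audio and vision calls never overlap, since sequential use is unaffected. The sketch below is a small promise-based queue that runs async tasks one at a time. `createSerialQueue` is a hypothetical helper, not part of @localmode/mediapipe, and the two simulated tasks stand in for real classifyAudio() and detectHands() calls.

```typescript
// Returns a function that runs async tasks strictly one after another.
function createSerialQueue() {
  let tail: Promise<unknown> = Promise.resolve();
  return function enqueue<T>(task: () => Promise<T>): Promise<T> {
    // Start the task only after the previous one has settled.
    const result = tail.then(() => task());
    // Swallow rejections on the chain so one failure does not block later tasks.
    tail = result.catch(() => undefined);
    return result;
  };
}

// Simulated work standing in for an audio task and a vision task.
const events: string[] = [];
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));
async function fakeTask(name: string, ms: number): Promise<string> {
  events.push(`${name}:start`);
  await sleep(ms);
  events.push(`${name}:end`);
  return name;
}

const enqueue = createSerialQueue();
const allDone = Promise.all([
  enqueue(() => fakeTask('audio', 20)),
  enqueue(() => fakeTask('vision', 5)),
]).then(() => {
  console.log(events.join(' ')); // audio:start audio:end vision:start vision:end
});
```

Callers still receive each task's own result or rejection; only the start of the next task is delayed. This serializes work on one thread, so it trades throughput for safety compared with the worker approach.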
Browser Compatibility
MediaPipe Tasks runs on pure WebAssembly with a WebGL GPU delegate — it does not require WebGPU.
| Browser | WASM | WebGL (GPU delegate) | Notes |
|---|---|---|---|
| Chrome 80+ | ✅ | ✅ | Full support |
| Edge 80+ | ✅ | ✅ | Full support |
| Firefox 75+ | ✅ | ✅ | Full support |
| Safari 14+ | ✅ | ✅ | Full support; module workers 16.4+ |
If the WebGL GPU delegate is unavailable, set delegate: 'CPU' to fall back to CPU inference.
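
One way to choose the delegate at startup is to probe for a WebGL context before creating the provider. The sketch below takes a canvas factory so it can run outside a browser; in a page you would pass () => document.createElement('canvas'). `pickDelegate` and `CanvasLike` are hypothetical names, and probing for 'webgl2' and then 'webgl' is an assumption about what the GPU delegate needs, so treat the result as a hint rather than a guarantee.

```typescript
type Delegate = 'GPU' | 'CPU';

// Minimal subset of HTMLCanvasElement used for the probe.
interface CanvasLike {
  getContext(contextId: string): unknown;
}

// Returns 'GPU' when a WebGL context can be created, otherwise 'CPU'.
function pickDelegate(createCanvas: () => CanvasLike): Delegate {
  try {
    const canvas = createCanvas();
    const gl = canvas.getContext('webgl2') ?? canvas.getContext('webgl');
    return gl ? 'GPU' : 'CPU';
  } catch {
    // Canvas creation or context lookup threw, e.g. in a non-DOM environment.
    return 'CPU';
  }
}

// Simulated environments: one with WebGL 1 only, one without WebGL.
const withWebGL: CanvasLike = { getContext: (id) => (id === 'webgl' ? {} : null) };
const withoutWebGL: CanvasLike = { getContext: () => null };

console.log(pickDelegate(() => withWebGL)); // GPU
console.log(pickDelegate(() => withoutWebGL)); // CPU
```

The returned value could then be passed as the delegate option to createMediaPipe() or to an individual factory method.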
Error Handling
```typescript
import { detectHands, ModelLoadError, VisionError, ValidationError } from '@localmode/core';
import { mediapipe } from '@localmode/mediapipe';

try {
  const { hands } = await detectHands({
    model: mediapipe.handLandmarker(),
    image: imageBlob,
  });
} catch (error) {
  if (error instanceof ModelLoadError) {
    console.error('Failed to load model:', error.hint);
  } else if (error instanceof VisionError) {
    console.error('Detection failed:', error.hint);
  } else if (error instanceof ValidationError) {
    console.error('Invalid input:', error.hint);
  } else {
    // Not one of the handled error types: let it propagate
    throw error;
  }
}
```

Single-frame functions retry transient failures up to maxRetries (default 2) before throwing. Streaming trackers report per-frame errors through the onError callback instead of throwing.
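
If you need a retry policy different from the built-in one (for example, a delay between attempts), a wrapper along these lines is one option. This is a generic sketch and not how @localmode/core implements maxRetries. `withRetries` is a hypothetical helper, and it assumes maxRetries counts additional attempts after the first.

```typescript
// Run `fn`, retrying up to `maxRetries` extra times with an optional fixed delay.
async function withRetries<T>(
  fn: (attempt: number) => Promise<T>,
  maxRetries = 2,
  delayMs = 0,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      lastError = error;
      // Wait before the next attempt, but not after the final one.
      if (attempt < maxRetries && delayMs > 0) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}

// Example: a flaky operation that fails twice, then succeeds on the third attempt.
let calls = 0;
const flaky = async (): Promise<string> => {
  calls++;
  if (calls < 3) throw new Error(`transient failure ${calls}`);
  return 'ok';
};

const outcome = withRetries(flaky, 2).then((value) => {
  console.log(value, calls); // ok 3
  return value;
});
```

A production version would usually retry only errors known to be transient and rethrow validation errors immediately.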
Next Steps
Hand Tracking
21-point hand landmark detection and the HAND_CONNECTIONS drawing helper.
Real-Time Streaming
Run hand, pose, face, and gesture tracking live over a video element.
Gesture Recognition
Recognize 8 built-in hand gestures with landmark output.
Core Vision
API reference for the vision functions shared across providers.
Showcase App