
WhisperX

Fast, word-level ASR with speaker diarization and 70× realtime speed

WhisperX delivers rapid automatic speech recognition with precise word-level timestamps, speaker diarization, and GPU-efficient batching, running the large-v2 model on modest hardware.

Overview

Highlights

70× realtime transcription with large‑v2 model via batched inference
Word‑level timestamps using wav2vec2 forced alignment
Multispeaker diarization with pyannote‑audio and speaker ID labels
GPU‑efficient runtime requiring <8 GB memory for large‑v2
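The 70× figure comes from batched inference: rather than decoding one window at a time, WhisperX cuts the audio into chunks and transcribes many of them in a single batched forward pass. A minimal sketch of the chunking arithmetic, assuming fixed 30-second windows for illustration (the real pipeline derives chunk boundaries from voice-activity detection):

```python
# Simplified chunking for batched inference: split an audio stream into
# fixed-length windows that can be transcribed together in one batch.
# (WhisperX itself derives chunk boundaries from voice-activity detection;
# fixed 30 s windows are an assumption for illustration.)

def chunk_boundaries(duration_s: float, window_s: float = 30.0):
    """Return (start, end) times covering the full duration."""
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks

def batches(chunks, batch_size: int = 8):
    """Group chunks so each batch is decoded in one forward pass."""
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]

if __name__ == "__main__":
    chunks = chunk_boundaries(95.0)           # 95 s of audio -> 4 windows
    print(chunks)                             # [(0.0, 30.0), (30.0, 60.0), (60.0, 90.0), (90.0, 95.0)]
    print(len(batches(chunks, batch_size=2))) # 2 batches
```

Larger batch sizes trade GPU memory for throughput, which is why the speedup holds even within the <8 GB budget quoted above.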

Pros

  • Extremely fast inference suitable for real‑time applications
  • High timestamp precision improves subtitle and analysis quality
  • Built‑in speaker diarization provides named speaker segments
  • Low GPU memory footprint enables use on consumer‑grade GPUs

Considerations

  • Requires CUDA 12.8 and a compatible GPU for optimal performance
  • Speaker diarization depends on external Hugging Face models and tokens
  • Alignment model adds extra memory and compute overhead
  • Limited to languages supported by the underlying Whisper model

Managed products teams compare with

When teams consider WhisperX, these hosted platforms usually appear on the same shortlist.


Otter.ai

AI meeting assistant for transcription and automated note-taking


SuperWhisper

Real-time transcription and translation API


Willow

Voice AI and speech recognition technology

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Real‑time meeting transcription and captioning
  • Podcast or lecture indexing with accurate timestamps
  • Multilingual content where speaker attribution is needed
  • Developers integrating ASR into custom pipelines

Not ideal when

  • Environments without GPU or CUDA support
  • Low‑resource devices lacking 8 GB GPU memory
  • Use cases requiring on‑device inference on mobile phones
  • Languages not covered by Whisper’s training data

How teams use it

Live meeting transcription

Generate near‑real‑time captions with speaker names for Zoom, Teams, or Google Meet recordings.

Podcast post‑production

Create word‑accurate transcripts and speaker labels to streamline editing and searchable archives.

Academic lecture indexing

Produce timestamped transcripts for automatic subtitle generation and topic navigation.

AI research data preparation

Batch‑process large audio corpora with precise alignment for training downstream models.

Tech snapshot

Python: 100%

Tags

whisper, speech-recognition, asr, speech, speech-to-text

Frequently asked questions

What hardware is needed for optimal performance?

A GPU with CUDA 12.8 and at least 8 GB of memory; CPU‑only works but is much slower.

How does WhisperX achieve word‑level timestamps?

It uses forced phoneme alignment with a wav2vec2 ASR model after the initial Whisper transcription.
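The idea can be illustrated with a toy version of forced alignment: given per-frame scores for each token (what the alignment model emits), a dynamic program finds the best monotonic frame-to-token path, and the first and last frame of each token's span become its timestamps. A simplified sketch with a hypothetical score matrix; the real pipeline aligns phonemes against wav2vec2 CTC emissions:

```python
# Toy forced alignment: find the best monotonic assignment of frames to
# tokens, then read off each token's start/end frame as its timestamp.
# `scores[t][k]` is a hypothetical log-score of token k at frame t;
# WhisperX's actual alignment uses wav2vec2 CTC emissions over phonemes.

def force_align(scores):
    """Return a list of (start_frame, end_frame) per token, inclusive."""
    T, K = len(scores), len(scores[0])
    NEG = float("-inf")
    # best[t][k]: score of the best path ending at frame t on token k
    best = [[NEG] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]   # 0 = stayed on token, 1 = advanced
    best[0][0] = scores[0][0]
    for t in range(1, T):
        for k in range(K):
            stay = best[t - 1][k]
            adv = best[t - 1][k - 1] if k > 0 else NEG
            if adv > stay:
                best[t][k], back[t][k] = adv + scores[t][k], 1
            else:
                best[t][k], back[t][k] = stay + scores[t][k], 0
    # Backtrace from the final frame on the last token.
    spans = [[T - 1, T - 1] for _ in range(K)]
    k = K - 1
    for t in range(T - 1, 0, -1):
        if back[t][k]:
            spans[k][0] = t
            k -= 1
            spans[k][1] = t - 1
    spans[0][0] = 0
    return [tuple(s) for s in spans]
```

Multiplying each frame index by the model's frame duration (roughly 20 ms for wav2vec2) converts these spans into the word-level start/end times that WhisperX reports.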

Is speaker diarization automatic?

Yes, when you provide a Hugging Face token for the required pyannote‑audio models; the system assigns speaker IDs.
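Once words are aligned, speaker labeling reduces to an interval problem: each word takes the label of the diarization turn that overlaps it the most. A simplified sketch of that assignment step (the data shapes and helper below are illustrative, not the library's API):

```python
# Assign each word the speaker whose diarization turn overlaps it most.
# `words` are (start, end, text) tuples; `turns` are (start, end, speaker).
# Both structures are illustrative, not WhisperX's actual data format.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, turns):
    labeled = []
    for w_start, w_end, text in words:
        best = max(turns, key=lambda t: overlap(w_start, w_end, t[0], t[1]),
                   default=None)
        has_overlap = best and overlap(w_start, w_end, best[0], best[1]) > 0
        labeled.append((text, best[2] if has_overlap else None))
    return labeled

if __name__ == "__main__":
    turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 5.0, "SPEAKER_01")]
    words = [(0.1, 0.5, "hello"), (1.8, 2.4, "there"), (3.0, 3.5, "friend")]
    print(assign_speakers(words, turns))
```

Words with no overlapping turn (silence gaps, diarization misses) are left unlabeled here; a production pipeline would typically fall back to the nearest turn instead.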

Can I run WhisperX without installing CUDA?

You can run on CPU, but you will lose the 70× speed advantage and experience longer processing times.

Which Whisper models are supported?

WhisperX works with the standard Whisper models (base, small, large, large‑v2); larger models improve accuracy but require more GPU memory.

Project at a glance

Status: Active
Stars: 19,722
Watchers: 19,722
Forks: 2,114
License: BSD-2-Clause
Repo age: 3 years
Last commit: 3 months ago
Primary language: Python
