Documentation (English)

Speech Recognition - Whisper

Automatic speech recognition using Whisper Large-v2 on Common Voice

This case study demonstrates fine-tuning OpenAI's Whisper Large-v2 model for automatic speech recognition (ASR). Whisper is a transformer-based model trained on 680,000 hours of multilingual audio, offering state-of-the-art transcription accuracy with robust performance across accents, background noise, and technical language.

Dataset: Common Voice 11.0

  • Source: HuggingFace (mozilla-foundation/common_voice_11_0)
  • Type: Speech-to-text
  • Size: 5,000 audio clips (English)
  • Duration: ~8 hours of speech
  • Format: MP3, 16kHz sampling rate
  • Speakers: Diverse ages, genders, accents
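Common Voice clips ship as MP3 at varying native rates, while Whisper expects 16 kHz input. A minimal resampling sketch using linear interpolation (an illustrative assumption; a real pipeline would use a proper anti-aliasing resampler such as `torchaudio` or `scipy.signal.resample_poly`):

```python
import numpy as np

def resample_linear(audio, src_rate, dst_rate):
    """Resample 1-D audio to dst_rate via linear interpolation.

    A crude sketch: no anti-aliasing filter is applied, so
    downsampling real audio this way can alias high frequencies.
    """
    duration = len(audio) / src_rate
    n_out = int(round(duration * dst_rate))
    t_src = np.arange(len(audio)) / src_rate
    t_dst = np.arange(n_out) / dst_rate
    return np.interp(t_dst, t_src, audio)

# One second of a 440 Hz tone at 48 kHz, resampled to Whisper's 16 kHz.
tone = np.sin(2 * np.pi * 440 * np.arange(48000) / 48000)
resampled = resample_linear(tone, 48000, 16000)
print(len(resampled))  # 16000
```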

Model Configuration

{
  "model": "whisper_large_v2",
  "category": "audio",
  "subcategory": "asr",
  "model_config": {
    "model_size": "large-v2",
    "language": "en",
    "task": "transcribe",
    "batch_size": 8,
    "learning_rate": 0.00001,
    "sampling_rate": 16000,
    "warmup_steps": 500,
    "max_steps": 5000
  }
}
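The configuration above could be wired into a HuggingFace-style training run. The sketch below only parses the config and maps its fields onto `Seq2SeqTrainingArguments`-style keyword names; the mapping is an illustrative assumption about the training stack, not something this document specifies:

```python
import json

# The model_config block from this document, verbatim.
config = json.loads("""{
  "model_size": "large-v2",
  "language": "en",
  "task": "transcribe",
  "batch_size": 8,
  "learning_rate": 0.00001,
  "sampling_rate": 16000,
  "warmup_steps": 500,
  "max_steps": 5000
}""")

# Hypothetical mapping onto HuggingFace Seq2SeqTrainingArguments
# keyword names (assumed stack, shown for orientation only).
training_kwargs = {
    "per_device_train_batch_size": config["batch_size"],
    "learning_rate": config["learning_rate"],
    "warmup_steps": config["warmup_steps"],
    "max_steps": config["max_steps"],
}
print(training_kwargs["learning_rate"])  # 1e-05
```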

Training Results

Word Error Rate (WER) Progress

Lower WER is better (0% = perfect transcription):

No plot data available

Performance by Audio Duration

Transcription accuracy varies with audio length:

No plot data available

Error Type Distribution

Types of transcription errors:

No plot data available

Performance by Speaker Characteristics

WER across different speaker demographics:

No plot data available

Real-time Factor (RTF)

Processing speed relative to audio duration:

No plot data available

Transcription Confidence

Model certainty in its predictions:

No plot data available

Common Use Cases

  • Meeting Transcription: Automatic minutes, action items extraction
  • Podcast Captioning: Create searchable transcripts, improve accessibility
  • Call Center Analytics: Analyze customer support conversations
  • Medical Documentation: Transcribe doctor-patient consultations
  • Legal Proceedings: Court reporting, deposition transcription
  • Education: Lecture transcription, language learning applications
  • Voice Assistants: Command recognition, voice control
  • Accessibility: Real-time captioning for deaf/hard-of-hearing

Key Settings

Essential Parameters

  • model_size: tiny, base, small, medium, large, large-v2
  • language: Language code (en, es, fr, etc.) or "auto"
  • task: "transcribe" or "translate" (to English)
  • sampling_rate: Audio sample rate (16000 Hz standard)
  • batch_size: Parallel audio clips (4-16 typical)
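A small validator can catch misconfigured runs before any GPU time is spent. The allowed sets mirror the bullets above; `check_settings` is a hypothetical helper for illustration, not part of any Whisper API:

```python
VALID_SIZES = {"tiny", "base", "small", "medium", "large", "large-v2"}
VALID_TASKS = {"transcribe", "translate"}

def check_settings(s):
    """Return a list of problems with an ASR settings dict.

    Mirrors the documented constraints; the 4-16 batch_size range
    is treated as a hard check here purely for demonstration.
    """
    problems = []
    if s.get("model_size") not in VALID_SIZES:
        problems.append("unknown model_size")
    if s.get("task") not in VALID_TASKS:
        problems.append("task must be 'transcribe' or 'translate'")
    if s.get("sampling_rate") != 16000:
        problems.append("Whisper expects 16000 Hz input")
    if not 4 <= s.get("batch_size", 0) <= 16:
        problems.append("batch_size outside the typical 4-16 range")
    return problems

print(check_settings({"model_size": "large-v2", "task": "transcribe",
                      "sampling_rate": 16000, "batch_size": 8}))  # []
```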

Audio Processing

  • vad_filter: Voice activity detection to skip silence
  • normalize: Audio normalization for consistent volume
  • chunk_length: Split long audio into segments (30s default)
  • overlap: Overlap between chunks for continuity
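The chunking-with-overlap scheme can be sketched in a few lines. The 30 s window matches Whisper's native segment length; the 5 s overlap is an illustrative default for stitching transcripts across chunk boundaries:

```python
def chunk_spans(total_s, chunk_s=30.0, overlap_s=5.0):
    """Split an audio duration into (start, end) windows in seconds.

    Consecutive windows overlap by overlap_s so words cut at a
    boundary appear in both chunks and can be merged afterwards.
    """
    spans, start = [], 0.0
    step = chunk_s - overlap_s
    while start < total_s:
        spans.append((start, min(start + chunk_s, total_s)))
        start += step
    return spans

print(chunk_spans(70.0))  # [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```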

Decoding Options

  • temperature: Sampling temperature (0 = greedy)
  • beam_size: Beam search width (5 default, higher = more accurate)
  • best_of: Number of candidates (5 default)
  • no_speech_threshold: Silence detection sensitivity
  • compression_ratio_threshold: Filter nonsensical outputs
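These options interact: Whisper's reference implementation first decodes at temperature 0 and retries at higher temperatures when the output looks degenerate, e.g. when the compression ratio exceeds `compression_ratio_threshold` (repeated phrases compress suspiciously well). A simplified sketch of that fallback loop, with a toy decoder standing in for the model:

```python
def transcribe_with_fallback(decode,
                             temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                             compression_ratio_threshold=2.4):
    """Whisper-style temperature fallback, heavily simplified.

    `decode` is any callable taking a temperature and returning
    (text, compression_ratio); the real model also checks average
    log-probability, which this sketch omits.
    """
    for t in temperatures:
        text, ratio = decode(t)
        if ratio <= compression_ratio_threshold:
            return text, t
    return text, t  # last attempt, even if still suspicious

# Toy decoder: greedy decoding "loops"; sampling at t >= 0.4 recovers.
def toy_decode(t):
    return ("ok ok ok ok", 3.1) if t < 0.4 else ("that works", 1.4)

print(transcribe_with_fallback(toy_decode))  # ('that works', 0.4)
```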

Performance Metrics

  • Word Error Rate (WER): 3.2% on test set
  • Character Error Rate (CER): 1.8%
  • Real-time Factor: 0.15 (6.7x faster than real-time)
  • Processing Speed: 1 minute of audio in 9 seconds
  • Model Size: 3.09 GB (Large-v2)
  • Parameters: 1.55 billion
  • Languages Supported: 99 languages
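WER is the word-level edit distance (substitutions + insertions + deletions) divided by the reference length. A pure-Python sketch (production code would typically use a library such as `jiwer`); the last line checks the real-time-factor arithmetic quoted above:

```python
def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] = edit distance between ref[:i] and hyp[:j], row by row.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[j] = min(d[j] + 1,      # deletion
                       d[j - 1] + 1,  # insertion
                       prev + cost)   # substitution / match
            prev = cur
    return d[-1] / len(ref)

print(round(wer("the cat sat on the mat", "the cat sat on mat"), 3))  # 0.167
print(9 / 60)  # 0.15 -- RTF: 9 s of compute per 60 s of audio
```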

Tips for Success

  1. Audio Quality: Clean audio (16kHz+) dramatically improves accuracy
  2. Background Noise: Use noise reduction preprocessing
  3. Speaker Diarization: Combine with diarization for multi-speaker scenarios
  4. Language Detection: Use "auto" for automatic language detection
  5. Post-processing: Apply spell checking and punctuation models
  6. Timestamps: Enable word-level timestamps for alignment
  7. VAD Filter: Reduces processing time by skipping silence
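Tip 7's VAD filtering can be illustrated with a crude energy detector, a stand-in for real VADs such as Silero VAD or webrtcvad, applied here to one second of silence followed by one second of tone:

```python
import numpy as np

def energy_vad(audio, rate, frame_ms=30, threshold=0.01):
    """Mark each frame as speech when its mean energy exceeds a
    threshold. Only a demonstration: real VADs model speech
    spectra, not just energy, and handle noise floors adaptively."""
    frame = rate * frame_ms // 1000
    n = len(audio) // frame
    frames = audio[:n * frame].reshape(n, frame)
    return (frames ** 2).mean(axis=1) > threshold

rate = 16000
t = np.arange(rate) / rate
signal = np.concatenate([np.zeros(rate),                       # 1 s silence
                         0.5 * np.sin(2 * np.pi * 220 * t)])   # 1 s tone
voiced = energy_vad(signal, rate)
print(int(voiced.sum()), len(voiced))  # 33 66
```

Frames flagged False can be skipped entirely, which is where the processing-time savings come from.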

Example Scenarios

Scenario 1: Clear Studio Recording

  • Input: Clean podcast audio, single speaker
  • Output: "Welcome to the AI podcast where we discuss the latest developments in artificial intelligence and machine learning."
  • WER: 0.5%
  • Confidence: 99.2%
  • Processing Time: 0.8 seconds (10-second clip)

Scenario 2: Noisy Phone Call

  • Input: Customer support call with background noise
  • Output: "I'd like to return my order from last week because it arrived damaged."
  • WER: 5.3%
  • Confidence: 87.4%
  • Common Errors: "week" → "weak", missed "because"

Scenario 3: Accented Speech

  • Input: Non-native English speaker with heavy accent
  • Output: "The product quality is excellent but the delivery was delayed."
  • WER: 6.8%
  • Confidence: 82.1%
  • Common Errors: "excellent" → "accent", "delayed" → "delay"

Troubleshooting

Problem: High WER on specific domain (medical, legal)

  • Solution: Fine-tune on domain-specific data, use custom vocabulary

Problem: Missing punctuation in transcriptions

  • Solution: Add punctuation restoration model, fine-tune with punctuated text

Problem: Hallucinations (generating repeated phrases)

  • Solution: Adjust compression_ratio_threshold, use VAD filter

Problem: Slow processing for long audio files

  • Solution: Use smaller model variant, reduce beam_size, enable VAD

Problem: Poor performance on multi-speaker audio

  • Solution: Add speaker diarization, split audio by speaker first

Model Architecture Highlights

Whisper consists of:

  • Encoder: Processes mel-spectrogram audio features
    • 32 transformer layers (Large-v2)
    • Multi-head attention mechanism
  • Decoder: Auto-regressive text generation
    • 32 transformer layers
    • Cross-attention to encoder outputs
  • Training Data: 680,000 hours of multilingual audio
  • Special Tokens: Language ID, task type, timestamps
  • Multitask Learning: Transcription, translation, language ID

Whisper Variants Comparison

Model     Params  Speed  WER (en)  Memory   Best For
Tiny      39M     32x    7.8%      150 MB   Mobile, IoT
Base      74M     16x    5.4%      290 MB   Real-time apps
Small     244M    6x     4.1%      960 MB   Production
Medium    769M    2x     3.5%      3.1 GB   High accuracy
Large     1550M   1x     2.9%      6.2 GB   Maximum accuracy
Large-v2  1550M   1x     2.8%      6.2 GB   Latest, best

Next Steps

After fine-tuning your Whisper model, you can:

  • Deploy as REST API for real-time transcription
  • Build live streaming transcription service
  • Add speaker diarization (pyannote.audio)
  • Integrate with video platforms (YouTube, Zoom)
  • Create voice command system
  • Export to mobile (iOS CoreML, Android TFLite)
  • Add translation capabilities (98 languages)
  • Build meeting assistant with summarization
  • Combine with LLMs for conversation analysis
