Speech Recognition - Whisper
Automatic speech recognition using Whisper Large-v2 on Common Voice
This case study demonstrates fine-tuning OpenAI's Whisper Large-v2 model for automatic speech recognition (ASR). Whisper is a transformer-based model trained on 680,000 hours of multilingual audio, offering state-of-the-art transcription accuracy with robust performance across accents, background noise, and technical language.
Dataset: Common Voice 11.0
- Source: HuggingFace (mozilla-foundation/common_voice_11_0)
- Type: Speech-to-text
- Size: 5,000 audio clips (English)
- Duration: ~8 hours of speech
- Format: MP3, 16kHz sampling rate
- Speakers: Diverse ages, genders, accents
Model Configuration
{
"model": "whisper_large_v2",
"category": "audio",
"subcategory": "asr",
"model_config": {
"model_size": "large-v2",
"language": "en",
"task": "transcribe",
"batch_size": 8,
"learning_rate": 0.00001,
"sampling_rate": 16000,
"warmup_steps": 500,
"max_steps": 5000
}
}Training Results
Word Error Rate (WER) Progress
Lower WER is better (0% = perfect transcription):
Keine Plot-Daten verfügbar
Performance by Audio Duration
Transcription accuracy varies with audio length:
Keine Plot-Daten verfügbar
Error Type Distribution
Types of transcription errors:
Keine Plot-Daten verfügbar
Performance by Speaker Characteristics
WER across different speaker demographics:
Keine Plot-Daten verfügbar
Real-time Factor (RTF)
Processing speed relative to audio duration:
Keine Plot-Daten verfügbar
Transcription Confidence
Model certainty in its predictions:
Keine Plot-Daten verfügbar
Common Use Cases
- Meeting Transcription: Automatic minutes, action items extraction
- Podcast Captioning: Create searchable transcripts, improve accessibility
- Call Center Analytics: Analyze customer support conversations
- Medical Documentation: Transcribe doctor-patient consultations
- Legal Proceedings: Court reporting, deposition transcription
- Education: Lecture transcription, language learning applications
- Voice Assistants: Command recognition, voice control
- Accessibility: Real-time captioning for deaf/hard-of-hearing
Key Settings
Essential Parameters
- model_size: tiny, base, small, medium, large, large-v2
- language: Language code (en, es, fr, etc.) or "auto"
- task: "transcribe" or "translate" (to English)
- sampling_rate: Audio sample rate (16000 Hz standard)
- batch_size: Parallel audio clips (4-16 typical)
Audio Processing
- vad_filter: Voice activity detection to skip silence
- normalize: Audio normalization for consistent volume
- chunk_length: Split long audio into segments (30s default)
- overlap: Overlap between chunks for continuity
Decoding Options
- temperature: Sampling temperature (0 = greedy)
- beam_size: Beam search width (5 default, higher = more accurate)
- best_of: Number of candidates (5 default)
- no_speech_threshold: Silence detection sensitivity
- compression_ratio_threshold: Filter nonsensical outputs
Performance Metrics
- Word Error Rate (WER): 3.2% on test set
- Character Error Rate (CER): 1.8%
- Real-time Factor: 0.15 (6.7x faster than real-time)
- Processing Speed: 1 minute of audio in 9 seconds
- Model Size: 3.09 GB (Large-v2)
- Parameters: 1.55 billion
- Languages Supported: 99 languages
Tips for Success
- Audio Quality: Clean audio (16kHz+) dramatically improves accuracy
- Background Noise: Use noise reduction preprocessing
- Speaker Diarization: Combine with diarization for multi-speaker scenarios
- Language Detection: Use "auto" for automatic language detection
- Post-processing: Apply spell checking and punctuation models
- Timestamps: Enable word-level timestamps for alignment
- VAD Filter: Reduces processing time by skipping silence
Example Scenarios
Scenario 1: Clear Studio Recording
- Input: Clean podcast audio, single speaker
- Output: "Welcome to the AI podcast where we discuss the latest developments in artificial intelligence and machine learning."
- WER: 0.5%
- Confidence: 99.2%
- Processing Time: 0.8 seconds (10-second clip)
Scenario 2: Noisy Phone Call
- Input: Customer support call with background noise
- Output: "I'd like to return my order from last week because it arrived damaged."
- WER: 5.3%
- Confidence: 87.4%
- Common Errors: "week" → "weak", missed "because"
Scenario 3: Accented Speech
- Input: Non-native English speaker with heavy accent
- Output: "The product quality is excellent but the delivery was delayed."
- WER: 6.8%
- Confidence: 82.1%
- Common Errors: "excellent" → "accent", "delayed" → "delay"
Troubleshooting
Problem: High WER on specific domain (medical, legal)
- Solution: Fine-tune on domain-specific data, use custom vocabulary
Problem: Missing punctuation in transcriptions
- Solution: Add punctuation restoration model, fine-tune with punctuated text
Problem: Hallucinations (generating repeated phrases)
- Solution: Adjust compression_ratio_threshold, use VAD filter
Problem: Slow processing for long audio files
- Solution: Use smaller model variant, reduce beam_size, enable VAD
Problem: Poor performance on multi-speaker audio
- Solution: Add speaker diarization, split audio by speaker first
Model Architecture Highlights
Whisper consists of:
- Encoder: Processes mel-spectrogram audio features
- 32 transformer layers (Large-v2)
- Multi-head attention mechanism
- Decoder: Auto-regressive text generation
- 32 transformer layers
- Cross-attention to encoder outputs
- Training Data: 680,000 hours of multilingual audio
- Special Tokens: Language ID, task type, timestamps
- Multitask Learning: Transcription, translation, language ID
Whisper Variants Comparison
| Model | Params | Speed | WER (en) | Memory | Best For |
|---|---|---|---|---|---|
| Tiny | 39M | 32x | 7.8% | 150 MB | Mobile, IoT |
| Base | 74M | 16x | 5.4% | 290 MB | Real-time apps |
| Small | 244M | 6x | 4.1% | 960 MB | Production |
| Medium | 769M | 2x | 3.5% | 3.1 GB | High accuracy |
| Large | 1550M | 1x | 2.9% | 6.2 GB | Maximum accuracy |
| Large-v2 | 1550M | 1x | 2.8% | 6.2 GB | Latest, best |
Next Steps
After fine-tuning your Whisper model, you can:
- Deploy as REST API for real-time transcription
- Build live streaming transcription service
- Add speaker diarization (pyannote.audio)
- Integrate with video platforms (YouTube, Zoom)
- Create voice command system
- Export to mobile (iOS CoreML, Android TFLite)
- Add translation capabilities (98 languages)
- Build meeting assistant with summarization
- Combine with LLMs for conversation analysis