Confusion Matrix
Evaluate classification model performance
Overview
A confusion matrix is a performance measurement tool for machine learning classification models. It displays the number of correct and incorrect predictions broken down by each class, showing true positives, true negatives, false positives, and false negatives in a matrix format.
Best used for:
- Evaluating classification model accuracy
- Understanding which classes are confused with each other
- Identifying bias in model predictions
- Comparing performance across different models
- Analyzing precision, recall, and F1-score by class
- Detecting overfitting or underfitting patterns
Common Use Cases
Machine Learning & AI
- Binary classification evaluation (spam/not spam, fraud/legitimate)
- Multi-class classification assessment
- Model comparison and selection
- Hyperparameter tuning evaluation
- Feature importance validation
Medical & Diagnostics
- Disease detection accuracy
- Test result validation (positive/negative)
- Screening program effectiveness
- Diagnostic tool comparison
Quality Control
- Defect detection system evaluation
- Automated inspection accuracy
- Classification system validation
- Process control monitoring
Understanding the Confusion Matrix
Binary Classification (2×2 Matrix)
                    Predicted
                    Negative   Positive
Actual  Negative       TN         FP
        Positive       FN         TP
- True Positive (TP): Correctly predicted positive
- True Negative (TN): Correctly predicted negative
- False Positive (FP): Incorrectly predicted positive (Type I error)
- False Negative (FN): Incorrectly predicted negative (Type II error)
Multi-Class Classification (N×N Matrix)
Each cell shows how many times class i was predicted as class j.
- Diagonal: Correct predictions
- Off-diagonal: Misclassifications
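The row/column convention above can be sketched directly: count each (actual, predicted) pair into an N×N array. This is a minimal sketch using numpy; the class names and labels are illustrative, not from the original.

```python
import numpy as np

# Illustrative classes and predictions (not from the original document).
classes = ["bird", "cat", "dog"]
index = {c: i for i, c in enumerate(classes)}

y_true = ["cat", "dog", "cat", "bird", "dog", "cat"]
y_pred = ["cat", "dog", "dog", "bird", "dog", "cat"]

# Build the N x N matrix: row = actual class, column = predicted class.
cm = np.zeros((len(classes), len(classes)), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[index[t], index[p]] += 1

print(cm)
# Diagonal entries cm[i, i] are correct predictions; off-diagonal
# entries count how often class i was predicted as class j.
```

Here one "cat" was misclassified as "dog", so the cat row has a 1 in the dog column; every other prediction lands on the diagonal.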
Key Metrics Derived
Accuracy
(TP + TN) / Total
Overall correctness of the model.
Precision
TP / (TP + FP)
Of all positive predictions, how many were correct?
Recall (Sensitivity)
TP / (TP + FN)
Of all actual positives, how many did we catch?
Specificity
TN / (TN + FP)
Of all actual negatives, how many were correctly identified?
F1-Score
2 × (Precision × Recall) / (Precision + Recall)
Harmonic mean of precision and recall.
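The five formulas above can be computed directly from the four cell counts. A minimal sketch with illustrative numbers (the counts below are made up for the example):

```python
# Illustrative binary-classification counts (assumed, not from the document).
TP, TN, FP, FN = 40, 50, 5, 5

accuracy    = (TP + TN) / (TP + TN + FP + FN)  # overall correctness
precision   = TP / (TP + FP)                   # trust in positive predictions
recall      = TP / (TP + FN)                   # share of actual positives caught
specificity = TN / (TN + FP)                   # share of actual negatives caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, recall, specificity, f1)
```

With these counts, accuracy is 0.9 while precision and recall are both 40/45 ≈ 0.889, so the F1-score equals them exactly; the harmonic mean only drops below the arithmetic mean when precision and recall diverge.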
Settings
Normalize
Optional - Display values as proportions instead of counts.
When enabled, shows percentages or proportions instead of raw counts, making it easier to compare models trained on different dataset sizes.
Options:
- Off: Show raw counts
- On: Show normalized values (0-1 or percentages)
Annotate Cells
Optional - Display values in each cell.
Shows the numerical value (count or percentage) in each cell of the matrix.
Default: On
Tips for Interpreting Confusion Matrices
Focus on Off-Diagonal Values:
- High off-diagonal values indicate confusion between classes
- Look for systematic patterns in misclassification
- Consider class similarity when evaluating errors
Check Class Balance:
- Imbalanced datasets can have misleading accuracy
- Look at per-class metrics, not just overall accuracy
- Consider using normalization for imbalanced data
Understand Cost of Errors:
- False positives vs false negatives have different costs
- Medical: False negatives (missing disease) often worse
- Spam: False positives (blocking real email) often worse
- Adjust decision threshold based on cost
Use Normalization Wisely:
- Normalize by row (true class) to see recall per class
- Normalize by column (predicted class) to see precision
- Normalize by total to see overall distribution
Compare Multiple Models:
- Same confusion matrix format makes comparison easy
- Look for improvements in specific error types
- Consider which errors matter most for your application
Combine with Other Metrics:
- A confusion matrix shows detailed error counts, but not the full picture
- Use with ROC curves, precision-recall curves
- Consider business metrics alongside statistical ones
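The three normalization modes described above (by row, by column, by total) are each one division over the raw count matrix. A minimal numpy sketch with illustrative counts:

```python
import numpy as np

# Illustrative 2x2 count matrix: rows = actual, columns = predicted.
cm = np.array([[50, 10],
               [ 5, 35]], dtype=float)

by_row   = cm / cm.sum(axis=1, keepdims=True)  # rows sum to 1 -> recall per class
by_col   = cm / cm.sum(axis=0, keepdims=True)  # columns sum to 1 -> precision per class
by_total = cm / cm.sum()                       # cells sum to 1 -> overall distribution

print(by_row[1, 1])   # recall of the positive class: 35 / 40
print(by_col[1, 1])   # precision of the positive class: 35 / 45
```

Note that row- and column-normalized matrices answer different questions about the same cell: 0.875 of actual positives were caught, but only about 0.778 of positive predictions were correct.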
Example Scenarios
Binary Classification (Fraud Detection)
High recall is critical: missing fraud is costly.
Multi-Class Classification (Product Categories)
Shows which product categories are commonly confused.
Normalized Confusion Matrix
Easier to compare when classes have different frequencies.
Medical Diagnosis
False negatives (missing disease) are more serious than false positives.
When to Use Different Metrics
Use Accuracy When:
- Classes are balanced
- All errors have equal cost
- You need a simple single number
Use Precision When:
- False positives are costly
- You want confidence in positive predictions
- Examples: spam detection, fraud detection
Use Recall When:
- False negatives are costly
- You want to catch all positives
- Examples: disease screening, security threats
Use F1-Score When:
- You need balance between precision and recall
- Classes are imbalanced
- You want a single metric better than accuracy
Troubleshooting
Issue: Model has high accuracy but performs poorly
- Solution: Check if dataset is imbalanced. A model predicting all "negative" could have 95% accuracy if 95% of data is negative. Look at per-class metrics.
Issue: Can't see cell values clearly
- Solution: Enable "Annotate Cells" setting. Consider using normalization if numbers are very large or very small.
Issue: Hard to compare models with different sample sizes
- Solution: Enable "Normalize" to show proportions instead of raw counts. This makes models directly comparable.
Issue: Confusion between similar classes
- Solution: This is normal when classes are similar (e.g., "cat" vs "dog"). Consider combining similar classes or improving features that distinguish them.
Issue: Perfect diagonal (all correct)
- Solution: Might indicate overfitting, especially if validation performance is poor. Check if test data leaked into training.
Issue: Almost no true positives
- Solution: Model might be biased toward negative class. Check class balance, try resampling, or adjust decision threshold.
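Adjusting the decision threshold, as suggested above, directly trades false negatives for false positives. A minimal sketch with illustrative scores showing how lowering the threshold recovers true positives:

```python
import numpy as np

# Illustrative labels and predicted probabilities (assumed for the example).
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.45, 0.80, 0.90])  # model's P(positive)

def counts(threshold):
    """Return (TP, FN, FP) for a given decision threshold."""
    y_pred = (scores >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    return tp, fn, fp

print(counts(0.5))  # default threshold: the 0.45 positive is missed
print(counts(0.4))  # lower threshold: that positive is caught, at the cost of one FP
```

At a threshold of 0.5 the model misses one positive (TP=2, FN=1, FP=0); dropping to 0.4 catches it but admits a false positive (TP=3, FN=0, FP=1). Which trade is right depends on the relative cost of each error.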