Image Classification
Train models to categorize images into predefined classes
Image classification is the task of assigning a label or category to an entire image. This is one of the most fundamental computer vision tasks, with applications ranging from medical diagnosis to product recognition and content moderation.
Learn About Image Classification
New to image classification? Visit our Image Classification Concepts Guide to learn how these models work, common architectures, and best practices for data preparation.
Available Models
Vision Transformer (ViT) Models
Vision Transformers apply the transformer architecture to image classification by splitting images into patches and processing them as sequences.
- ViT Base - Balanced model with 86M parameters, good for most use cases
- ViT Large - Larger model with 304M parameters, higher accuracy but slower
- ViT Small MSN - Smaller variant pre-trained with Masked Siamese Networks (MSN) self-supervision, efficient and accurate
ResNet Models
Residual Networks use skip connections to enable training of very deep networks, providing excellent accuracy-to-efficiency ratios.
- ResNet-18 - Lightweight 18-layer model, fastest training and inference
- ResNet-50 - 50-layer model, excellent balance of speed and accuracy
- ResNet-101 - 101-layer model, highest accuracy in ResNet family
Efficient Models
Models optimized for speed, size, or mobile deployment while maintaining competitive accuracy.
- EfficientNet-B0 - Compound scaling for optimal efficiency, great accuracy with fewer parameters
- MobileNetV3-Small - Optimized for mobile and edge devices, minimal latency
Common Configuration
Training Images Folder Structure
All image classification models expect training images organized in class subfolders:
train_images/
├── class1/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
├── class2/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
└── class3/
    ├── image1.jpg
    └── ...
Key Training Parameters
Batch Size: Number of images processed together
- Larger batches: Faster training, more GPU memory
- Smaller batches: Less memory, potentially better generalization
- Typical values: 4-32 depending on model size and GPU
Epochs: Number of complete passes through the training data
- Too few: Underfitting, poor accuracy
- Too many: Overfitting, poor generalization
- Start with 1-10 epochs, adjust based on validation metrics
Learning Rate: Step size for model parameter updates
- Too high: Training instability, divergence
- Too low: Slow convergence, risk of getting stuck in poor local minima
- Typical range: 1e-5 to 5e-4 for fine-tuning
Eval Steps: Frequency of validation evaluations
- Set to 1 to evaluate after each epoch
- Higher values for large datasets to reduce overhead
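Pulled together, these parameters might look like the sketch below. The names and values are illustrative defaults, not tied to any specific platform's API:

```python
# Illustrative training configuration; names are generic, not a specific
# framework's API.
config = {
    "batch_size": 16,       # typical range 4-32 depending on model and GPU
    "epochs": 5,            # fine-tuning usually converges in 1-10 epochs
    "learning_rate": 5e-5,  # typical fine-tuning range: 1e-5 to 5e-4
    "eval_steps": 1,        # evaluate once per epoch
}

# Derived quantity: optimizer updates per epoch for a given dataset size.
num_train_images = 4800
steps_per_epoch = -(-num_train_images // config["batch_size"])  # ceil division
print(steps_per_epoch)  # 300
```

Larger batch sizes reduce the number of optimizer steps per epoch, which is one reason they speed up training at the cost of GPU memory.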
Fine-tuning vs Training from Scratch
Fine-tuning (Recommended)
- Uses pre-trained weights from ImageNet or similar datasets
- Requires less data (hundreds to thousands of images)
- Faster convergence (1-10 epochs typically sufficient)
- Better for most practical applications
Training from Scratch
- Starts with random initialization
- Requires large datasets (tens of thousands of images)
- Takes many more epochs to converge
- Only recommended when you have abundant data
Understanding Metrics
Accuracy: Percentage of correct predictions
- Primary metric for balanced datasets
- Can be misleading for imbalanced classes
Loss: Measures how wrong the predictions are
- Should decrease over training
- Sudden increases often indicate learning rate issues
Confusion Matrix: Shows per-class performance
- Identifies which classes are confused with each other
- Helps diagnose dataset quality issues
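A small self-contained example (made-up labels, no external libraries) shows why accuracy alone can mislead on imbalanced data and how a confusion matrix exposes the problem:

```python
# Accuracy vs. confusion matrix on an imbalanced two-class toy dataset.
from collections import Counter

labels = ["cat"] * 90 + ["dog"] * 10  # imbalanced ground truth
preds = ["cat"] * 100                 # a model that always predicts "cat"

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(accuracy)  # 0.9 -- looks strong, yet "dog" is never predicted

# Confusion matrix entries keyed as (true class, predicted class).
confusion = Counter(zip(labels, preds))
print(confusion[("dog", "cat")])  # 10 dogs misclassified as cats
```

The 90% accuracy hides a 0% recall on the minority class; the `("dog", "cat")` cell makes the failure visible immediately.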
Choosing the Right Model
By Priority
Maximum Accuracy
- ViT Large (best overall, but slowest)
- ResNet-101 (excellent CNN alternative)
- EfficientNet-B0 (best parameter efficiency)
Fastest Training
- ResNet-18 (quickest to fine-tune)
- MobileNetV3-Small (fast and lightweight)
- ViT Small MSN (efficient transformer)
Smallest Model Size
- MobileNetV3-Small (~5MB)
- EfficientNet-B0 (~20MB)
- ResNet-18 (~45MB)
Best for Mobile/Edge
- MobileNetV3-Small (designed for mobile)
- EfficientNet-B0 (excellent efficiency)
- ResNet-18 (lightweight and fast)
By Use Case
Medical Imaging
- ViT Large or ResNet-101 for maximum accuracy
- Use higher resolution images if possible
- Ensure balanced training data across classes
Product Recognition
- EfficientNet-B0 for good accuracy with reasonable speed
- ResNet-50 for production deployments
- Focus on data augmentation for variety
Real-time Applications
- MobileNetV3-Small for edge devices
- ResNet-18 for server-side real-time
- Consider quantization for further speedup
General Purpose
- ResNet-50 for most use cases
- ViT Base when you have sufficient data
- EfficientNet-B0 for cloud deployments
Best Practices
Data Preparation
- Balance your dataset: Ensure similar numbers of images per class
- Image quality: Use consistent image sizes and quality
- Data augmentation: Helps prevent overfitting (rotation, flipping, color jitter)
- Validation split: Hold out 10-20% of data for validation
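A per-class (stratified) split keeps the validation set balanced even when class sizes differ. The sketch below assumes the class-subfolder layout described earlier; the 20% fraction and `.jpg` extension are illustrative:

```python
# Stratified train/validation split from a train_images/<class>/<image> layout.
import random
import tempfile
from pathlib import Path

def split_per_class(root, val_fraction=0.2, seed=42):
    """Hold out val_fraction of each class's images, preserving balance."""
    rng = random.Random(seed)
    train, val = [], []
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue
        files = sorted(class_dir.glob("*.jpg"))
        rng.shuffle(files)
        n_val = max(1, int(len(files) * val_fraction))
        val.extend(files[:n_val])    # per-class holdout
        train.extend(files[n_val:])
    return train, val

# Tiny demonstration with a temporary directory (10 images per class).
base = Path(tempfile.mkdtemp())
for cls in ("cat", "dog"):
    (base / cls).mkdir()
    for i in range(10):
        (base / cls / f"img{i}.jpg").touch()

train, val = split_per_class(base)
print(len(train), len(val))  # 16 4
```

Splitting within each class, rather than over the whole pool, guarantees every class appears in both sets.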
Training Strategy
- Start with low learning rate: 1e-5 to 5e-5 for fine-tuning
- Monitor training loss: Should decrease steadily
- Check for overfitting: Validation accuracy should improve with training accuracy
- Use early stopping: Stop if validation accuracy plateaus or decreases
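Early stopping is usually implemented as a patience counter over validation metrics. The sketch below is a generic version with made-up accuracy values; real trainers wire this into the evaluation loop:

```python
# Patience-based early stopping over per-epoch validation accuracies.
def early_stop_epoch(val_accuracies, patience=2):
    """Return the epoch index where training stops, or the last epoch."""
    best, bad_epochs = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, bad_epochs = acc, 0  # new best: reset the counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch  # no improvement for `patience` epochs
    return len(val_accuracies) - 1

history = [0.71, 0.78, 0.81, 0.80, 0.80, 0.79]  # illustrative values
print(early_stop_epoch(history))  # 4: accuracy peaked at epoch 2
```

In practice you would also checkpoint the model at each new best, so stopping late costs nothing.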
GPU Considerations
- ResNet models: Can train on CPU for small datasets, GPU recommended
- ViT models: GPU strongly recommended due to transformer architecture
- Batch size: Reduce if you encounter out-of-memory errors
- Mixed precision: Enable for faster training on modern GPUs
Dataset Size Guidelines
Small Dataset (<1,000 images)
- Use ResNet-18 or MobileNetV3-Small
- Lower learning rate (1e-5)
- More epochs (10-20)
- Heavy data augmentation
Medium Dataset (1,000-10,000 images)
- ResNet-50 or EfficientNet-B0 recommended
- Standard learning rate (5e-5)
- Moderate epochs (5-10)
- Standard augmentation
Large Dataset (>10,000 images)
- Any model works well
- ViT models particularly effective
- Can use higher learning rates
- Less aggressive augmentation needed
Common Pitfalls
Out of Memory Errors
Solution: Reduce batch size, use a smaller model, or enable gradient accumulation
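Gradient accumulation trades memory for time: run several small micro-batches, accumulate their gradients, and step the optimizer once. A minimal PyTorch sketch (toy model and random data, purely illustrative):

```python
# Simulate a batch of 32 using micro-batches of 8, stepping every 4.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for a real classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
accum_steps = 4           # effective batch size = 8 * 4 = 32

optimizer.zero_grad()
for step in range(8):     # 8 micro-batches of 8 samples each
    x = torch.randn(8, 10)
    y = torch.randint(0, 2, (8,))
    loss = criterion(model(x), y) / accum_steps  # scale so gradients average
    loss.backward()       # backward() accumulates into existing gradients
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Dividing the loss by `accum_steps` keeps the accumulated gradient equal in scale to a single large-batch gradient.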
Model Not Learning (Loss Not Decreasing)
Solution: Increase learning rate, check data preprocessing, ensure labels are correct
Overfitting (Training Accuracy High, Validation Low)
Solution: Add data augmentation, reduce model size, increase dataset size, add regularization
Poor Accuracy on Certain Classes
Solution: Add more training examples for those classes, check for label errors, adjust class weights
Training Too Slow
Solution: Use a smaller model, increase batch size, use GPU, reduce image resolution
Predictions All the Same Class
Solution: Check class balance, reduce learning rate, verify data loading is working correctly