Image Classification
Train models to categorize images into predefined classes
Image classification is the task of assigning a label or category to an entire image. This is one of the most fundamental computer vision tasks, with applications ranging from medical diagnosis to product recognition and content moderation.
Learn About Image Classification
New to image classification? Visit our Image Classification Concepts Guide to learn how these models work, common architectures, and best practices for data preparation.
Available Models
Vision Transformer (ViT) Models
Vision Transformers apply the transformer architecture to image classification by splitting images into patches and processing them as sequences.
- ViT Base - Balanced model with 86M parameters, good for most use cases
- ViT Large - Larger model with 304M parameters, higher accuracy but slower
- ViT Small MSN - Smaller variant pre-trained with Masked Siamese Networks (MSN) self-supervision, efficient and accurate
ResNet Models
Residual Networks use skip connections to enable training of very deep networks, providing excellent accuracy-to-efficiency ratios.
- ResNet-18 - Lightweight 18-layer model, fastest training and inference
- ResNet-50 - 50-layer model, excellent balance of speed and accuracy
- ResNet-101 - 101-layer model, highest accuracy in ResNet family
Efficient Models
Models optimized for speed, size, or mobile deployment while maintaining competitive accuracy.
- EfficientNet-B0 - Compound scaling for optimal efficiency, great accuracy with fewer parameters
- MobileNetV3-Small - Optimized for mobile and edge devices, minimal latency
Common Configuration
Training Images Folder Structure
All image classification models expect training images organized in class subfolders:
train_images/
├── class1/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
├── class2/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
└── class3/
    ├── image1.jpg
    └── ...
Key Training Parameters
Batch Size: Number of images processed together
- Larger batches: Faster training, more GPU memory
- Smaller batches: Less memory, potentially better generalization
- Typical values: 4-32 depending on model size and GPU
Epochs: Number of complete passes through the training data
- Too few: Underfitting, poor accuracy
- Too many: Overfitting, poor generalization
- Start with 1-10 epochs, adjust based on validation metrics
Learning Rate: Step size for model parameter updates
- Too high: Training instability, divergence
- Too low: Slow convergence, risk of getting stuck in poor local minima
- Typical range: 1e-5 to 5e-4 for fine-tuning
Eval Steps: Frequency of validation evaluations
- Set to 1 to evaluate after each epoch
- Higher values for large datasets to reduce overhead
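Pulled together, these parameters might look like the sketch below. The names and values are illustrative defaults, not tied to any specific platform's API:

```python
# Illustrative training configuration; names are generic, not a specific
# framework's API.
config = {
    "batch_size": 16,       # typical range 4-32 depending on model and GPU
    "epochs": 5,            # fine-tuning usually converges in 1-10 epochs
    "learning_rate": 5e-5,  # typical fine-tuning range: 1e-5 to 5e-4
    "eval_steps": 1,        # evaluate once per epoch
}

# Derived quantity: optimizer updates per epoch for a given dataset size.
num_train_images = 4800
steps_per_epoch = -(-num_train_images // config["batch_size"])  # ceil division
print(steps_per_epoch)  # 300
```

Larger batch sizes reduce the number of optimizer steps per epoch, which is one reason they speed up training at the cost of GPU memory.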
Fine-tuning vs Training from Scratch
Fine-tuning (Recommended)
- Uses pre-trained weights from ImageNet or similar datasets
- Requires less data (hundreds to thousands of images)
- Faster convergence (1-10 epochs typically sufficient)
- Better for most practical applications
Training from Scratch
- Starts with random initialization
- Requires large datasets (tens of thousands of images)
- Takes many more epochs to converge
- Only recommended when you have abundant data
Understanding Metrics
Accuracy: Percentage of correct predictions
- Primary metric for balanced datasets
- Can be misleading for imbalanced classes
Loss: Measures how wrong the predictions are
- Should decrease over training
- Sudden increases often indicate learning rate issues
Confusion Matrix: Shows per-class performance
- Identifies which classes are confused with each other
- Helps diagnose dataset quality issues
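A small self-contained example (made-up labels, no external libraries) shows why accuracy alone can mislead on imbalanced data and how a confusion matrix exposes the problem:

```python
# Accuracy vs. confusion matrix on an imbalanced two-class toy dataset.
from collections import Counter

labels = ["cat"] * 90 + ["dog"] * 10  # imbalanced ground truth
preds = ["cat"] * 100                 # a model that always predicts "cat"

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(accuracy)  # 0.9 -- looks strong, yet "dog" is never predicted

# Confusion matrix entries keyed as (true class, predicted class).
confusion = Counter(zip(labels, preds))
print(confusion[("dog", "cat")])  # 10 dogs misclassified as cats
```

The 90% accuracy hides a 0% recall on the minority class; the `("dog", "cat")` cell makes the failure visible immediately.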
Choosing the Right Model
By Priority
Maximum Accuracy
- ViT Large (best overall, but slowest)
- ResNet-101 (excellent CNN alternative)
- EfficientNet-B0 (best parameter efficiency)
Fastest Training
- ResNet-18 (quickest to fine-tune)
- MobileNetV3-Small (fast and lightweight)
- ViT Small MSN (efficient transformer)
Smallest Model Size
- MobileNetV3-Small (~5MB)
- EfficientNet-B0 (~20MB)
- ResNet-18 (~45MB)
Best for Mobile/Edge
- MobileNetV3-Small (designed for mobile)
- EfficientNet-B0 (excellent efficiency)
- ResNet-18 (lightweight and fast)
By Use Case
Medical Imaging
- ViT Large or ResNet-101 for maximum accuracy
- Use higher resolution images if possible
- Ensure balanced training data across classes
Product Recognition
- EfficientNet-B0 for good accuracy with reasonable speed
- ResNet-50 for production deployments
- Focus on data augmentation for variety
Real-time Applications
- MobileNetV3-Small for edge devices
- ResNet-18 for server-side real-time
- Consider quantization for further speedup
General Purpose
- ResNet-50 for most use cases
- ViT Base when you have sufficient data
- EfficientNet-B0 for cloud deployments
Best Practices
Data Preparation
- Balance your dataset: Ensure similar numbers of images per class
- Image quality: Use consistent image sizes and quality
- Data augmentation: Helps prevent overfitting (rotation, flipping, color jitter)
- Validation split: Hold out 10-20% of data for validation
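A per-class (stratified) split keeps the validation set balanced even when class sizes differ. The sketch below assumes the class-subfolder layout described earlier; the 20% fraction and `.jpg` extension are illustrative:

```python
# Stratified train/validation split from a train_images/<class>/<image> layout.
import random
import tempfile
from pathlib import Path

def split_per_class(root, val_fraction=0.2, seed=42):
    """Hold out val_fraction of each class's images, preserving balance."""
    rng = random.Random(seed)
    train, val = [], []
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue
        files = sorted(class_dir.glob("*.jpg"))
        rng.shuffle(files)
        n_val = max(1, int(len(files) * val_fraction))
        val.extend(files[:n_val])    # per-class holdout
        train.extend(files[n_val:])
    return train, val

# Tiny demonstration with a temporary directory (10 images per class).
base = Path(tempfile.mkdtemp())
for cls in ("cat", "dog"):
    (base / cls).mkdir()
    for i in range(10):
        (base / cls / f"img{i}.jpg").touch()

train, val = split_per_class(base)
print(len(train), len(val))  # 16 4
```

Splitting within each class, rather than over the whole pool, guarantees every class appears in both sets.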
Training Strategy
- Start with low learning rate: 1e-5 to 5e-5 for fine-tuning
- Monitor training loss: Should decrease steadily
- Check for overfitting: Validation accuracy should improve with training accuracy
- Use early stopping: Stop if validation accuracy plateaus or decreases
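Early stopping is usually implemented as a patience counter over validation metrics. The sketch below is a generic version with made-up accuracy values; real trainers wire this into the evaluation loop:

```python
# Patience-based early stopping over per-epoch validation accuracies.
def early_stop_epoch(val_accuracies, patience=2):
    """Return the epoch index where training stops, or the last epoch."""
    best, bad_epochs = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, bad_epochs = acc, 0  # new best: reset the counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch  # no improvement for `patience` epochs
    return len(val_accuracies) - 1

history = [0.71, 0.78, 0.81, 0.80, 0.80, 0.79]  # illustrative values
print(early_stop_epoch(history))  # 4: accuracy peaked at epoch 2
```

In practice you would also checkpoint the model at each new best, so stopping late costs nothing.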
GPU Considerations
- ResNet models: Can train on CPU for small datasets, GPU recommended
- ViT models: GPU strongly recommended due to transformer architecture
- Batch size: Reduce if you encounter out-of-memory errors
- Mixed precision: Enable for faster training on modern GPUs
Dataset Size Guidelines
Small Dataset (<1,000 images)
- Use ResNet-18 or MobileNetV3-Small
- Lower learning rate (1e-5)
- More epochs (10-20)
- Heavy data augmentation
Medium Dataset (1,000-10,000 images)
- ResNet-50 or EfficientNet-B0 recommended
- Standard learning rate (5e-5)
- Moderate epochs (5-10)
- Standard augmentation
Large Dataset (>10,000 images)
- Any model works well
- ViT models particularly effective
- Can use higher learning rates
- Less aggressive augmentation needed
Common Pitfalls
Out of Memory Errors
Solution: Reduce batch size, use a smaller model, or enable gradient accumulation
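Gradient accumulation trades memory for time: run several small micro-batches, accumulate their gradients, and step the optimizer once. A minimal PyTorch sketch (toy model and random data, purely illustrative):

```python
# Simulate a batch of 32 using micro-batches of 8, stepping every 4.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for a real classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
accum_steps = 4           # effective batch size = 8 * 4 = 32

optimizer.zero_grad()
for step in range(8):     # 8 micro-batches of 8 samples each
    x = torch.randn(8, 10)
    y = torch.randint(0, 2, (8,))
    loss = criterion(model(x), y) / accum_steps  # scale so gradients average
    loss.backward()       # backward() accumulates into existing gradients
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Dividing the loss by `accum_steps` keeps the accumulated gradient equal in scale to a single large-batch gradient.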
Model Not Learning (Loss Not Decreasing)
Solution: Increase learning rate, check data preprocessing, ensure labels are correct
Overfitting (Training Accuracy High, Validation Low)
Solution: Add data augmentation, reduce model size, increase dataset size, add regularization
Poor Accuracy on Certain Classes
Solution: Add more training examples for those classes, check for label errors, adjust class weights
Training Too Slow
Solution: Use a smaller model, increase batch size, use GPU, reduce image resolution
Predictions All the Same Class
Solution: Check class balance, reduce learning rate, verify data loading is working correctly