Object Detection
Train models to locate and classify multiple objects within images
Object detection combines classification and localization to identify where objects are in an image and what they are. Unlike image classification, which assigns a label to an entire image, object detection outputs a bounding box and class label for each detected object. This task is fundamental to applications like autonomous driving, surveillance, robotics, and visual inspection.
Learn About Object Detection
New to object detection? Visit our Object Detection Concepts Guide to learn about bounding boxes, IoU metrics, anchor-free vs anchor-based methods, and annotation formats like COCO.
Available Models
DETR (Detection Transformer) Family
DETR revolutionized object detection by eliminating hand-crafted components like anchor boxes and non-maximum suppression through a transformer-based approach.
- DETR ResNet-50 - Standard DETR with ResNet-50 backbone, balanced performance
- DETR ResNet-101 - Deeper backbone for higher accuracy
- DETR ResNet-50 DC5 - Dilated convolutions for improved small object detection
- DETR ResNet-101 DC5 - Deepest DETR variant with dilated convolutions
Advanced DETR Variants
Improvements on the DETR architecture addressing convergence speed and accuracy.
- Deformable DETR - Deformable attention for faster convergence and better small object detection
- Conditional DETR - Conditional spatial queries for faster training
YOLO Family
You Only Look Once (YOLO) models prioritize real-time detection speed while maintaining competitive accuracy.
- YOLOv8-Nano - Ultra-fast and lightweight for edge devices and real-time applications
Common Configuration
Data Requirements
Training Images: Directory containing your object images
Annotations: JSON file in COCO format containing:
- Image information (filename, dimensions)
- Bounding boxes (x, y, width, height)
- Object categories/classes
- Instance IDs
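Before training, it can help to sanity-check the annotation file against the fields listed above. A minimal sketch in plain Python (the function name and error messages are illustrative, not part of any platform API):

```python
import json

def check_coco_annotations(path):
    """Light sanity check of a COCO-format annotation file.

    Verifies the required top-level keys exist, that each bbox is
    [x, y, width, height] with positive size, and that annotations
    reference known image and category IDs.
    """
    with open(path) as f:
        coco = json.load(f)

    for key in ("images", "annotations", "categories"):
        assert key in coco, f"missing top-level key: {key}"

    image_ids = {img["id"] for img in coco["images"]}
    category_ids = {cat["id"] for cat in coco["categories"]}

    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]
        assert w > 0 and h > 0, f"degenerate bbox in annotation {ann['id']}"
        assert ann["image_id"] in image_ids, "annotation references unknown image"
        assert ann["category_id"] in category_ids, "annotation references unknown category"

    return len(coco["images"]), len(coco["annotations"])
```

Running a check like this before a multi-hour training job catches common export mistakes (corner-format boxes, dangling IDs) cheaply.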
COCO Annotation Format Example:
{
  "images": [
    {"id": 1, "file_name": "image1.jpg", "height": 480, "width": 640}
  ],
  "annotations": [
    {"id": 1, "image_id": 1, "category_id": 1,
     "bbox": [100, 150, 200, 180], "area": 36000}
  ],
  "categories": [
    {"id": 1, "name": "car"},
    {"id": 2, "name": "person"}
  ]
}
Key Training Parameters
Batch Size: Number of images processed together
- DETR models: 2-8 (transformer overhead)
- YOLO models: 8-32 (more efficient architecture)
- Reduce if out-of-memory errors occur
Epochs: Complete passes through training data
- 1-5 epochs typical for fine-tuning
- More epochs for training from scratch or small datasets
- Object detection generally needs fewer epochs than classification
Learning Rate: Optimizer step size
- 5e-5 typical for DETR models
- Higher rates possible for YOLO (1e-3 to 1e-4)
- Lower rates for small datasets or when fine-tuning
Eval Steps: Evaluation frequency during training
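Putting the guidance above together, a fine-tuning run for a DETR-style model might start from values like these. The dictionary keys are hypothetical; adapt them to whatever configuration format your training framework expects:

```python
# Illustrative starting configuration for fine-tuning a DETR-style model.
# Key names are hypothetical, not a fixed API; values follow the
# guidance above and are a starting point, not a tuned setting.
detr_finetune_config = {
    "model": "detr-resnet-50",
    "batch_size": 4,        # DETR: 2-8; reduce on out-of-memory errors
    "epochs": 3,            # 1-5 is typical for fine-tuning
    "learning_rate": 5e-5,  # typical for DETR; YOLO tolerates 1e-3 to 1e-4
    "eval_steps": 500,      # evaluate on the validation set every 500 steps
}
```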
Understanding Metrics
mAP (mean Average Precision): Primary metric for object detection
- mAP@0.5: Average Precision at IoU threshold 0.5 (lenient)
- mAP@0.5:0.95: Average over IoU thresholds 0.5 to 0.95 (strict, COCO standard)
- Higher is better, ranges from 0 to 1 (or 0% to 100%)
IoU (Intersection over Union): Overlap between predicted and ground truth boxes
- IoU > 0.5: Generally considered a correct detection
- IoU > 0.75: High-quality detection
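For boxes in COCO's [x, y, width, height] format, IoU can be computed directly; a minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes in [x, y, w, h] (COCO) format."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Intersection rectangle (empty if the boxes don't overlap)
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give 1.0 and disjoint boxes give 0.0; two same-size boxes overlapping by half their width give 1/3, which already fails the 0.5 correctness threshold.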
Precision: Fraction of detections that are correct
- High precision: Few false positives
Recall: Fraction of ground truth objects that are detected
- High recall: Few missed objects
Loss Components:
- Classification loss: How well classes are predicted
- Bounding box regression loss: How accurately boxes are localized
- Should both decrease during training
Choosing the Right Model
By Priority
Maximum Accuracy
- DETR ResNet-101 DC5 (best overall)
- Deformable DETR (great for small objects)
- DETR ResNet-101
Fastest Training
- YOLOv8-Nano (quickest to converge)
- Conditional DETR (improved DETR convergence)
- DETR ResNet-50
Fastest Inference
- YOLOv8-Nano (real-time capable)
- DETR ResNet-50
- Conditional DETR
Best for Small Objects
- Deformable DETR (designed for this)
- DETR ResNet-50/101 DC5 (dilated convolutions help)
- YOLOv8-Nano (with appropriate input size)
Edge Deployment
- YOLOv8-Nano (only practical option)
- Consider quantization for other models
By Use Case
Autonomous Vehicles
- Deformable DETR or YOLOv8-Nano
- Need real-time performance and small object detection
- Large, well-annotated datasets available
Security/Surveillance
- DETR ResNet-101 DC5 for maximum accuracy
- YOLOv8-Nano if real-time processing required
- Often dealing with small, distant objects
Manufacturing Quality Control
- DETR ResNet-50 for balanced performance
- Controlled environment, good lighting
- Precision important, speed often secondary
Retail Analytics
- YOLOv8-Nano for real-time people counting
- Deformable DETR for product detection
- Need balance of speed and accuracy
Wildlife Monitoring
- DETR ResNet-101 or Deformable DETR
- Animals often small in frame
- Accuracy more important than speed
Best Practices
Data Preparation
- Annotation Quality: Accurate bounding boxes are critical
  - Tight boxes around objects (no excessive padding)
  - Consistent annotation guidelines
  - Include partially visible objects if relevant
- Dataset Balance:
  - Aim for balanced instances across classes
  - At least 100 instances per class
  - More instances for difficult classes
- Image Diversity:
  - Various lighting conditions
  - Different angles and scales
  - Diverse backgrounds
  - Include edge cases
- Validation Split:
  - 10-20% of data for validation
  - Ensure the validation set represents the real-world distribution
Training Strategy
- Start with Default Config: Use default learning rates and batch sizes initially
- Monitor Training:
  - Loss should decrease steadily
  - Watch both classification and localization losses
  - Check mAP on the validation set
- Adjust Learning Rate:
  - Reduce if loss oscillates or increases
  - Increase if convergence is very slow
  - Consider learning rate scheduling
- Augmentation:
  - Less aggressive than for classification (preserve spatial information)
  - Common: horizontal flip, brightness/contrast adjustment
  - Avoid: heavy rotation or cropping that cuts off objects
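Box-aware augmentation means every geometric transform applied to the image must also be applied to its boxes. For a horizontal flip of a COCO [x, y, w, h] box this is a one-line transform; a sketch in plain Python (no augmentation library assumed):

```python
def hflip_bbox(bbox, img_width):
    """Mirror a COCO [x, y, w, h] box across the vertical image axis.

    Only x changes: the box's right edge (x + w) becomes the new
    distance from the left edge after flipping.
    """
    x, y, w, h = bbox
    return [img_width - x - w, y, w, h]
```

Augmentation libraries with bounding-box support apply this kind of bookkeeping for every transform, which is why they are preferred over image-only pipelines for detection.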
Common Pitfalls
Small Objects Not Detected
- Use Deformable DETR or models with DC5
- Increase input resolution if possible
- Ensure small objects well-annotated in training data
Many False Positives
- Raise the confidence threshold at inference
- Train longer for better classification
- Check if similar-looking objects confuse model
Poor Localization (Low IoU)
- Focus on bounding box loss during training
- Verify annotation quality and consistency
- May need more training data
Slow Convergence
- DETR models converge slower than YOLO
- Consider Conditional DETR or Deformable DETR
- Increase learning rate cautiously
Class Imbalance Issues
- Ensure adequate examples of rare classes
- Consider weighted sampling or loss reweighting
- May need to collect more data for rare classes
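One simple loss-reweighting scheme weights each class inversely to its instance count, so rare classes contribute more per instance. A sketch (the normalization choice is illustrative, not a prescribed method):

```python
from collections import Counter

def inverse_frequency_weights(category_ids):
    """Per-class loss weights inversely proportional to instance counts,
    normalized so the weights average to 1.0 across classes.

    category_ids: iterable of the category_id of every annotation instance.
    """
    counts = Counter(category_ids)
    raw = {c: 1.0 / n for c, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {c: w * scale for c, w in raw.items()}
```

A class with a third as many instances ends up with three times the weight; the same counts can instead drive weighted sampling if the framework supports it.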
GPU Requirements
Memory Guidelines
DETR Models:
- 8GB minimum (batch_size=2)
- 12-16GB recommended (batch_size=4-8)
- Transformers memory-intensive
YOLO Models:
- 4-8GB sufficient
- More efficient architecture
- Can use larger batch sizes
Training Time Estimates
Small Dataset (1,000 images):
- DETR models: 30-60 minutes per epoch
- YOLO models: 10-20 minutes per epoch
Medium Dataset (5,000 images):
- DETR models: 2-5 hours per epoch
- YOLO models: 30-90 minutes per epoch
Large Dataset (20,000+ images):
- DETR models: 8+ hours per epoch
- YOLO models: 2-4 hours per epoch
Times assume a modern GPU (RTX 3080/4080 or better).
Dataset Size Guidelines
- Minimum: 500 annotated images with 50+ instances per class
- Good: 2,000-5,000 images with 200+ instances per class
- Excellent: 10,000+ images with 1,000+ instances per class
Object detection typically requires more data than classification due to the additional complexity of localization.