Object Detection

Locating and classifying multiple objects within images using bounding boxes

Object detection combines the tasks of localization and classification: it identifies where objects are in an image (with bounding boxes) and what categories they belong to. Unlike image classification, which assigns a single label to the entire image, object detection can find multiple objects of different types in a single image.

📚 Training Object Detection Models

Looking to train object detection models? Check out our comprehensive Object Detection Training Guide with detailed parameter documentation for all available models.

What is Object Detection?

Object detection answers two questions simultaneously:

  1. What objects are in the image? (Classification)
  2. Where are they located? (Localization)

For each detected object, the model outputs:

  • Bounding box coordinates: (x, y, width, height) defining the object's location
  • Class label: The object category (e.g., "person", "car", "dog")
  • Confidence score: The model's certainty in the detection (0-1)
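The per-detection output above maps naturally onto a small data structure. A minimal Python sketch (the field names are illustrative, not from any particular framework):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detected object: location, category, and certainty."""
    x: float          # top-left x of the bounding box, in pixels
    y: float          # top-left y of the bounding box, in pixels
    width: float      # box width in pixels
    height: float     # box height in pixels
    label: str        # class label, e.g. "person"
    score: float      # confidence in [0, 1]

# Example: a confident "person" detection
det = Detection(x=100, y=50, width=80, height=160, label="person", score=0.92)
```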

Examples:

  • Autonomous vehicles detecting pedestrians, cars, and traffic signs
  • Surveillance systems identifying people and suspicious objects
  • Retail analytics counting customers and tracking product interactions
  • Medical imaging locating tumors or abnormalities

Key Concepts

Bounding Boxes

Rectangular regions defined by coordinates, with multiple representation formats:

Format variations:

  • (x₁, y₁, x₂, y₂): Top-left and bottom-right corners
  • (x_center, y_center, width, height): YOLO format
  • (x, y, width, height): COCO format with top-left corner
  • Normalized vs. absolute: Coordinates as pixels or fractions of image size
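Converting between these formats is a frequent source of bugs, so it is worth writing the conversions down once. A minimal sketch (function names are ours, chosen for illustration):

```python
def xyxy_to_xywh(box):
    """(x1, y1, x2, y2) corners -> COCO-style (x, y, width, height)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)

def xywh_to_cxcywh(box, img_w, img_h):
    """COCO (x, y, w, h) in pixels -> YOLO (x_center, y_center, w, h),
    normalized to [0, 1] by the image size."""
    x, y, w, h = box
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)

# A box spanning (100, 100) to (200, 300) in a 640x480 image
coco = xyxy_to_xywh((100, 100, 200, 300))   # -> (100, 100, 100, 200)
yolo = xywh_to_cxcywh(coco, 640, 480)
```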

Intersection over Union (IoU)

A fundamental metric measuring overlap between predicted and ground-truth boxes:

\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}

  • IoU = 1.0: Perfect overlap
  • IoU = 0.0: No overlap
  • IoU ≥ 0.5: Commonly considered a "correct" detection
  • Used both for evaluation and during training (matching predictions to ground truth)
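The formula above translates directly to code. A minimal sketch for corner-format (x₁, y₁, x₂, y₂) boxes:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Intersection rectangle: the tighter of the two boxes on each side
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```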

Anchor Boxes

Predefined boxes of various sizes and aspect ratios used as references:

  • Purpose: Simplify the detection problem by predicting offsets from anchors
  • Design: Typically chosen based on object statistics in your dataset
  • K-means clustering: Common method to determine optimal anchor sizes
  • Modern approaches: Anchor-free methods eliminate this requirement
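To illustrate the k-means idea, here is a toy sketch that clusters annotated (width, height) pairs using 1 − IoU as the distance, as popularized by the YOLO papers; production pipelines use more elaborate variants:

```python
import random

def wh_iou(wh1, wh2):
    """IoU of two boxes aligned at the same center, given only (width, height)."""
    inter = min(wh1[0], wh2[0]) * min(wh1[1], wh2[1])
    union = wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter
    return inter / union

def kmeans_anchors(box_sizes, k, iters=50, seed=0):
    """Cluster (w, h) pairs with 1 - IoU as the distance metric."""
    rng = random.Random(seed)
    centers = rng.sample(box_sizes, k)
    for _ in range(iters):
        # Assign each box to the center it overlaps most with
        clusters = [[] for _ in range(k)]
        for wh in box_sizes:
            best = max(range(k), key=lambda i: wh_iou(wh, centers[i]))
            clusters[best].append(wh)
        # Recompute each center as the mean (w, h) of its cluster
        centers = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centers)
```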

Non-Maximum Suppression (NMS)

Post-processing to eliminate duplicate detections:

  1. Sort detections by confidence score
  2. Keep the highest-scoring detection
  3. Remove remaining detections whose IoU with the kept detection exceeds a threshold (typically 0.5)
  4. Repeat for remaining detections
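The four steps above can be sketched in a few lines of Python (an IoU helper is included so the snippet is self-contained):

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop overlapping ones, repeat.

    Returns the indices of the detections that survive.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best, order = order[0], order[1:]
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```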

Variants:

  • Soft-NMS: Reduces scores instead of removing boxes
  • Class-aware NMS: Apply separately per class
  • Distance-based NMS: Consider geometric relationships

Detection Approaches

Two-Stage Detectors: R-CNN Family

These detectors first propose regions, then classify them:

R-CNN (2014): Regions with CNN features

  • Use selective search to generate ~2000 region proposals
  • Extract CNN features from each region
  • Classify with SVM
  • Slow but accurate

Fast R-CNN (2015): Faster processing

  • Process entire image with CNN once
  • Project region proposals onto feature map
  • RoI pooling for fixed-size features
  • Single-stage training

Faster R-CNN (2015): Learnable proposals

  • Replace selective search with Region Proposal Network (RPN)
  • End-to-end trainable
  • ~5-10 FPS inference
  • Strong baseline for accuracy

Mask R-CNN (2017): Instance segmentation extension

  • Adds mask prediction branch
  • Used for both detection and segmentation

Cascade R-CNN: Progressive refinement

  • Multiple detection heads with increasing IoU thresholds
  • Addresses mismatch between training and inference

One-Stage Detectors: Speed-Optimized

These detectors predict directly from feature maps without region proposals:

YOLO (You Only Look Once) series:

  • YOLOv1-v3: Pioneered single-shot detection, 30+ FPS
  • YOLOv4-v5: Improved accuracy and speed balance
  • YOLOv8: Latest with enhanced architecture and training
  • Divide image into grid, predict boxes per cell
  • Extremely fast, suitable for real-time applications

SSD (Single Shot MultiBox Detector):

  • Multi-scale feature maps for different object sizes
  • Faster than Faster R-CNN, more accurate than early YOLO
  • Good balance of speed and accuracy

RetinaNet:

  • Feature Pyramid Network (FPN) for multi-scale detection
  • Focal Loss: Addresses class imbalance in one-stage detectors
  • Competitive accuracy with two-stage methods

EfficientDet:

  • Compound scaling for detection networks
  • Weighted bi-directional FPN (BiFPN)
  • State-of-the-art efficiency across model sizes

Transformer-Based Detectors

Modern approaches using attention mechanisms:

DETR (Detection Transformer):

  • End-to-end detection without hand-designed components
  • No NMS or anchor boxes needed
  • Bipartite matching loss
  • Slower convergence than CNN-based methods

Deformable DETR:

  • Addresses DETR's slow convergence and high memory usage
  • Deformable attention for efficient multi-scale features
  • 10× faster convergence

DINO (DETR with Improved DeNoising Anchor Boxes):

  • Enhanced training techniques
  • State-of-the-art accuracy
  • Better small object detection

Speed vs. Accuracy Trade-offs

Real-time (30+ FPS):

  • YOLOv5/v8 (small/medium variants)
  • SSD with lightweight backbones
  • Use for: Video processing, edge devices, robotics

Balanced (5-15 FPS):

  • YOLOv8 (large variants)
  • EfficientDet
  • RetinaNet
  • Use for: General applications, moderate real-time needs

High accuracy (< 5 FPS):

  • Cascade R-CNN
  • Large DETR variants
  • Ensemble methods
  • Use for: Offline processing, critical accuracy requirements

Evaluation Metrics

Mean Average Precision (mAP)

The primary metric for object detection, combining precision and recall across classes:

Calculation steps:

  1. For each class, compute Precision-Recall curve
  2. Calculate Average Precision (AP) as area under PR curve
  3. Average AP across all classes to get mAP

Common variants:

  • mAP@0.5: IoU threshold of 0.5 (PASCAL VOC metric)
  • mAP@0.5:0.95: Average mAP at IoU thresholds from 0.5 to 0.95 (COCO metric, more strict)
  • mAP@0.75: Stricter IoU requirement

\text{mAP} = \frac{1}{N} \sum_{i=1}^{N} \text{AP}_i

where N is the number of classes.
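A simplified sketch of this computation, using the non-interpolated area under the PR curve (benchmark implementations such as COCO's and PASCAL VOC's interpolate precision, so their exact numbers differ slightly):

```python
def average_precision(tp_flags, num_gt):
    """AP as the area under the precision-recall curve.

    tp_flags: True/False for each detection, sorted by descending confidence.
    num_gt: number of ground-truth objects for this class.
    """
    tp = fp = 0
    area, prev_recall = 0.0, 0.0
    for is_tp in tp_flags:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        # Accumulate precision over each step of recall
        area += precision * (recall - prev_recall)
        prev_recall = recall
    return area

def mean_average_precision(per_class_aps):
    """mAP: the mean of the per-class APs."""
    return sum(per_class_aps) / len(per_class_aps)
```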

Precision and Recall

Precision: Fraction of detections that are correct

\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}

Recall: Fraction of ground-truth objects detected

\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

A detection is a True Positive if:

  • Class prediction is correct
  • IoU with ground truth ≥ threshold (typically 0.5)
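Putting these definitions together, a minimal sketch that greedily matches predictions (highest confidence first) to ground truth; each ground-truth box may match at most one prediction, and the input layout here is illustrative:

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def precision_recall(preds, gts, iou_threshold=0.5):
    """preds: list of (box, label, score); gts: list of (box, label)."""
    matched, tp = set(), 0
    for box, label, _ in sorted(preds, key=lambda p: p[2], reverse=True):
        for i, (gt_box, gt_label) in enumerate(gts):
            # True Positive: correct class, sufficient IoU, GT not yet matched
            if (i not in matched and gt_label == label
                    and iou(box, gt_box) >= iou_threshold):
                matched.add(i)
                tp += 1
                break
    fp = len(preds) - tp   # detections that matched nothing
    fn = len(gts) - tp     # ground-truth objects that were missed
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall
```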

Precision-Recall Curve

Plots precision vs. recall at different confidence thresholds:

  • High threshold: Few detections, high precision, low recall
  • Low threshold: Many detections, low precision, high recall
  • Ideal: Maintains high precision across all recall levels
  • AP: Area under this curve

Object Size Metrics (COCO)

COCO dataset provides size-specific metrics:

  • mAP^small: Objects with area < 32² pixels
  • mAP^medium: Objects with 32² < area < 96² pixels
  • mAP^large: Objects with area > 96² pixels

Useful for understanding model behavior on different object scales.

Frames Per Second (FPS)

Inference speed metric:

  • Measured on specific hardware
  • Higher is better for real-time applications
  • Trade-off with accuracy
  • Consider full pipeline (preprocessing + model + postprocessing)

Annotation Formats

COCO Format (JSON)

Microsoft COCO dataset format, widely used:

{
  "images": [{"id": 1, "file_name": "image.jpg", "width": 640, "height": 480}],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [x, y, width, height],
      "area": width * height,
      "iscrowd": 0
    }
  ],
  "categories": [{"id": 1, "name": "person"}]
}

Characteristics:

  • Supports segmentation masks in addition to boxes
  • (x, y) is top-left corner
  • Single JSON file for entire dataset

PASCAL VOC Format (XML)

XML files, one per image:

<annotation>
  <filename>image.jpg</filename>
  <size><width>640</width><height>480</height></size>
  <object>
    <name>person</name>
    <bndbox>
      <xmin>100</xmin><ymin>100</ymin>
      <xmax>200</xmax><ymax>300</ymax>
    </bndbox>
  </object>
</annotation>

Characteristics:

  • One XML file per image
  • (xmin, ymin, xmax, ymax) format
  • Human-readable

YOLO Format (Text)

Simple text files, one per image:

class_id x_center y_center width height
0 0.5 0.5 0.3 0.4
1 0.7 0.3 0.2 0.25

Characteristics:

  • All values normalized to [0, 1]
  • One line per object
  • Separate classes.txt file lists class names
  • Extremely simple and fast to parse

Conversion Tools

Most frameworks provide conversion utilities:

  • COCO ↔ YOLO converters widely available
  • PASCAL VOC conversion supported in most libraries
  • Custom formats can be converted with scripting
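As an illustration of such a conversion, a minimal COCO-to-YOLO sketch. It takes the already-parsed COCO JSON dict (in the layout shown above) and re-indexes category ids from 0, as YOLO expects:

```python
def coco_to_yolo(coco):
    """Convert a parsed COCO annotation dict to YOLO-format lines.

    Returns {file_name: ["class_id x_center y_center width height", ...]}
    with all coordinates normalized to [0, 1].
    """
    images = {img["id"]: img for img in coco["images"]}
    # YOLO class ids are 0-based and contiguous
    cat_to_idx = {c["id"]: i for i, c in enumerate(coco["categories"])}
    lines = {img["file_name"]: [] for img in coco["images"]}
    for ann in coco["annotations"]:
        img = images[ann["image_id"]]
        x, y, w, h = ann["bbox"]  # COCO: top-left corner + size, in pixels
        cx = (x + w / 2) / img["width"]
        cy = (y + h / 2) / img["height"]
        lines[img["file_name"]].append(
            f"{cat_to_idx[ann['category_id']]} {cx:.6f} {cy:.6f} "
            f"{w / img['width']:.6f} {h / img['height']:.6f}"
        )
    return lines
```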

Data Requirements

Dataset Size

Depends on task complexity and approach:

  • Transfer learning: 100-1000 annotated images (minimum)
  • Training from scratch: 10,000+ images recommended
  • Fine-tuning: Can work with smaller datasets (50-500 images)
  • Complex scenes: More data needed for multi-object scenarios

Annotation Quality

High-quality annotations are critical:

  • Tight bounding boxes: Minimize background pixels
  • Consistent labeling: Same objects labeled the same way
  • Complete annotations: Don't miss objects in crowded scenes
  • Handle occlusion: Annotate partially visible objects
  • Class definitions: Clear guidelines for edge cases

Data Diversity

Model needs to see varied examples:

  • Multiple viewpoints: Front, side, angled views
  • Various scales: Near and far objects
  • Different lighting: Indoor, outdoor, day, night
  • Backgrounds: Cluttered and clean environments
  • Occlusion levels: Fully visible and partially hidden objects

Class Balance

Aim for balanced representation:

  • Similar number of instances per class
  • If imbalanced, use weighted losses or oversampling
  • Monitor per-class metrics during training
  • Consider combining rare classes

Common Challenges

Small Object Detection

Objects occupying few pixels are hard to detect:

  • Problems: Low resolution, little detail, fewer features
  • Solutions:
    • Multi-scale feature pyramids (FPN)
    • Higher input resolution
    • Specialized small-object detectors
    • Tile-based processing for very high-res images
  • Anchors: Design smaller anchor boxes
  • Augmentation: Careful with downscaling

Occlusion and Crowding

Objects partially hidden or overlapping:

  • Problems: Incomplete visual information, ambiguous boundaries
  • Solutions:
    • Train on occluded examples
    • Attention mechanisms to focus on visible parts
    • Context modeling to infer hidden portions
    • Soft-NMS to preserve overlapping detections
  • Annotation: Label even partially visible objects

Class Imbalance

Some classes appear far more frequently:

  • Problems: Model biased toward common classes, poor rare class performance
  • Solutions:
    • Focal loss to down-weight easy examples
    • Class-balanced sampling during training
    • Weighted losses
    • Oversample rare classes or undersample common ones
  • Evaluation: Report per-class metrics, not just overall mAP
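As one concrete example, the focal loss mentioned above can be written for a single binary prediction, following the RetinaNet formulation; the `gamma` and `alpha` defaults below are the commonly used values:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class; y: 1 or 0.
    Easy, well-classified examples are down-weighted by (1 - p_t)^gamma,
    so training focuses on hard examples and rare classes.
    """
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` and `alpha=0.5` this reduces to (half of) the standard cross-entropy; increasing `gamma` shrinks the loss on confident, correct predictions.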

Speed vs. Accuracy Tradeoff

Real-time requirements conflict with accuracy goals:

  • Analysis: Profile your application's speed requirements
  • Solutions:
    • Choose appropriate architecture for your use case
    • Model compression: Quantization, pruning
    • Hardware acceleration: TensorRT, ONNX Runtime
    • Resolution reduction (carefully)
  • Testing: Measure actual inference time on target hardware

Background False Positives

Model detects objects where none exist:

  • Problems: Cluttered backgrounds, similar patterns
  • Solutions:
    • Add "background" or "negative" training examples
    • Adjust confidence thresholds
    • Focal loss reduces focus on easy negatives
    • Hard negative mining
  • Augmentation: Include challenging background images

Domain Shift

Performance drops in new environments:

  • Problems: Different camera angles, lighting, image quality
  • Solutions:
    • Include diverse training data
    • Domain adaptation techniques
    • Test-time augmentation
    • Fine-tune on target domain
  • Validation: Test on data from deployment environment

Practical Applications

Autonomous Vehicles

  • Pedestrian and vehicle detection
  • Traffic sign and signal recognition
  • Lane and road boundary detection
  • Obstacle identification

Surveillance and Security

  • Person detection and tracking
  • Suspicious object identification
  • Crowd monitoring
  • Perimeter intrusion detection

Retail Analytics

  • Customer counting and tracking
  • Product recognition on shelves
  • Queue length monitoring
  • Inventory management

Manufacturing and Quality Control

  • Defect detection on production lines
  • Part identification and counting
  • Assembly verification
  • Safety equipment detection (hard hats, vests)

Medical Imaging

  • Tumor detection in CT/MRI scans
  • Organ localization
  • Anatomical landmark detection
  • Cell counting in microscopy

Agriculture

  • Crop disease detection
  • Fruit counting for yield estimation
  • Weed detection
  • Livestock monitoring

Wildlife Conservation

  • Animal detection in camera traps
  • Species identification
  • Population counting
  • Poaching detection

Augmented Reality

  • Object recognition for AR overlays
  • Spatial understanding
  • Hand and gesture detection
  • Face and facial landmark detection

Choosing an Approach

Consider these factors when selecting a detection method:

For real-time applications (robotics, video analytics):

  • Use one-stage detectors: YOLOv8, SSD
  • Accept slightly lower accuracy for speed
  • Optimize for your specific hardware
  • Consider edge-optimized variants (YOLO-nano, MobileNet-SSD)

For highest accuracy (medical, critical safety):

  • Use two-stage detectors: Cascade R-CNN, Mask R-CNN
  • Or large transformer models: DINO, Deformable DETR
  • Can afford slower inference
  • Ensemble multiple models if needed

For small objects (aerial imagery, microscopy):

  • Feature pyramid networks essential
  • Higher input resolution
  • Specialized architectures like Cascade R-CNN
  • Consider tile-based processing

For limited data:

  • Transfer learning from COCO-pretrained models
  • Data augmentation critical
  • Consider few-shot detection methods
  • Active learning to prioritize annotation

For edge deployment (mobile, embedded):

  • Lightweight architectures: MobileNet-SSD, YOLO-nano
  • Model quantization and pruning
  • Profile on actual target hardware
  • Optimize preprocessing pipeline

Next Steps

Ready to train your own object detection models? Our Object Detection Training Guide provides comprehensive documentation on:

  • Available architectures (YOLO, Faster R-CNN, RetinaNet, etc.)
  • Hyperparameter configuration
  • Data preparation and augmentation strategies
  • Training optimization and debugging

For understanding related computer vision tasks, see:


