
Image Segmentation

Train models for pixel-level classification and instance segmentation

Image segmentation extends beyond object detection by assigning class labels to every pixel in an image. There are three main types: semantic segmentation (labeling pixels by class), instance segmentation (separating individual object instances), and panoptic segmentation (combining both). These tasks enable precise scene understanding for applications like medical imaging, autonomous driving, and image editing.

Learn About Image Segmentation

New to segmentation? Visit our Image Segmentation Concepts Guide to learn about semantic vs instance segmentation, mask representations, common metrics like IoU and Dice score, and annotation best practices.

Available Models

DETR Segmentation Family

Transformer-based panoptic segmentation extending DETR's object detection capabilities with segmentation masks.

Foundation Models

Large pre-trained models designed for versatile segmentation with minimal fine-tuning.

  • SAM (Segment Anything) - Promptable segmentation for any object with points, boxes, or masks
  • Mask R-CNN - Classic instance segmentation extending Faster R-CNN

Semantic Segmentation

Models focused on pixel-level semantic classification without instance separation.

  • SegFormer-B0 - Efficient hierarchical transformer for semantic segmentation

Common Configuration

Data Requirements

Training Images: Directory containing your images

Segmentation Masks: Either:

  • Folder of masks (for semantic segmentation): PNG/numpy masks where pixel values represent classes
  • COCO-format annotations (for instance segmentation): JSON with polygon or RLE masks

Mask Format Example (Semantic):

train_images/          segmentation_masks/
├── image1.jpg    ->   ├── image1.png  (pixel values = class IDs)
├── image2.jpg    ->   ├── image2.png
└── image3.jpg    ->   └── image3.png

Key Training Parameters

Batch Size: Images processed together

  • DETR segmentation: 2-4 (very memory-intensive)
  • SAM: Inference only (no training)
  • Mask R-CNN: 2-4
  • SegFormer: 4-16 depending on variant

Epochs: Training iterations

  • 1-10 epochs typical for fine-tuning
  • Segmentation often needs more epochs than classification

Learning Rate: Optimization step size

  • DETR: 1e-4 (higher than detection due to additional mask head)
  • Mask R-CNN: 5e-3 (typically trained with SGD rather than AdamW, hence the larger value)
  • SegFormer: 6e-5 (very small)
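The per-model defaults above can be kept together in one place. A sketch of such a lookup table (the parameter names and model keys are illustrative, not tied to a specific training API):

```python
# Illustrative per-model training defaults collected from the values above.
# SAM is omitted because it is inference-only in this workflow.
DEFAULTS = {
    "detr_segmentation": {"batch_size": 2, "lr": 1e-4},
    "mask_rcnn":         {"batch_size": 2, "lr": 5e-3},
    "segformer_b0":      {"batch_size": 8, "lr": 6e-5},
}

def get_defaults(model_name: str) -> dict:
    """Return a copy so callers can tweak values without mutating the table."""
    return dict(DEFAULTS[model_name])

config = get_defaults("segformer_b0")
print(config)  # {'batch_size': 8, 'lr': 6e-05}
```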

Understanding Metrics

IoU (Intersection over Union): Overlap between predicted and ground truth masks

  • Primary metric for semantic segmentation
  • Calculated per-class then averaged (mIoU)
  • Values: 0.0 (no overlap) to 1.0 (perfect match)
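The per-class-then-average recipe can be written out directly. A minimal NumPy sketch (classes absent from both prediction and ground truth are excluded from the mean, which is one common convention):

```python
import numpy as np

def per_class_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> np.ndarray:
    """IoU for each class; NaN where the class is absent from both masks."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious[c] = inter / union
    return ious

pred   = np.array([[0, 0, 1], [1, 1, 1]])
target = np.array([[0, 1, 1], [1, 1, 0]])
ious = per_class_iou(pred, target, num_classes=2)
miou = float(np.nanmean(ious))
print(ious, miou)  # [0.3333... 0.6] 0.4666...
```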

Dice Score: Harmonic mean of precision and recall for masks

  • Often used in medical imaging
  • Similar to IoU but more sensitive to small objects
  • Formula: 2 x |A ∩ B| / (|A| + |B|)
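The formula above translates to a few lines for binary masks. Note that Dice and IoU are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank predictions identically but Dice gives higher absolute scores:

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    inter = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    return 2.0 * inter / total if total > 0 else 1.0

pred   = np.array([1, 1, 1, 0], dtype=bool)
target = np.array([0, 1, 1, 1], dtype=bool)
dice = dice_score(pred, target)   # inter=2, |A|+|B|=6 -> 2/3
iou = 2 / 4                       # same masks: inter=2, union=4 -> 0.5
print(dice, 2 * iou / (1 + iou))  # both 0.6666...
```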

Pixel Accuracy: Percentage of correctly classified pixels

  • Simple to understand but can be misleading
  • Dominated by large classes in imbalanced datasets

mAP (for instance segmentation): Mean Average Precision of masks

  • Same as object detection but evaluated on masks not boxes
  • More stringent than box-based mAP

Choosing the Right Model

By Segmentation Type

Semantic Segmentation (pixel-level classes, no instances)

  • SegFormer-B0: Best accuracy-efficiency balance
  • DETR Segmentation: If you want transformer-based approach

Instance Segmentation (separate object instances)

  • Mask R-CNN: Industry standard, reliable
  • DETR Segmentation: Modern transformer alternative
  • SAM: For promptable segmentation

Panoptic Segmentation (both semantic + instances)

  • DETR Segmentation models: Designed for this
  • Combines "stuff" (background) and "things" (countable objects)

By Priority

Maximum Accuracy

  1. DETR Segmentation ResNet-101 (transformer power)
  2. Mask R-CNN with ResNet-101 backbone
  3. SegFormer larger variants

Fastest Training

  1. SegFormer-B0 (efficient architecture)
  2. Mask R-CNN (mature optimization)
  3. DETR variants (slower convergence)

Best for Small Objects

  1. DETR Segmentation ResNet-50 DC5 (dilated convs)
  2. Mask R-CNN with FPN
  3. SegFormer with high resolution

Interactive/Promptable

  1. SAM (designed for this - inference only)
  2. Other models need full retraining for new classes

By Use Case

Medical Imaging

  • SegFormer-B0 or DETR Segmentation
  • High accuracy critical
  • Often semantic segmentation sufficient

Autonomous Driving

  • DETR Segmentation for panoptic understanding
  • Need both road surface (semantic) and vehicles (instance)
  • Real-time requirements favor SegFormer

Image Editing/Annotation

  • SAM for interactive segmentation
  • Promptable approach ideal for user-guided tasks

Industrial Inspection

  • Mask R-CNN for instance segmentation
  • Reliable, well-tested in production
  • Good for quality control defects

Best Practices

Data Preparation

  1. Mask Quality: Pixel-perfect masks critical

    • Accurate boundaries, no gaps
    • Consistent annotation across dataset
    • Include ambiguous regions appropriately
  2. Class Balance:

    • Balance pixel counts across classes
    • Small classes need oversampling or weighted loss
    • Background class often dominates - handle carefully
  3. Instance Annotation:

    • For instance segmentation, separate touching objects
    • Consistent rules for occlusion
    • Include partially visible instances if relevant
  4. Resolution Considerations:

    • Higher resolution captures fine details
    • But requires more memory and compute
    • Balance based on object sizes

Training Strategy

  1. Start Conservative: Default hyperparameters usually good starting point

  2. Monitor Multiple Metrics:

    • IoU/Dice for segmentation quality
    • Loss for training progress
    • Per-class metrics to identify weak classes
  3. Class Weights:

    • Use weighted loss for imbalanced classes
    • Emphasize difficult or rare classes
    • Prevent background class from dominating
  4. Augmentation:

    • Random crops (ensure objects still visible)
    • Flips and rotations (preserve mask-image alignment)
    • Color augmentation (doesn't affect masks)
    • Avoid transformations that misalign image and mask
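The alignment rule above is easiest to enforce by augmenting image and mask through one function. A minimal NumPy sketch (a real pipeline would use a library with paired transforms, but the principle is the same: geometric transforms apply to both arrays, photometric transforms only to the image):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_hflip(image: np.ndarray, mask: np.ndarray, p: float = 0.5):
    """Apply the SAME horizontal flip to image and mask so they stay aligned."""
    if rng.random() < p:
        image = image[:, ::-1].copy()
        mask = mask[:, ::-1].copy()
    return image, mask

def brightness_jitter(image: np.ndarray, scale: float = 0.1) -> np.ndarray:
    """Photometric augmentation touches only the image, never the mask."""
    factor = 1.0 + rng.uniform(-scale, scale)
    return np.clip(image * factor, 0, 255)

image = np.arange(12, dtype=np.float32).reshape(3, 4)
mask = (image > 5).astype(np.uint8)
aug_image, aug_mask = random_hflip(image, mask, p=1.0)  # p=1 forces the flip
```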

Common Pitfalls

Background Dominance

  • Background pixels often 80-90% of dataset
  • Solution: Weighted loss, focal loss, or crop around objects
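One simple way to build such class weights is inverse pixel frequency. A sketch (the resulting vector could feed, for example, the `weight` argument of a cross-entropy loss; the normalization so weights average to 1 is a design choice, not a requirement):

```python
import numpy as np

def inverse_frequency_weights(mask: np.ndarray, num_classes: int) -> np.ndarray:
    """Per-class loss weights inversely proportional to pixel frequency,
    normalized so the weights average to 1. Rare classes get larger weights."""
    counts = np.bincount(mask.ravel(), minlength=num_classes).astype(float)
    counts = np.maximum(counts, 1.0)  # guard against classes with zero pixels
    return counts.sum() / (num_classes * counts)

# 90% background (class 0), 10% foreground (class 1)
mask = np.zeros((10, 10), dtype=np.int64)
mask[0, :] = 1
weights = inverse_frequency_weights(mask, num_classes=2)
print(weights)  # background down-weighted, foreground up-weighted
```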

Boundary Errors

  • Predictions often poor at object boundaries
  • Solution: Boundary-aware loss, higher resolution, quality masks

Small Object Issues

  • Tiny objects easily missed or poorly segmented
  • Solution: Use DC5/dilated models, higher resolution, oversampling

Memory Problems

  • Segmentation very memory-intensive
  • Solution: Smaller batch sizes, gradient accumulation, lower resolution

Inconsistent Masks

  • Training masks have inconsistent annotation style
  • Solution: Quality control, clear guidelines, re-annotation if needed

Hardware Requirements

Memory Guidelines

Semantic Segmentation:

  • 8GB minimum (SegFormer-B0)
  • 12-16GB recommended

Instance Segmentation:

  • 12GB minimum (Mask R-CNN)
  • 16-24GB recommended for DETR variants

Batch Size Impact:

  • Segmentation uses 2-4x memory of classification
  • Batch size 2-4 typical even with large GPUs
  • Consider gradient accumulation for larger effective batches
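The accumulation idea reduces to: sum gradients over several micro-batches, then take one optimizer step on the average. A framework-free sketch of the bookkeeping (`fake_grad` stands in for a real backward pass):

```python
import numpy as np

def fake_grad(micro_batch: np.ndarray) -> np.ndarray:
    """Stand-in for a backward pass; returns a per-parameter gradient."""
    return micro_batch.mean(axis=0)

# Four micro-batches of 2 samples each -> effective batch size of 8.
micro_batches = [np.ones((2, 3)) * i for i in range(4)]
accum_steps = len(micro_batches)

grad_sum = np.zeros(3)
for mb in micro_batches:
    grad_sum += fake_grad(mb)            # accumulate instead of stepping
effective_grad = grad_sum / accum_steps  # equals the full-batch gradient here

param = np.zeros(3)
param -= 0.1 * effective_grad            # single optimizer step
print(effective_grad, param)
```

With a mean-based loss this exactly matches the gradient of one batch of 8 samples, at the memory cost of a batch of 2.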

Training Time Estimates

Per Epoch (5,000 images):

  • SegFormer-B0: 30-60 minutes
  • Mask R-CNN: 1-2 hours
  • DETR Segmentation: 3-5 hours

Times assume RTX 3080/4080 or better

Dataset Size Guidelines

  • Minimum: 500 images with quality masks
  • Good: 2,000-5,000 images
  • Excellent: 10,000+ images

Segmentation generally needs more data than classification or detection due to pixel-level supervision requirements.

