
Image Segmentation

Pixel-level classification for precise object boundaries and scene understanding

Image segmentation is the task of classifying every pixel in an image, creating precise delineations of objects and regions. Unlike object detection, which uses bounding boxes, segmentation provides exact boundaries, enabling fine-grained understanding of image content and spatial relationships.

📚 Training Image Segmentation Models

Looking to train image segmentation models? Check out our comprehensive Image Segmentation Training Guide with detailed parameter documentation for all available models.

What is Image Segmentation?

Image segmentation partitions an image into meaningful regions by assigning a label to every pixel. The output is a segmentation mask with the same dimensions as the input image, where each pixel value represents its class or instance ID.

Key characteristics:

  • Pixel-wise classification: Every pixel gets a label
  • Precise boundaries: Exact object shapes, not just bounding boxes
  • Spatial understanding: Complete scene layout and relationships
  • Variable output: Number of segments can vary per image
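The mask concept above can be made concrete with a tiny toy example (the 4×4 mask here is illustrative, not from any dataset):

```python
import numpy as np

# A segmentation mask has the same spatial dimensions as the input image;
# each entry is the class ID assigned to that pixel (0 = background, 1 = object).
mask = np.array([
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
])

assert mask.shape == (4, 4)  # same H x W as the (hypothetical) input image

# Pixel counts per class summarize how much of the image each class covers.
classes, counts = np.unique(mask, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))  # → {0: 9, 1: 7}
```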

Example applications:

  • Medical imaging: Delineating tumors, organs, or tissue types
  • Autonomous driving: Identifying drivable areas, lanes, and obstacles
  • Photo editing: Background removal and object selection
  • Satellite imagery: Land cover classification and building footprints

Types of Segmentation

Semantic Segmentation

Assigns a class label to each pixel, treating all instances of a class identically:

Characteristics:

  • Same class = same label, regardless of instance
  • Example: All people pixels labeled "person"
  • No distinction between individual objects
  • Simpler than instance segmentation

Use cases:

  • Scene understanding (road, sky, building)
  • Medical tissue classification
  • Land cover mapping
  • Image stylization

Output: Single-channel mask where pixel value = class ID

Instance Segmentation

Distinguishes between individual objects of the same class:

Characteristics:

  • Each object gets unique instance ID
  • Example: Person 1, Person 2, Person 3 have different labels
  • Can count objects
  • More complex than semantic segmentation

Use cases:

  • Object counting (cells, people, vehicles)
  • Tracking individual entities
  • Robotic manipulation
  • Retail analytics

Output: Each instance has separate mask or unique ID in combined mask

Panoptic Segmentation

Combines semantic and instance segmentation:

Characteristics:

  • Stuff classes: No instances (sky, road, grass) - semantic labels
  • Thing classes: Countable objects (people, cars) - instance IDs
  • Every pixel has both semantic class and instance ID
  • Unified scene understanding

Use cases:

  • Autonomous driving (complete scene parsing)
  • Robotics (full environment understanding)
  • Comprehensive scene analysis

Output: Two-channel representation (semantic class + instance ID)
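One common way to store this two-channel output in a single array is to pack class and instance into one integer ID (the multiplier of 1000 below is an assumption, borrowed from the style of the COCO panoptic encoding; datasets differ):

```python
import numpy as np

OFFSET = 1000  # assumed encoding: panoptic_id = class_id * OFFSET + instance_id

# Toy 2x2 scene: class 7 = a "stuff" class (e.g. sky), class 1 = a "thing" class.
semantic = np.array([[7, 7], [1, 1]])
instance = np.array([[0, 0], [1, 2]])  # stuff gets instance 0; things get 1, 2, ...

panoptic = semantic * OFFSET + instance

# Both channels can be recovered from the packed representation.
assert np.array_equal(panoptic // OFFSET, semantic)
assert np.array_equal(panoptic % OFFSET, instance)
```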

Key Concepts

Pixel-wise Classification

Unlike image classification (one label per image) or detection (boxes), segmentation makes a decision for each pixel:

  • Dense prediction: Output has same spatial dimensions as input
  • Computational cost: Must process all pixels
  • Context matters: Local and global information both important
  • Class boundaries: Critical to get edges right

Segmentation Masks

The output representation:

  • Binary masks: Single class vs. background (H × W × 1)
  • Multiclass masks: One channel with class IDs (H × W × 1)
  • One-hot encoded: Separate channel per class (H × W × C)
  • Instance masks: Separate mask per instance or combined with unique IDs
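Converting between the class-ID and one-hot representations listed above is a one-liner; this sketch shows the round trip:

```python
import numpy as np

def to_one_hot(mask, num_classes):
    """Convert an (H, W) class-ID mask into an (H, W, C) one-hot mask."""
    return np.eye(num_classes, dtype=np.uint8)[mask]

mask = np.array([[0, 1],
                 [2, 1]])
one_hot = to_one_hot(mask, num_classes=3)
assert one_hot.shape == (2, 2, 3)   # one channel per class

# Class IDs are recovered by taking the argmax over the channel axis.
assert np.array_equal(one_hot.argmax(axis=-1), mask)
```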

Receptive Field

The region of input that influences a single output pixel:

  • Larger receptive field: Better context, global understanding
  • Smaller receptive field: More precise boundaries
  • Design trade-off: Need both local detail and global context
  • Architectures: Use pooling, convolutions, or attention to control

Encoder-Decoder Architecture

Common design pattern for segmentation:

Encoder (downsampling path):

  • Progressively reduce spatial dimensions
  • Increase number of channels
  • Extract high-level semantic features
  • Similar to classification networks

Decoder (upsampling path):

  • Restore spatial resolution
  • Reduce channels to number of classes
  • Combine low-level and high-level features
  • Produce dense predictions

Skip connections: Link encoder and decoder at same spatial scales to preserve fine details
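The upsample-and-concatenate mechanics of a skip connection can be sketched framework-free with numpy (shapes and names here are illustrative; real decoders use learned upsampling, not plain repetition):

```python
import numpy as np

# Encoder features at full resolution (fine, low-level detail) and decoder
# features at half resolution (coarse, high-level semantics), in (H, W, C) layout.
encoder_feat = np.random.rand(16, 16, 32)
decoder_feat = np.random.rand(8, 8, 64)

# The decoder upsamples back to the encoder's spatial scale
# (nearest-neighbor here, by repeating rows and columns) ...
upsampled = decoder_feat.repeat(2, axis=0).repeat(2, axis=1)  # -> (16, 16, 64)

# ... and the skip connection concatenates encoder features along the channel
# axis, so fine boundary detail is available when producing dense predictions.
fused = np.concatenate([encoder_feat, upsampled], axis=-1)    # -> (16, 16, 96)
assert fused.shape == (16, 16, 96)
```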

Segmentation Approaches

Fully Convolutional Networks (FCN)

First successful deep learning approach for segmentation:

  • Replace fully connected layers with convolutions
  • Arbitrary input size
  • Upsampling through transposed convolutions
  • Skip connections to combine coarse and fine features
  • Foundation for modern methods

U-Net

Highly successful architecture, especially for medical imaging:

Structure:

  • Symmetric encoder-decoder with strong skip connections
  • Concatenate features from encoder to decoder
  • Large number of feature channels in upsampling
  • Works well with limited training data

Strengths:

  • Excellent boundary precision
  • Effective with small datasets
  • Fast training and inference
  • Widely adopted baseline

Variants:

  • U-Net++: Nested skip pathways
  • Attention U-Net: Attention gates in skip connections
  • 3D U-Net: Volumetric segmentation

DeepLab Series

Advanced techniques for improved accuracy:

DeepLab v1-v3+ innovations:

  • Atrous convolution: Increase receptive field without losing resolution
  • Atrous Spatial Pyramid Pooling (ASPP): Multi-scale context with parallel atrous convolutions
  • Encoder-decoder: Combine ASPP with decoder for boundary refinement
  • Separable convolutions: Efficiency improvements

Strengths:

  • Strong performance on benchmarks
  • Good multi-scale understanding
  • Relatively efficient

Mask R-CNN

Extends Faster R-CNN for instance segmentation:

Approach:

  1. Detect objects with bounding boxes (Faster R-CNN)
  2. Add mask prediction branch per detection
  3. Parallel mask and class prediction
  4. RoI Align for precise spatial localization

Strengths:

  • State-of-the-art instance segmentation
  • Unified detection and segmentation
  • Handles overlapping objects

Limitations:

  • Two-stage design (slower than one-stage)
  • Complex training pipeline

Segment Anything Model (SAM)

Foundation model for promptable segmentation:

Capabilities:

  • Zero-shot segmentation with prompts
  • Points, boxes, or text as input
  • Segments anything without task-specific training
  • Interactive refinement

Use cases:

  • Annotation tools
  • Quick prototyping
  • Novel object segmentation
  • Data generation

Transformer-Based Methods

Modern approaches using attention:

SegFormer:

  • Hierarchical transformer encoder
  • Lightweight MLP decoder
  • Efficient and accurate
  • No positional encoding needed

Mask2Former:

  • Universal architecture for semantic, instance, and panoptic
  • Masked attention in transformer decoder
  • State-of-the-art across all segmentation types

Evaluation Metrics

Intersection over Union (IoU) / Jaccard Index

Most common metric, measuring overlap between prediction and ground truth:

$$\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} = \frac{|A \cap B|}{|A \cup B|}$$

  • Range: 0 (no overlap) to 1 (perfect match)
  • Per-class IoU: Computed separately for each class
  • Mean IoU (mIoU): Average across all classes
  • Strengths: Intuitive, standard metric
  • Limitations: Sensitive to class imbalance
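Per-class IoU and mIoU are straightforward to compute from two class-ID masks; a minimal sketch (absent classes are skipped via NaN so they don't distort the mean):

```python
import numpy as np

def iou_per_class(pred, target, num_classes):
    """Per-class IoU between two (H, W) class-ID masks; NaN for absent classes."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        ious.append(np.logical_and(p, t).sum() / union if union else np.nan)
    return ious

pred   = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])

ious = iou_per_class(pred, target, num_classes=2)
# class 0: overlap 1, union 2 -> 0.5; class 1: overlap 2, union 3 -> 2/3
miou = np.nanmean(ious)
```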

Dice Coefficient / F1-Score

Alternative overlap metric, more robust to class imbalance:

$$\text{Dice} = \frac{2\,|A \cap B|}{|A| + |B|}$$

  • Range: 0 to 1, same as IoU
  • Relationship to IoU:
$$\text{Dice} = \frac{2 \cdot \text{IoU}}{1 + \text{IoU}}$$
  • Medical imaging: Commonly preferred over IoU
  • Differentiable: Can be used as loss function (Dice loss)
  • Weighting: More weight to overlap than union
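The Dice coefficient and its relationship to IoU can be verified on a small pair of binary masks:

```python
import numpy as np

def dice(pred, target):
    """Dice coefficient between two binary (boolean) masks."""
    inter = np.logical_and(pred, target).sum()
    return 2 * inter / (pred.sum() + target.sum())

pred   = np.array([[0, 0], [1, 1]], dtype=bool)
target = np.array([[0, 1], [1, 1]], dtype=bool)

d = dice(pred, target)                      # 2*2 / (2 + 3) = 0.8
iou = 2 / 3                                 # IoU of the same pair of masks
assert np.isclose(d, 2 * iou / (1 + iou))   # Dice = 2*IoU / (1 + IoU)
```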

Pixel Accuracy

Simplest metric, fraction of correctly classified pixels:

$$\text{Pixel Accuracy} = \frac{\text{Correctly Classified Pixels}}{\text{Total Pixels}}$$

  • Easy to understand: Direct accuracy measure
  • Problem: Misleading with class imbalance
  • Example: 90% background → 90% accuracy by predicting all background
  • Use: Supplementary metric, not primary
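The imbalance pitfall is easy to reproduce: in this toy example, a model that predicts only background scores 90% pixel accuracy while its object-class IoU is zero.

```python
import numpy as np

# 10x10 ground truth: 90 background pixels (class 0), 10 object pixels (class 1).
target = np.zeros((10, 10), dtype=int)
target[0, :] = 1

pred = np.zeros_like(target)   # degenerate model: predicts background everywhere

accuracy = (pred == target).mean()   # 0.9, despite finding zero object pixels
union = np.logical_or(pred == 1, target == 1).sum()
object_iou = np.logical_and(pred == 1, target == 1).sum() / union   # 0.0
```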

Boundary Metrics

Evaluate boundary precision, important for applications requiring exact edges:

  • Boundary IoU: IoU computed only on boundary pixels
  • Boundary F1: Precision and recall of boundary predictions
  • Average Surface Distance: Mean distance between predicted and true boundaries

Use cases:

  • Medical imaging: Precise organ boundaries critical
  • Photo editing: Clean object cutouts
  • Autonomous driving: Accurate lane markings

Class-Weighted Metrics

Address class imbalance by weighting contributions:

  • Weight by inverse class frequency
  • Focus on rare but important classes
  • More representative of practical performance
  • Prevents dominant classes from skewing results

Instance Segmentation Metrics

For instance segmentation, combine detection and mask quality:

Average Precision (AP): Same as object detection, but matching requires mask IoU

  • AP@0.5: IoU threshold 0.5
  • AP@0.75: Stricter threshold
  • AP@0.5:0.95: COCO metric averaging multiple thresholds

Panoptic Quality (PQ): For panoptic segmentation

$$\text{PQ} = \frac{\sum_{(p,g) \in \mathit{TP}} \text{IoU}(p,g)}{|\mathit{TP}| + \frac{1}{2}|\mathit{FP}| + \frac{1}{2}|\mathit{FN}|}$$

Combines segmentation quality and recognition quality.
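Given matched segments, the PQ formula itself is simple arithmetic; this sketch assumes the matching step (pairing predicted and ground-truth segments, typically at IoU > 0.5) has already been done:

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """PQ from the IoUs of matched (TP) segment pairs plus FP and FN counts."""
    denom = len(tp_ious) + 0.5 * num_fp + 0.5 * num_fn
    return sum(tp_ious) / denom if denom else 0.0

# Two matched segments (IoUs 0.8 and 0.6), one false positive, one false negative:
pq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)   # 1.4 / 3 ≈ 0.467
```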

Annotation Requirements

Pixel-Level Labeling

Segmentation requires precise, labor-intensive annotations:

  • Polygons: Draw boundaries around objects
  • Brush tools: Paint pixels in annotation software
  • Superpixels: Group similar pixels, then label groups
  • Time-intensive: 10-100× longer than bounding boxes

Annotation Tools

Popular tools for creating segmentation datasets:

  • CVAT: Open-source, polygon and brush tools
  • Labelbox: Cloud-based, collaborative features
  • Supervisely: Specialized for segmentation
  • V7: Advanced automation and quality control
  • Label Studio: Open-source, flexible

Quality Considerations

Critical factors for annotation quality:

  • Boundary precision: Tight fit to object edges
  • Consistency: Same objects labeled the same way across images
  • Occlusion handling: Annotate visible portions only or infer hidden parts
  • Small regions: Don't miss thin structures or tiny objects
  • Ambiguous boundaries: Clear guidelines for gradual transitions

Semi-Automated Annotation

Reduce annotation burden:

  • Interactive segmentation: SAM, Interactive GrabCut
  • Superpixels: SLIC, Felzenszwalb
  • Propagation: Label one frame, propagate to video
  • Active learning: Model suggests uncertain regions
  • Weak supervision: Use boxes, scribbles, or points instead of full masks

Data Requirements

Dataset Size

Highly dependent on task complexity:

  • Transfer learning: 50-500 images can work
  • Training from scratch: 1,000-10,000+ images
  • Medical imaging: Often 100-1,000 (but high-quality)
  • Simple backgrounds: Fewer images needed
  • Complex scenes: More data required

Data Diversity

Essential for generalization:

  • Viewing angles: Top-down, side, diagonal
  • Scales: Near and far objects
  • Lighting: Various conditions
  • Occlusion: Different overlap levels
  • Backgrounds: Cluttered and clean
  • Object variations: Size, shape, appearance

Augmentation Strategies

Critical for segmentation with limited data:

Geometric augmentations (apply to both image and mask):

  • Rotation, flipping, scaling
  • Cropping, elastic deformations
  • Affine transformations
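The key rule for geometric augmentations is that image and mask must receive the identical transform; a minimal sketch with a random horizontal flip (function name and shapes are illustrative):

```python
import numpy as np

def random_hflip(image, mask, rng):
    """Horizontally flip image and mask together so labels stay aligned."""
    if rng.random() < 0.5:
        image = image[:, ::-1]   # flip along the width axis
        mask = mask[:, ::-1]     # identical transform applied to the mask
    return image, mask

rng = np.random.default_rng(0)
image = np.arange(12, dtype=float).reshape(2, 3, 2)   # toy (H, W, C) image
mask = np.array([[0, 1, 2],
                 [2, 1, 0]])
aug_img, aug_mask = random_hflip(image, mask, rng)
# Whatever happened to the image happened to the mask as well.
```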

Color augmentations (apply to image only):

  • Brightness, contrast, saturation
  • Hue shifts, color jitter
  • Gaussian noise, blur

Advanced techniques:

  • CutMix, MixUp adapted for segmentation
  • CopyPaste: Paste objects from other images
  • Synthetic data generation

Common Challenges

Boundary Precision

Exact edges are difficult to predict:

  • Problem: Blurry or jagged boundaries
  • Solutions:
    • Multi-scale features with skip connections
    • Boundary-aware loss functions
    • Higher resolution inputs
    • Post-processing refinement (CRF)
    • Attention mechanisms near boundaries
  • Evaluation: Use boundary-specific metrics

Small Regions and Thin Structures

Fine details get lost in downsampling:

  • Problem: Missing small objects, broken thin structures (vessels, roads)
  • Solutions:
    • Preserve high resolution: Less aggressive downsampling
    • Strong skip connections: U-Net style
    • Specialized losses: Weight small regions more
    • Higher input resolution
    • Attention to fine details
  • Medical imaging: Particularly critical for vessels, nerves

Class Imbalance

Segmentation often has severe imbalance:

  • Example: 95% background, 5% objects
  • Problem: Model biased toward majority class
  • Solutions:
    • Weighted losses (inverse frequency weighting)
    • Focal loss: Down-weight easy examples
    • Dice loss: More robust to imbalance than cross-entropy
    • Balanced sampling: Sample minority class regions more
    • Evaluation: Use mIoU, not pixel accuracy
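Inverse-frequency weighting is simple to compute from label statistics; a minimal sketch (the normalization to mean 1 is one common convention, not the only one):

```python
import numpy as np

def inverse_frequency_weights(mask, num_classes):
    """Per-class weights ∝ inverse pixel frequency, normalized to mean 1."""
    counts = np.bincount(mask.ravel(), minlength=num_classes).astype(float)
    counts = np.maximum(counts, 1)   # avoid division by zero for absent classes
    return counts.sum() / (num_classes * counts)

mask = np.array([[0, 0, 0], [0, 0, 1]])   # 5 background pixels, 1 object pixel
w = inverse_frequency_weights(mask, num_classes=2)
# The rare class gets a much larger loss weight than the dominant one.
assert w[1] > w[0]
```

These weights would typically be passed to a weighted cross-entropy loss so that errors on the rare class cost more.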

Computational Cost

Dense prediction is memory and compute intensive:

  • Problem: High memory usage, slow training/inference
  • Solutions:
    • Patch-based processing: Process image in tiles
    • Lower resolution: Trade accuracy for speed
    • Efficient architectures: Separable convolutions, pruning
    • Mixed precision training: FP16 instead of FP32
    • Gradient checkpointing: Trade compute for memory
  • Large images: Satellite, medical scans need special handling
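Patch-based processing amounts to sliding a window over the image; a minimal sketch of the tiling step (real pipelines usually pad or shift the last tile so edge pixels are always covered, and blend overlapping predictions when stitching):

```python
import numpy as np

def tile_image(image, tile, stride):
    """Split an (H, W, ...) image into overlapping square tiles."""
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            tiles.append(image[y:y + tile, x:x + tile])
    return tiles

image = np.zeros((256, 256, 3))
tiles = tile_image(image, tile=128, stride=64)   # 64 px overlap between neighbors
assert len(tiles) == 9                           # a 3 x 3 grid of tiles
```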

Ambiguous Boundaries

Not all boundaries are clear-cut:

  • Problem: Fuzzy edges (hair, glass, reflections)
  • Solutions:
    • Soft labels: Probability instead of hard mask
    • Trimap annotations: Foreground/background/uncertain
    • Matting techniques: Alpha channel prediction
    • Multiple annotators: Capture uncertainty
    • Model confidence: Output probability masks

Instance Separation

Distinguishing touching objects of same class:

  • Problem: Semantic segmentation merges touching instances
  • Solutions:
    • Use instance segmentation methods (Mask R-CNN)
    • Watershed-based post-processing
    • Contour detection
    • Distance transform learning
    • Panoptic segmentation architectures

Practical Applications

Medical Imaging

Precise delineation of anatomical structures and pathologies:

  • Organ segmentation: Liver, heart, brain structures in CT/MRI
  • Tumor segmentation: Delineate cancer regions for treatment planning
  • Cell segmentation: Count and analyze cells in microscopy
  • Vessel segmentation: Blood vessels, neural tracts
  • Critical requirements: High accuracy, boundary precision, interpretability

Autonomous Driving

Complete scene understanding for safe navigation:

  • Drivable area: Road and lane segmentation
  • Object classes: Vehicles, pedestrians, cyclists, traffic signs
  • Static infrastructure: Barriers, poles, traffic lights
  • Panoptic: Instance-aware understanding of scene
  • Critical requirements: Real-time processing, robustness, safety

Image Editing and Content Creation

Precise object selection and manipulation:

  • Background removal: Portrait mode, product photography
  • Object selection: Select objects for editing
  • Style transfer: Apply effects to specific regions
  • Virtual try-on: Segment person for clothing overlay
  • Requirements: Clean boundaries, interactive speed

Satellite and Aerial Imagery

Land cover classification and infrastructure mapping:

  • Land use: Forest, water, urban, agricultural
  • Building footprints: Automated mapping
  • Road networks: Infrastructure detection
  • Change detection: Compare imagery over time
  • Requirements: Handle large images, multi-scale objects

Agriculture

Crop monitoring and precision farming:

  • Crop segmentation: Distinguish crop types
  • Disease detection: Segment affected areas
  • Weed identification: Targeted herbicide application
  • Yield estimation: Segment and count fruits/vegetables
  • Requirements: Outdoor lighting variations, occlusion

Robotics

Scene understanding for manipulation and navigation:

  • Object segmentation: Identify graspable objects
  • Bin picking: Segment objects in cluttered bins
  • Navigation: Traversable surface detection
  • Human-robot interaction: Person segmentation for safety
  • Requirements: Real-time, 3D understanding

Choosing an Approach

Select based on your specific requirements:

For semantic segmentation (scene understanding):

  • High accuracy: DeepLab v3+, SegFormer
  • Limited data: U-Net with strong augmentation
  • Real-time: BiSeNet, FasterSeg, U-Net (small)
  • Medical imaging: U-Net, U-Net++ (proven track record)

For instance segmentation (object counting):

  • General purpose: Mask R-CNN (reliable baseline)
  • State-of-the-art: Mask2Former
  • Real-time: YOLACT, SOLOv2
  • Quality over speed: Cascade Mask R-CNN

For interactive segmentation (annotation tools):

  • Foundation models: Segment Anything Model (SAM)
  • Fast iteration: Interactive GrabCut, MIVOSNet
  • Custom needs: Train U-Net with click/scribble inputs

For limited annotation budget:

  • Pretrained models: Fine-tune from COCO, ImageNet
  • Weak supervision: Use boxes, points, or scribbles
  • Semi-supervised: Combine labeled and unlabeled data
  • Interactive tools: SAM for rapid annotation

Next Steps

Ready to train your own segmentation models? Our Image Segmentation Training Guide provides comprehensive documentation on:

  • Available architectures (U-Net, DeepLab, Mask R-CNN, etc.)
  • Loss functions for segmentation (Cross-entropy, Dice, Focal)
  • Data preparation and augmentation techniques
  • Training strategies and optimization tips

For related computer vision tasks, such as object detection and image classification, see the corresponding documentation pages.
