SAM (Segment Anything Model)
Foundation model for promptable instance segmentation with points, boxes, or masks
SAM (Segment Anything Model) is a foundation model that can segment any object in an image from a variety of prompts: point clicks, bounding boxes, or rough masks. Unlike traditional models, which require retraining for each new class, SAM's zero-shot capability enables interactive segmentation of arbitrary objects without additional training, making it revolutionary for annotation tools and flexible segmentation tasks.
When to Use SAM
SAM is ideal for:
- Interactive segmentation with user prompts
- Zero-shot segmentation without training data for specific classes
- Annotation tools for creating training data
- Flexible segmentation where object classes aren't predefined
- Research and prototyping requiring quick segmentation
Note: SAM is inference-only in this system: training is not supported, but fine-tuned checkpoints can be loaded for inference.
Strengths
- Promptable: Segment anything by pointing or boxing
- Zero-shot: Works on novel objects without training
- Interactive: Real-time feedback for user-guided segmentation
- Versatile: Multiple prompt types (points, boxes, masks)
- Foundation model: Pre-trained on the SA-1B dataset of over 1 billion masks
- Multi-mask output: Generates multiple plausible segmentations
Weaknesses
- Inference only: Cannot be trained in this system
- No semantic labels: Produces masks only, not class labels
- Interactive: Requires a user prompt for each segmentation
- Not batch-friendly: Not optimized for fully automatic processing
- Large checkpoints: ~2.4GB for the ViT-H variant
Parameters
Inference Configuration
- Input Image: Image to segment
- Finetuned Checkpoint (Optional): Fine-tuned SAM weights
- Prompt Points (Optional): List of (x, y) coordinates with labels (foreground/background)
- Prompt Boxes (Optional): Bounding box coordinates (x1, y1, x2, y2)
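As a rough sketch of how these prompt inputs are typically shaped (the array layout is assumed from common SAM usage; coordinates are hypothetical):

```python
import numpy as np

# Shape sketch of SAM-style prompt inputs (illustrative helper, not the real API).
def make_prompts():
    # N point prompts: (x, y) pixel coordinates plus one label per point,
    # where 1 = foreground (include) and 0 = background (exclude).
    point_coords = np.array([[250, 300], [100, 80]], dtype=np.float32)  # (N, 2)
    point_labels = np.array([1, 0], dtype=np.int32)                     # (N,)
    # One box prompt in XYXY pixel coordinates: (x1, y1, x2, y2).
    box = np.array([50, 60, 400, 500], dtype=np.float32)               # (4,)
    return point_coords, point_labels, box

coords, labels, box = make_prompts()
print(coords.shape, labels.shape, box.shape)  # (2, 2) (2,) (4,)
```

The same arrays work whether you pass one prompt type or several together.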
Multimask Output (Default: true)
- Generate multiple masks with different levels of granularity
- Recommended to keep true for flexibility
- Model automatically ranks masks by quality score
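Ranking the candidates can be sketched with mock arrays standing in for real model output:

```python
import numpy as np

# Mock multimask output: 3 candidate masks over a 4x4 image, plus quality scores.
masks = np.zeros((3, 4, 4), dtype=bool)
masks[0, :1] = True   # coarse candidate
masks[1, :2] = True   # medium candidate
masks[2, :3] = True   # fine candidate
scores = np.array([0.71, 0.88, 0.80])

# Keep the highest-scoring mask, following the model's own quality ranking.
best = masks[scores.argmax()]
print(scores.argmax(), best.sum())  # 1 8
```

In an interactive tool you might instead show all three candidates and let the user pick.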
Mask Threshold (Default: 0.0)
- Threshold applied to the model's mask logits when converting soft masks to binary
- 0.0 is the model's default
- Increase (e.g., 0.5) for tighter masks; decrease for looser ones
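The effect of the threshold can be sketched with mock logits: raising the cutoff shrinks the binary mask.

```python
import numpy as np

# Mock soft-mask logits; SAM binarizes logits at the chosen threshold.
logits = np.array([[-1.0, 0.2],
                   [ 0.4, 0.9]])

loose = logits > 0.0   # default threshold: 3 foreground pixels
tight = logits > 0.5   # higher threshold: tighter mask, 1 foreground pixel
print(loose.sum(), tight.sum())  # 3 1
```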
Usage Patterns
Point Prompts
Click on objects to segment them. Use positive points (foreground) and negative points (background) to refine.
Example: Click center of object (positive), click background areas (negative) to exclude unwanted regions
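The refinement loop above amounts to growing the prompt arrays click by click (array layout assumed; the coordinates are hypothetical):

```python
import numpy as np

# Start with a single positive click near the object center.
point_coords = np.array([[220.0, 180.0]])  # (1, 2) pixel (x, y)
point_labels = np.array([1])               # 1 = foreground

# If the first mask bleeds into the background, add a negative click there.
point_coords = np.vstack([point_coords, [[60.0, 40.0]]])
point_labels = np.append(point_labels, 0)  # 0 = background

print(point_coords.shape, point_labels.tolist())  # (2, 2) [1, 0]
```

Each refinement re-runs prediction with the enlarged arrays; the image itself is only encoded once.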
Box Prompts
Draw bounding box around object for quick segmentation.
Example: Drag box around person - SAM segments precise boundaries
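A box prompt is just XYXY pixel coordinates; one common source is the tight bounding box of an existing rough mask (e.g., from a detector). A small sketch:

```python
import numpy as np

# Rough binary mask, e.g. from a prior detector or a hand-drawn region.
mask = np.zeros((10, 10), dtype=bool)
mask[3:7, 2:8] = True

# Tight XYXY bounding box around the mask, usable as a box prompt.
ys, xs = np.nonzero(mask)
box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])
print(box)  # [2 3 7 6]
```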
Combining Prompts
Use both points and boxes for maximum control.
Example: Box around object + negative points to exclude overlapping objects
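Combined prompts are simply both structures passed together; a sketch of a box around the target plus a negative point inside an overlapping object (hypothetical coordinates, illustrative dict rather than the real API):

```python
import numpy as np

# Box around the target object, plus a negative point on the overlapping object.
prompt = {
    "box": np.array([120, 90, 420, 380]),    # XYXY around the target
    "point_coords": np.array([[400, 110]]),  # click on the overlapping object
    "point_labels": np.array([0]),           # 0 = background: exclude it
}
print(sorted(prompt))  # ['box', 'point_coords', 'point_labels']
```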
Configuration Tips
Best Practices
- Start with single positive point on object center
- Add negative points to refine boundaries
- Use boxes for quick rough segmentation
- Combine prompts for complex scenarios
- Keep multimask_output=true to review alternative masks
When to Use SAM
Interactive Annotation: Creating training data for other models - SAM accelerates manual annotation
Zero-shot Tasks: Need to segment objects without training data - SAM works immediately
Flexible Applications: Object classes change frequently - no retraining needed
Prototyping: Quick experimentation with segmentation - iterate without training
When NOT to Use SAM
Fully Automatic: Need batch processing without interaction - use trained segmentation models instead
Semantic Labels: Need class labels not just masks - SAM doesn't classify, only segments
Real-time Automatic: Need automatic detection + segmentation - use Mask R-CNN or DETR Segmentation
Output
- Segmentation Masks: NumPy arrays of binary masks
- Mask Image: Visualization of the masks overlaid on the input image
- Scores: Quality/confidence scores for each predicted mask (when multimask_output=true)
Example Use Cases
Creating Training Data
Scenario: Need to annotate 1,000 images for custom segmentation task
Why SAM: Dramatically faster than manual pixel-level annotation. Click object, review mask, accept/refine. Can create training set in hours instead of days.
Research Prototyping
Scenario: Testing segmentation idea on new object types
Why SAM: Zero-shot capability means immediate results without collecting and annotating training data.
Interactive Photo Editing
Scenario: Consumer app for selecting and editing objects in photos
Why SAM: Users click objects, get instant precise selections without technical knowledge.
Flexible Segmentation System
Scenario: Segmentation needs change based on user requirements
Why SAM: Can segment any object on-demand without model retraining for each new class.
Comparison with Alternatives
SAM vs Mask R-CNN
Choose SAM when:
- Interactive/promptable segmentation needed
- Zero-shot on novel objects
- Creating annotation tools
- Object classes undefined or changing
Choose Mask R-CNN when:
- Fully automatic segmentation required
- Fixed set of known classes
- Batch processing thousands of images
- Need semantic class labels
- Training data available
SAM vs DETR Segmentation
Choose SAM when:
- Promptable interaction needed
- No training data available
- Quick prototyping
- Flexible, undefined object classes
Choose DETR Segmentation when:
- Automatic panoptic segmentation
- Specific trained classes
- Batch inference
- Unified detection + segmentation
- Can train custom model
SAM vs SegFormer
Choose SAM when:
- Instance segmentation (separate objects)
- Interactive prompting
- Zero-shot capability needed
Choose SegFormer when:
- Semantic segmentation (pixel classes)
- Fully automatic processing
- Dense scene labeling
- Can train on custom data
Technical Notes
- Model Variants: SAM comes in ViT-B, ViT-L, and ViT-H; ViT-H (Huge) is the default and highest quality
- Inference Speed: Roughly 50-200ms per image, depending on prompt complexity and GPU; the image embedding is computed once and reused across prompts
- Memory: ~2-4GB GPU memory for inference
- Fine-tuning: Possible outside this system, load fine-tuned checkpoints for specialized domains