
Depth Estimation

Predicting the distance of objects from the camera to understand 3D scene structure

Depth estimation is the task of predicting the distance from the camera to every pixel or point in an image, creating a depth map that represents the 3D structure of the scene. It's a fundamental computer vision task that enables machines to understand spatial relationships and 3D geometry from 2D images.

📚 Training Depth Estimation Models

Looking to train depth estimation models? Check out our comprehensive Depth Estimation Training Guide with detailed parameter documentation for all available models and training techniques.

What is Depth Estimation?

Depth estimation takes a 2D image as input and produces a depth map — an image where each pixel value represents the distance from the camera to the corresponding point in the scene. Brighter pixels typically indicate objects closer to the camera, while darker pixels represent farther objects (or vice versa, depending on representation).

Examples:

  • A photo of a street scene → depth map showing cars nearby and buildings far away
  • Indoor room image → depth values for furniture, walls, and floor at different distances
  • Portrait photo → depth map distinguishing person from background
  • Landscape image → depth gradients from foreground to distant mountains

Applications: Autonomous driving, robotics, AR/VR, 3D reconstruction, photo effects, and accessibility tools.

Key Concepts

Monocular vs. Stereo Depth Estimation

Monocular depth estimation:

  • Uses a single image as input
  • More challenging: lacks explicit stereo cues
  • Relies on learned priors about object sizes, perspective, occlusion
  • More practical: works with any camera, including existing photos
  • This is the focus of most modern deep learning approaches

Stereo depth estimation:

  • Uses two images from different viewpoints (like human eyes)
  • Triangulates depth through disparity between views
  • More geometrically grounded
  • Requires calibrated stereo camera setup
  • Classical approaches well-established

Multi-view depth estimation:

  • Uses multiple images from different angles
  • Structure-from-Motion (SfM) techniques
  • More accurate but requires multiple captures

Depth Maps

The output of depth estimation — a 2D map where each pixel encodes depth information:

Representation formats:

  • Inverse depth: d = 1/z (common in learning-based methods)
  • Disparity: Related to depth in stereo vision
  • Metric depth: Actual distance in meters
  • Relative depth: Ordinal relationships without absolute scale

Visualization:

  • Typically shown as grayscale images
  • Colormaps (Viridis, Plasma, Turbo) for better perception
  • Lighter/warmer colors for near, darker/cooler for far (or inverted)

Resolution: Usually matches input image resolution, though some methods predict at different scales.

Relative vs. Absolute Depth

Relative (ordinal) depth:

  • Predicts depth ordering: which objects are closer or farther
  • No absolute scale (one scene's "10" might be another's "100")
  • Sufficient for many applications (photo effects, occlusion reasoning)
  • Easier to learn: consistent across different scenes
  • Most general-purpose models predict relative depth

Absolute (metric) depth:

  • Predicts actual distances in real-world units (meters)
  • Requires training data with ground-truth metric measurements
  • Scene-specific scale factor
  • Essential for robotics, autonomous driving, AR
  • Harder to generalize across domains

Scale ambiguity: Monocular depth estimation fundamentally cannot determine absolute scale without additional information (like known object sizes).

Disparity

In stereo vision, disparity is the difference in image location of an object when viewed from different positions:

Relationship to depth:

z = \frac{f \cdot b}{d}

where:

  • z = depth
  • f = focal length
  • b = baseline (distance between cameras)
  • d = disparity

Inverse relationship: Closer objects have larger disparity, farther objects smaller.
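The conversion is a one-liner in NumPy; the focal length and baseline below are illustrative values, not taken from any particular stereo rig:

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m, eps=1e-6):
    """Convert a disparity map (pixels) to depth (meters): z = f * b / d."""
    return focal_px * baseline_m / np.maximum(disparity_px, eps)

# Illustrative rig: f = 700 px, b = 0.54 m, so f * b = 378
depth = disparity_to_depth(np.array([10.0, 50.0]), focal_px=700.0, baseline_m=0.54)
# -> [37.8, 7.56]: the larger disparity maps to the closer point
```

Clamping the disparity with `eps` guards against division by zero in occluded or unmatched regions.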

Scale Ambiguity

Fundamental challenge in monocular depth estimation:

Problem: From a single image, a small nearby object looks identical to a large far-away object.

Example: A toy car close to the camera vs. a real car far away can produce the same image.

Implications:

  • Cannot recover absolute metric depth without additional cues
  • Models learn statistical priors about typical object sizes
  • Scale may vary between scenes
  • Fine-tuning on domain-specific data can improve scale consistency

Solutions:

  • Multi-view methods (structure-from-motion)
  • Known reference objects in scene
  • Sensor fusion (camera + LiDAR)
  • Domain-specific training (e.g., only indoor or only outdoor)
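The known-reference-object cue follows directly from the pinhole model: an object of real height H that spans h pixels under focal length f sits at depth z = f · H / h. A minimal sketch with made-up numbers:

```python
def depth_from_known_size(focal_px, real_height_m, pixel_height):
    """Pinhole model: h_px = f * H / z  =>  z = f * H / h_px."""
    return focal_px * real_height_m / pixel_height

# A 1.5 m tall object spanning 150 px under a 1000 px focal length (illustrative):
z = depth_from_known_size(1000.0, 1.5, 150.0)  # 10.0 m
```

The recovered metric depth at that object can then be used as a scale factor to anchor an otherwise relative depth map.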

Approaches and Architectures

CNN-Based Methods

Convolutional Neural Networks for dense depth prediction:

Early approaches:

  • Eigen et al. (2014): Multi-scale architecture, coarse-to-fine prediction
  • FCRN (Fully Convolutional Residual Networks): Up-convolution for resolution recovery
  • Typically used encoder-decoder architectures

MiDaS family (Mixed Data Strategy):

  • MiDaS v2: ResNet or ResNeXt encoder, multi-scale decoder
  • MiDaS v3: Combines multiple datasets with affine-invariant loss
  • MiDaS v3.1: Adds smaller efficient models (DPT-Hybrid)
  • Trained on diverse datasets for generalization
  • Predicts relative depth (robust across domains)
  • State-of-the-art zero-shot performance

Key techniques:

  • Skip connections from encoder to decoder
  • Multi-scale feature fusion
  • Up-sampling strategies (transpose convolutions, bilinear interpolation)

Transformer-Based Methods

Vision Transformers applied to depth estimation:

DPT (Dense Prediction Transformer):

  • Vision Transformer (ViT) backbone
  • Reassembles tokens at multiple scales
  • Convolutional decoder for dense predictions
  • Better global context understanding than CNNs
  • Included in MiDaS v3 family

Depth Anything (2024):

  • Large-scale foundation model approach
  • Trained on massive unlabeled data with pseudo-labels
  • Strong zero-shot generalization
  • Versions: Small, Base, Large
  • Excellent fine-grained detail and edge preservation

DepthFormer:

  • Transformer encoder with hierarchical features
  • Efficient attention mechanisms
  • Competitive accuracy with lower compute

Advantages of Transformers:

  • Better long-range dependencies
  • More effective at capturing global scene context
  • Superior performance with sufficient data
  • Main trade-off: higher computational cost than CNNs

Self-Supervised Learning

Training depth models without ground-truth depth labels:

Core idea: Use stereo pairs or video sequences and enforce geometric consistency.

Monodepth2:

  • Predicts depth from monocular video
  • Loss based on photometric reprojection error
  • Learns from temporal consistency
  • No explicit depth supervision needed

Process:

  1. Predict depth for frame t
  2. Predict camera pose between frames t and t+1
  3. Warp frame t+1 to frame t using predicted depth and pose
  4. Minimize reconstruction error between original and warped frame

Formula:

L_{\text{photo}} = \sum_p \min_i \, \text{pe}\!\left(I_t(p), I'_{t+i}(p)\right)

where I'_{t+i} is the source frame warped into frame t, and pe is a photometric error combining an SSIM term with an L1 term.
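A stripped-down sketch of the per-pixel minimum (L1 only; Monodepth2's full error also mixes in an SSIM term, and the warping step itself is omitted here):

```python
import numpy as np

def min_photometric_loss(target, warped_sources):
    """Per-pixel minimum of L1 reprojection error over warped source frames.

    target:         (H, W) frame I_t
    warped_sources: list of (H, W) frames I'_{t+i} already warped into frame t
    """
    errors = np.stack([np.abs(target - w) for w in warped_sources])  # (N, H, W)
    return errors.min(axis=0).sum()  # min over i, then sum over pixels p
```

Taking the minimum over source frames (rather than the average) lets each pixel pick the frame where it is not occluded, which is the key trick in Monodepth2.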

Benefits:

  • No expensive depth annotations needed
  • Can leverage abundant video data
  • Learns from real-world geometric constraints

Challenges:

  • Moving objects violate static scene assumption
  • Textureless regions provide little supervision
  • Scale ambiguity remains

Multi-Task Learning

Learning depth jointly with related tasks:

Common combinations:

  • Depth + Semantic Segmentation: Shared features benefit both tasks
  • Depth + Surface Normals: Geometric consistency
  • Depth + Optical Flow: Motion understanding

Benefits:

  • Improved generalization through shared representations
  • Mutual regularization between tasks
  • More efficient use of training data

Example architecture:

  • Shared encoder
  • Task-specific decoder heads
  • Multi-task loss: L = \alpha L_{\text{depth}} + \beta L_{\text{segmentation}}

Evaluation Metrics

Absolute Relative Error (Abs Rel)

Measures relative depth error averaged over all pixels:

\text{Abs Rel} = \frac{1}{|T|} \sum_{i \in T} \frac{|z_i - \hat{z}_i|}{z_i}

where z_i is the ground-truth depth, \hat{z}_i the predicted depth for pixel i, and T the set of valid pixels.

Interpretation:

  • Lower is better (0 = perfect prediction)
  • Scale-independent: works for relative and absolute depth
  • Emphasizes relative accuracy rather than absolute values
  • Commonly reported metric
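The metric is a few lines of NumPy; masking out invalid (zero) ground-truth pixels is a common convention assumed here:

```python
import numpy as np

def abs_rel(gt, pred, eps=1e-6):
    """Absolute relative error, averaged over pixels with valid ground truth."""
    mask = gt > eps
    return np.mean(np.abs(gt[mask] - pred[mask]) / gt[mask])

gt = np.array([2.0, 4.0])
pred = np.array([2.2, 3.6])
# |2 - 2.2| / 2 = 0.1 and |4 - 3.6| / 4 = 0.1, so Abs Rel = 0.1
err = abs_rel(gt, pred)
```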

Root Mean Squared Error (RMSE)

Standard metric for prediction error:

\text{RMSE} = \sqrt{\frac{1}{|T|} \sum_{i \in T} (z_i - \hat{z}_i)^2}

Interpretation:

  • Lower is better
  • Units match depth units (meters for metric depth)
  • Sensitive to outliers (large errors heavily penalized)
  • Commonly used alongside RMSE log

Log RMSE

RMSE in logarithmic space:

\text{RMSE log} = \sqrt{\frac{1}{|T|} \sum_{i \in T} (\log z_i - \log \hat{z}_i)^2}

Benefits:

  • Less sensitive to absolute scale
  • Treats relative errors more uniformly across depth ranges
  • Better for relative depth evaluation
  • More robust to outliers

Threshold Accuracy (δ < 1.25)

Percentage of pixels with relative error below threshold:

\delta_t = \frac{1}{|T|} \sum_{i \in T} \mathbb{1}\!\left[\max\!\left(\frac{z_i}{\hat{z}_i}, \frac{\hat{z}_i}{z_i}\right) < t\right]

Common thresholds:

  • \delta < 1.25 (δ₁)
  • \delta < 1.25^2 = 1.5625 (δ₂)
  • \delta < 1.25^3 \approx 1.953 (δ₃)

Interpretation:

  • Higher is better (1.0 = 100% of pixels accurate)
  • \delta_1 > 0.9 indicates very good performance
  • Scale-invariant metric
  • Intuitive: "percentage of pixels with small enough error"
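Computed directly from the definition (NumPy sketch, assuming all pixels are valid):

```python
import numpy as np

def threshold_accuracy(gt, pred, t=1.25):
    """Fraction of pixels where max(gt/pred, pred/gt) < t."""
    ratio = np.maximum(gt / pred, pred / gt)
    return np.mean(ratio < t)

gt = np.array([1.0, 2.0, 3.0, 4.0])
pred = np.array([1.1, 2.0, 4.0, 8.0])      # ratios: 1.1, 1.0, 1.33, 2.0
d1 = threshold_accuracy(gt, pred)           # δ₁ = 0.5
d2 = threshold_accuracy(gt, pred, 1.25**2)  # δ₂ = 0.75
```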

Squared Relative Error (Sq Rel)

Squared relative differences:

\text{Sq Rel} = \frac{1}{|T|} \sum_{i \in T} \frac{(z_i - \hat{z}_i)^2}{z_i}

Interpretation:

  • Lower is better
  • More heavily penalizes outliers than Abs Rel
  • Less commonly reported than other metrics

Scale-Invariant Metrics

For relative depth evaluation where absolute scale is irrelevant:

Scale-invariant log error:

\text{SI-log} = \sqrt{\frac{1}{|T|} \sum_{i \in T} \left(\log z_i - \log \hat{z}_i - \bar{d}\right)^2}

where \bar{d} is the mean log difference (aligning the scales).

Use case: Evaluating models that predict relative depth without absolute scale.
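Subtracting the mean log difference makes the metric invariant to any global scale on the prediction, which a quick sketch confirms:

```python
import numpy as np

def si_log(gt, pred):
    """Scale-invariant log RMSE: unchanged if pred is scaled by a constant."""
    d = np.log(gt) - np.log(pred)
    return np.sqrt(np.mean((d - d.mean()) ** 2))

gt = np.array([1.0, 2.0, 4.0])
# A prediction that is correct up to a global 3x scale scores (near) zero:
err = si_log(gt, 3.0 * gt)
```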

Output Interpretation

Depth Map Visualization

Converting depth values to interpretable images:

Grayscale representation:

  • Normalize depth values to [0, 255]
  • Black = far, White = near (or inverted)
  • Simple but limited perceptual range

Color mapping:

  • Apply colormaps (Viridis, Plasma, Turbo, Jet)
  • Better perceptual discrimination of depth levels
  • More visually appealing
  • Standard in publications and demos

Example (Python):

import matplotlib.pyplot as plt

# depth_normalized: H x W array already scaled to [0, 1]
depth_colored = plt.cm.viridis(depth_normalized)  # H x W x 4 RGBA array
plt.imshow(depth_colored)

Normalization Strategies

Depth maps often require normalization for visualization or downstream tasks:

Min-max normalization:

d_{\text{norm}} = \frac{d - d_{\min}}{d_{\max} - d_{\min}}

  • Maps to [0, 1] range
  • Preserves relative ordering
  • Sensitive to outliers

Percentile clipping:

  • Clip to 1st and 99th percentiles
  • Then apply min-max normalization
  • More robust to outliers and noise

Inverse depth normalization:

  • Work with 1/d instead of d
  • Better numerical properties for distant objects
  • Common in learning-based methods
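A combined sketch of percentile clipping followed by min-max normalization (the 1st/99th percentile choice is a common convention, not a fixed rule):

```python
import numpy as np

def normalize_depth(depth, lo_pct=1, hi_pct=99):
    """Clip to [lo_pct, hi_pct] percentiles, then rescale to [0, 1]."""
    lo, hi = np.percentile(depth, [lo_pct, hi_pct])
    clipped = np.clip(depth, lo, hi)
    return (clipped - lo) / max(hi - lo, 1e-6)

d = np.array([0.5, 1.0, 2.0, 100.0])   # 100.0 acts as an outlier
norm = normalize_depth(d)               # ordering preserved, range [0, 1]
```

Without the clipping step, the single outlier at 100.0 would compress the other three values into a tiny sliver of the output range.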

Converting to 3D Point Clouds

Depth maps can be unprojected to 3D points:

Camera intrinsics required:

  • Focal length f_x, f_y
  • Principal point c_x, c_y

Unprojection formula for pixel (u, v) with depth z:

X = \frac{(u - c_x) \cdot z}{f_x}, \quad Y = \frac{(v - c_y) \cdot z}{f_y}, \quad Z = z

Result: 3D point cloud representing the scene geometry.
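The unprojection vectorizes over the whole map; the intrinsics below are illustrative, not from a calibrated camera:

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Unproject an (H, W) depth map into an (H*W, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinate grids
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

pts = depth_to_points(np.full((2, 2), 2.0), fx=100.0, fy=100.0, cx=0.5, cy=0.5)
```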

Applications:

  • 3D reconstruction
  • Mesh generation
  • Scene understanding
  • AR/VR rendering

Confidence and Uncertainty

Some methods provide uncertainty estimates alongside depth:

Types:

  • Aleatoric uncertainty: Inherent noise in data
  • Epistemic uncertainty: Model uncertainty (lack of knowledge)

Use cases:

  • Filter unreliable predictions
  • Adaptive processing based on confidence
  • Active learning for data collection

Common Challenges

Scale Ambiguity

Problem: Monocular depth cannot determine absolute scale.

Manifestation:

  • Same model produces different scales for different scenes
  • Toy objects vs. real objects confusion
  • Inconsistent metric values

Solutions:

  • Accept relative depth for applicable use cases
  • Fine-tune on domain-specific data with consistent scale
  • Use sensor fusion (camera + LiDAR) for ground truth
  • Incorporate known object sizes as cues
  • Multi-view geometry for scale recovery

Reflective and Transparent Surfaces

Problem: Mirrors, glass, water violate appearance-depth consistency.

Why it happens:

  • Reflected/refracted content doesn't match actual surface depth
  • Models trained on opaque surfaces struggle
  • Specular reflections mislead appearance-based methods

Impact:

  • Windows often assigned incorrect depth
  • Mirrors show depth of reflected content, not surface
  • Water bodies may have inconsistent depth

Solutions:

  • Training data with challenging reflective surfaces
  • Multi-modal inputs (polarization, thermal)
  • Explicit modeling of reflectance properties
  • Post-processing to detect and handle glass/mirrors

Textureless Regions

Problem: Large uniform areas (walls, sky, roads) lack features.

Why it happens:

  • Deep learning relies on visual patterns
  • Flat color regions provide little information
  • Self-supervised methods get weak photometric signal

Impact:

  • Smooth regions may have noisy or incorrect depth
  • Over-smoothing or artifacts
  • Uncertain boundaries

Solutions:

  • Smoothness priors and regularization
  • Multi-scale feature extraction
  • Transformer attention for global context
  • Edge-aware refinement
  • Surface normal constraints

Edge Artifacts

Problem: Blurry or inaccurate depth boundaries between objects.

Causes:

  • Upsampling in decoder loses fine detail
  • Conflicting depth values at boundaries
  • Limited resolution in latent representations

Impact:

  • Fuzzy object boundaries
  • Halo effects
  • Depth bleeding across edges

Solutions:

  • Higher resolution processing
  • Edge-preserving losses
  • Guided filtering with image edges
  • Attention mechanisms for sharp boundaries
  • Instance-aware depth prediction

Indoor vs. Outdoor Scene Differences

Problem: Performance varies significantly between environments.

Differences:

  • Indoor: Complex layouts, small spaces, more occlusion, artificial lighting
  • Outdoor: Larger scales, different depth ranges, natural lighting, weather

Impact:

  • Models trained on one domain struggle on the other
  • Depth range assumptions may not transfer
  • Different typical object distributions

Solutions:

  • Domain-specific training or fine-tuning
  • Mixed dataset training (like MiDaS)
  • Domain adaptation techniques
  • Separate models for different environments
  • Adaptive normalization based on scene type

Computational Cost

Trade-off: Accuracy vs. inference speed.

Factors:

  • Model architecture (CNN vs. Transformer)
  • Input resolution
  • Model size (parameters)

Speed requirements:

  • Real-time robotics: 30+ FPS
  • Offline 3D reconstruction: Slower acceptable
  • Mobile AR: Must run on device with limited power

Optimization:

  • Smaller models (MiDaS-small, Depth Anything-S)
  • Lower input resolution with upsampling
  • Model quantization and pruning
  • Hardware acceleration (TensorRT, ONNX)
  • Efficient architectures (MobileNet-based encoders)

Practical Applications

3D Reconstruction

Creating 3D models from 2D images:

Process:

  1. Depth estimation for each view
  2. Point cloud generation
  3. Mesh reconstruction (Poisson, TSDF fusion)
  4. Texture mapping

Applications:

  • Building and environment scanning
  • Cultural heritage preservation
  • E-commerce product models
  • Virtual reality environments

Autonomous Navigation

Depth sensing for robots and vehicles:

Use cases:

  • Obstacle detection and avoidance
  • Path planning in 3D space
  • Terrain assessment
  • Safe distance estimation

Advantages of monocular:

  • Works with single camera (cost-effective)
  • Complements LiDAR and radar sensors
  • Wide field of view

AR/VR Applications

Depth information for immersive experiences:

Applications:

  • Occlusion handling (virtual objects behind real ones)
  • Physics simulation (objects interact with environment)
  • Hand tracking and gesture recognition
  • Scene understanding and semantic mapping

Requirements:

  • Real-time performance (30+ FPS)
  • Accurate depth at interactive ranges
  • Temporal consistency across frames
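Temporal consistency is often approximated with simple filtering; an exponential moving average over successive depth maps is one common, if crude, option (this sketch is not tied to any particular AR framework):

```python
import numpy as np

class DepthSmoother:
    """Exponential moving average over per-frame depth maps to reduce flicker."""

    def __init__(self, alpha=0.8):
        self.alpha = alpha   # higher alpha = smoother but laggier
        self.state = None

    def update(self, depth):
        if self.state is None:
            self.state = depth.astype(float)
        else:
            self.state = self.alpha * self.state + (1 - self.alpha) * depth
        return self.state
```

A per-pixel EMA blurs across true depth changes when the camera moves quickly; production systems typically reproject the previous map with the estimated camera pose before blending.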

Robotics

Depth perception for robot manipulation and interaction:

Use cases:

  • Grasp planning and manipulation
  • Navigation in cluttered environments
  • Human-robot interaction (safe distances)
  • Object localization and tracking

Challenges:

  • Need metric depth for precise control
  • Real-time requirements
  • Varied lighting and environments

Photo Effects

Depth-based image editing:

Effects:

  • Bokeh/Portrait mode: Blur background based on depth
  • 3D Photos: Parallax effect from depth (Facebook 3D Photos)
  • Relighting: Depth-aware lighting adjustments
  • Depth-based filters: Artistic effects using depth

Approach:

  • Depth estimation from single photo
  • Relative depth sufficient (no metric accuracy needed)
  • Post-processing for smoothness and quality
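A toy version of that pipeline: threshold the depth map into a foreground mask, blur everything, and composite. Real portrait modes use soft matting and depth-varying blur kernels; this NumPy-only sketch uses a box blur and a hard threshold purely for illustration:

```python
import numpy as np

def box_blur(img, k=5):
    """Naive k x k box blur on a grayscale (H, W) image via shifted sums."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def fake_bokeh(img, depth, near_thresh):
    """Keep pixels closer than near_thresh sharp; blur the background."""
    mask = depth < near_thresh          # True where the subject is near
    return np.where(mask, img, box_blur(img))
```

Because only relative ordering matters here, any depth map with a consistent near/far convention works; no metric calibration is needed.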

Popular implementations:

  • Smartphone portrait modes
  • Instagram/Snapchat filters
  • Photo editing software

Accessibility Tools

Depth information for visually impaired users:

Applications:

  • Audio feedback about obstacles and distances
  • Haptic feedback for navigation
  • Describing spatial relationships in scenes
  • Safe mobility assistance

Requirements:

  • Real-time depth on mobile devices
  • Accurate obstacle detection
  • Reliable in varied environments

Safety and Surveillance

Depth-enhanced monitoring:

Use cases:

  • Perimeter intrusion detection (depth-based zones)
  • Fall detection (person height from depth)
  • Crowd density estimation
  • Anomaly detection in 3D space

Choosing an Approach

Consider these factors when selecting a depth estimation method:

For general-purpose zero-shot depth:

  • Depth Anything: Latest, strongest generalization
  • MiDaS v3.1: Excellent balance, widely used
  • Predict relative depth, work across domains
  • Good for photo effects, visualization, initial prototyping

For metric depth estimation:

  • Fine-tune on domain-specific data with ground truth
  • Use stereo or LiDAR during training
  • Essential for robotics and autonomous systems
  • Consider domain: indoor (NYU Depth v2) vs. outdoor (KITTI)

For real-time applications:

  • Smaller models (MiDaS-small, Depth Anything-S)
  • Lower input resolution (e.g., 256×256 or 384×384)
  • Efficient backbones (MobileNet, EfficientNet)
  • Optimize with TensorRT or ONNX
  • Profile on target hardware

For highest accuracy:

  • Large transformer models (Depth Anything-L, DPT-Large)
  • High input resolution (512×512 or higher)
  • Ensemble multiple models
  • Multi-view or stereo methods if possible
  • Accept slower inference

For indoor scenes:

  • Models trained on indoor datasets (NYU Depth v2)
  • Smaller depth ranges, complex layouts
  • Fine-tune on similar environments

For outdoor/driving scenes:

  • Models trained on KITTI or similar
  • Larger depth ranges
  • Handle varying lighting and weather

For mobile deployment:

  • Lightweight architectures
  • Quantization (INT8 or FP16)
  • On-device frameworks (TensorFlow Lite, CoreML)
  • Balance accuracy and latency

Next Steps

Ready to train or fine-tune depth estimation models? Our Depth Estimation Training Guide provides comprehensive documentation on:

  • Available architectures (MiDaS, DPT, Depth Anything)
  • Training strategies for metric vs. relative depth
  • Dataset preparation and augmentation
  • Fine-tuning on custom domains
  • Self-supervised training techniques
  • Inference optimization and deployment

For related computer vision tasks, see the other task guides in this documentation.

