
Stable Diffusion v1.5

Industry-standard latent diffusion model for generating photo-realistic 512x512 images from text

Stable Diffusion v1.5 is the most widely adopted text-to-image diffusion model, trained on billions of text-image pairs from the LAION dataset. By running diffusion in a compressed latent space rather than pixel space, it generates high-quality 512x512 images efficiently while maintaining strong prompt understanding and artistic flexibility. Released by Runway ML and Stability AI, it has become the foundation for countless creative and commercial applications.

When to Use Stable Diffusion v1.5

Stable Diffusion v1.5 is ideal for:

  • General image generation from text descriptions with broad creative range
  • Fine-tuning for specific styles like brand aesthetics or artistic styles
  • Product visualization when customized to your product domain
  • Content creation for marketing, social media, and digital art
  • Concept art and ideation in creative workflows
  • Custom subject generation after fine-tuning on specific subjects

This is the go-to model when you need reliable, high-quality text-to-image generation with extensive community support and resources.

Strengths

  • Excellent prompt understanding: Natural language comprehension trained on billions of captions
  • 512x512 resolution: High enough quality for most applications, manageable compute
  • Fast generation: 2-5 seconds per image on modern GPUs with 50 steps
  • Memory efficient: inference fits in 6GB VRAM, 8GB comfortable
  • Highly versatile: Handles diverse subjects, styles, and compositions
  • Fine-tuning capable: Can be customized for specific domains with limited data
  • Mature ecosystem: Extensive documentation, tools, and community resources
  • Reproducible: Seed control enables perfect reproduction of results

Weaknesses

  • Limited resolution: 512x512 native (can upscale but quality degrades)
  • Poor text rendering: Cannot reliably generate readable text in images
  • Fine details challenging: Small objects or intricate patterns can be problematic
  • Hand/finger issues: Common artifact with human hands
  • Training bias: LAION dataset biases affect generated content
  • Inference time: 20-100 steps needed, slower than GAN-based approaches
  • Inconsistent faces: Sometimes produces distorted or unrealistic faces
  • Limited composition control: Spatial relationships can be unpredictable

Architecture Overview

Latent Diffusion Model

Stable Diffusion uses a three-component architecture:

  1. Variational Autoencoder (VAE)

    • Encoder: Compresses 512x512 image to 64x64 latent representation
    • Decoder: Reconstructs image from latent space
    • 8x spatial downsampling (512 to 64 per side) enables efficient diffusion
  2. U-Net Denoiser

    • Core diffusion model operating in latent space
    • Cross-attention layers for text conditioning
    • Predicts noise to remove at each step
    • 860M parameters
  3. Text Encoder (CLIP)

    • OpenAI CLIP ViT-L/14 text encoder
    • Converts prompts to 77-token embeddings
    • Enables semantic text-image alignment

Diffusion Process:

  • Forward: Gradually add noise to latent representation
  • Reverse: Iteratively denoise guided by text prompt
  • Classifier-free guidance strengthens prompt adherence

Specifications:

  • Total Parameters: ~1.07B = 860M (U-Net) + 123M (text encoder) + 84M (VAE)
  • Input: Text prompt (up to 77 tokens)
  • Output: 512x512 RGB image
  • Latent Space: 64x64x4
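As a sketch, the three components above map onto the Hugging Face diffusers API roughly as follows. This assumes the `diffusers` and `torch` packages and the `runwayml/stable-diffusion-v1-5` checkpoint are available; the `latent_shape` helper is illustrative and simply restates the 8x-compression specification.

```python
def latent_shape(image_size: int, channels: int = 4, vae_factor: int = 8):
    """Shape of the latent tensor the U-Net denoises (8x spatial compression)."""
    return (channels, image_size // vae_factor, image_size // vae_factor)


if __name__ == "__main__":
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # The three components described above are exposed as attributes:
    #   pipe.vae          - encoder/decoder between pixels and latents
    #   pipe.unet         - the denoiser operating on 64x64x4 latents
    #   pipe.text_encoder - CLIP ViT-L/14 producing 77-token embeddings
    image = pipe("a photograph of a red sports car, sunset lighting").images[0]
    image.save("car.png")
```

For a 512x512 output the U-Net therefore denoises a 4x64x64 latent, which is why the model is so much cheaper than pixel-space diffusion.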

Parameters

Training Configuration

Training Images

  • Type: Folder
  • Description: Directory containing training images for fine-tuning
  • Required: Yes for training
  • Format: JPG, PNG, WebP
  • Minimum: 50 images for style fine-tuning
  • Optimal: 200-500 images for best results
  • Resolution: Ideally 512x512; other sizes will be resized

Image Captions (Optional but Recommended)

  • Type: CSV file (TabularBlob)
  • Description: Paired captions for training images
  • Format: filename,caption columns
  • Required: No (but significantly improves results)
  • Caption quality: Detailed descriptions (10-50 words) work best
  • Consistency: Similar caption style across dataset recommended
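A minimal sketch of the expected caption file, using only the Python standard library. The filenames and captions are invented for illustration; only the `filename,caption` column layout comes from the format above.

```python
import csv
import io

# Two example rows in the required filename,caption layout.
rows = [
    ("img_001.jpg", "A golden retriever sitting in a park, natural daylight"),
    ("img_002.jpg", "A golden retriever running on a beach, overcast sky"),
]

buf = io.StringIO()  # stands in for an on-disk captions.csv
writer = csv.writer(buf)
writer.writerow(["filename", "caption"])
writer.writerows(rows)

# Reading it back the way a training script might:
buf.seek(0)
captions = {r["filename"]: r["caption"] for r in csv.DictReader(buf)}
```

Keeping every caption in a similar register and length, as the rows above do, is what the consistency recommendation refers to.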

Learning Rate (Default: 1e-5)

  • Range: 1e-6 to 5e-5
  • Type: Float
  • Recommendation:
    • 1e-5 for standard fine-tuning (safe default)
    • 5e-6 for very small datasets (<50 images)
    • 2e-5 for large datasets (>500 images)
  • Impact: Controls how much model adapts to training data

Batch Size (Default: 4)

  • Range: 1-16
  • Type: Integer
  • Recommendation:
    • 1-2 for 8GB GPU
    • 4-8 for 16GB GPU
    • 8-16 for 24GB+ GPU
  • Impact: Larger batches more stable but require more memory
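The learning-rate and batch-size recommendations above can be codified in two small helpers. This is a sketch: the thresholds simply restate the tables, they are not independently tuned values.

```python
def suggested_lr(num_images: int) -> float:
    """Learning-rate rule of thumb from the Training Configuration tables."""
    if num_images < 50:
        return 5e-6   # very small dataset: adapt gently
    if num_images > 500:
        return 2e-5   # large dataset: can afford faster adaptation
    return 1e-5       # safe default for standard fine-tuning


def suggested_batch_size(vram_gb: float) -> int:
    """Conservative batch-size pick per GPU memory tier."""
    if vram_gb >= 24:
        return 8
    if vram_gb >= 16:
        return 4
    return 1
```

For example, a 200-image dataset on a 16GB GPU would start from `learning_rate=1e-5, batch_size=4`, the same defaults listed above.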

Inference Configuration

Finetuned Checkpoint (Optional)

  • Type: Artifact (.pth file)
  • Description: Custom fine-tuned model weights
  • Required: No (uses base pre-trained model if not provided)
  • Use case: When you've fine-tuned for specific style/subject

Prompt (Required)

  • Type: Text
  • Description: Text description of image to generate
  • Required: Yes
  • Length: Up to 77 tokens (roughly 60-70 words)
  • Best practices: Be specific, descriptive, include style terms
  • Examples:
    • "a photograph of a red sports car, sunset lighting, professional photography"
    • "digital art of a fantasy castle, dramatic clouds, highly detailed"

Num Inference Steps (Default: 50)

  • Range: 20-100
  • Type: Integer
  • Recommendation:
    • 20-30 for fast drafts (lower quality)
    • 50 for standard generation (good balance)
    • 70-100 for high-quality final output
  • Impact: More steps = higher quality but slower
  • Diminishing returns after 80 steps

Guidance Scale (Default: 7.5)

  • Range: 1.0-20.0
  • Type: Float
  • Recommendation:
    • 5.0-7.0 for creative, varied results
    • 7.5 for balanced (default, works great)
    • 10.0-15.0 for strict prompt adherence
    • 15.0+ can cause oversaturation/artifacts
  • Impact: Higher = stronger prompt influence, less variation

Seed (Default: 42)

  • Range: Any integer
  • Type: Integer
  • Recommendation:
    • Fix seed when iterating on prompts
    • Change seed for variations
    • Document seeds of good results
  • Impact: Controls random initialization for reproducibility
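The inference parameters above can be bundled and validated before calling the pipeline. A sketch assuming the diffusers API and the `runwayml/stable-diffusion-v1-5` checkpoint; `generation_kwargs` is a hypothetical helper that just enforces the documented ranges.

```python
def generation_kwargs(prompt: str, steps: int = 50,
                      guidance: float = 7.5, seed: int = 42) -> dict:
    """Bundle inference parameters, enforcing the documented ranges."""
    if not 20 <= steps <= 100:
        raise ValueError("num_inference_steps should be in 20-100")
    if not 1.0 <= guidance <= 20.0:
        raise ValueError("guidance_scale should be in 1.0-20.0")
    return {"prompt": prompt, "num_inference_steps": steps,
            "guidance_scale": guidance, "seed": seed}


if __name__ == "__main__":
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    kwargs = generation_kwargs(
        "digital art of a fantasy castle, dramatic clouds, highly detailed",
        steps=70, guidance=10.0,
    )
    # A fixed torch.Generator seed is what makes results reproducible.
    generator = torch.Generator("cuda").manual_seed(kwargs.pop("seed"))
    image = pipe(generator=generator, **kwargs).images[0]
```

Re-running with the same seed and parameters reproduces the image exactly; changing only the seed yields variations on the same prompt.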

Configuration Tips

By Use Case

Artistic Exploration

  • Configuration: num_inference_steps=50, guidance_scale=7.5, vary seed
  • Prompts: Include art style references, mood descriptors
  • Strategy: Generate 4-8 variations, select best
  • Example: "impressionist painting of a garden, soft colors, monet style"

Product Visualization

  • Configuration: num_inference_steps=70, guidance_scale=10-12
  • Prompts: "professional product photo, [product], white background, studio lighting"
  • Strategy: Fine-tune on product images first for consistency
  • Use fixed seed for product variations

Character/Subject Consistency

  • Fine-tune on 100-200 images of subject
  • Use learning_rate=1e-5, batch_size=4
  • Training: 1000-2000 steps
  • Include subject token in all prompts after training

Concept Art

  • Configuration: num_inference_steps=70-100, guidance_scale=6-8
  • Prompts: Detailed scene descriptions with mood and style
  • Lower guidance_scale for creativity
  • Example: "concept art of futuristic city, neon lights, cyberpunk, highly detailed"

Fine-tuning Best Practices

Dataset Preparation

  1. Image quality: Use high-resolution source images (>512x512)
  2. Consistency: Similar style, lighting, or subject across dataset
  3. Variety: Enough variation to avoid overfitting (different angles, settings)
  4. Captions: Detailed, consistent description style

Training Configuration

  1. Starting point: learning_rate=1e-5, batch_size=4
  2. Duration: 500-2000 training steps depending on dataset size
  3. Monitoring: Generate test images every 100-200 steps
  4. Early stopping: Stop if outputs lose diversity or memorize training data

Caption Writing

  • Include important visual details (colors, lighting, composition)
  • Mention style if consistent (photographic, artistic, etc.)
  • Keep similar length and structure across dataset
  • Example good caption: "A golden retriever sitting in a park, natural daylight, shallow depth of field, professional photography"

Avoiding Overfitting

  • Don't train too long (watch for training loss plateau)
  • Use enough variety in training data (>50 images minimum)
  • Lower learning rate if model memorizes training images
  • Test on prompts not in training captions

Hardware Requirements

Minimum Configuration (Inference)

  • GPU: 6GB VRAM (GTX 1060 6GB, RTX 2060)
  • RAM: 8GB system memory
  • Storage: 4GB for model weights
  • Speed: ~10-15 seconds per image (50 steps)

Recommended Configuration (Inference)

  • GPU: 8GB VRAM (RTX 3060, RTX 4060)
  • RAM: 16GB system memory
  • Storage: 10GB (model + cache)
  • Speed: 2-5 seconds per image (50 steps)

Fine-tuning Requirements

  • GPU: 16GB+ VRAM (RTX 3090, RTX 4090, A100)
  • RAM: 32GB system memory
  • Storage: 20GB+ (model + checkpoints + dataset)
  • Training time: 1-4 hours for 1000 steps

CPU Generation

  • Possible but impractical (10-20x slower)
  • 50 steps takes 2-5 minutes per image
  • Only for testing without GPU access

Common Issues and Solutions

Poor Image Quality

Problem: Generated images are blurry, lack detail, or look unrealistic

Solutions:

  1. Increase num_inference_steps to 70-80
  2. Ensure guidance_scale is 7.5 or higher
  3. Improve prompt specificity and add quality terms
  4. Add negative prompts: "blurry, low quality, distorted"
  5. Try different seeds (some produce better results)
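Solutions 3 and 4 can be sketched in code. `with_quality_terms` is a hypothetical prompt-engineering helper (the appended terms are common boosters, not a model feature), and the guarded section assumes the diffusers pipeline, which accepts a `negative_prompt` argument.

```python
NEGATIVE_PROMPT = "blurry, low quality, distorted"


def with_quality_terms(prompt: str) -> str:
    """Append common quality boosters (prompt-engineering heuristic)."""
    return prompt + ", highly detailed, sharp focus"


if __name__ == "__main__":
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    image = pipe(
        with_quality_terms("a portrait of an astronaut"),
        negative_prompt=NEGATIVE_PROMPT,  # steers generation away from these traits
        num_inference_steps=80,           # solution 1: more steps
        guidance_scale=7.5,
    ).images[0]
```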

Prompt Not Followed

Problem: Generated image doesn't match text description

Solutions:

  1. Increase guidance_scale to 10-12
  2. Make prompt more specific and detailed
  3. Remove ambiguous or conflicting terms
  4. Place most important elements first in prompt
  5. Use more inference steps (70-100)

Artifacts and Distortions

Problem: Strange shapes, distorted faces, malformed hands

Solutions:

  1. Increase num_inference_steps to 80-100
  2. Adjust guidance_scale (try 8-10)
  3. Rephrase prompt to avoid problematic elements
  4. Change seed to try different random initialization
  5. Use negative prompts: "distorted, deformed, disfigured"

Training Instability

Problem: Training loss increases or becomes erratic

Solutions:

  1. Reduce learning_rate to 5e-6 or lower
  2. Decrease batch_size to 2
  3. Check caption quality and consistency
  4. Ensure images are valid and not corrupted
  5. Reduce training duration (stop earlier)

Out of Memory During Fine-tuning

Problem: CUDA out of memory error during training

Solutions:

  1. Reduce batch_size to 1
  2. Enable gradient checkpointing (if available in your framework)
  3. Use smaller resolution images (resize to 512x512)
  4. Close other GPU applications
  5. Use mixed precision training (FP16)
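Several of these mitigations are one-liners in diffusers. A sketch: the `max_batch_size` heuristic uses illustrative per-sample and overhead figures, not measurements, and the guarded section assumes the diffusers memory-saving APIs.

```python
def max_batch_size(vram_gb: float, per_sample_gb: float = 3.5,
                   base_gb: float = 5.0) -> int:
    """Rough heuristic: largest batch fitting after fixed model overhead.
    The cost figures are illustrative assumptions, not measurements."""
    return max(1, int((vram_gb - base_gb) // per_sample_gb))


if __name__ == "__main__":
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,             # solution 5: mixed precision
    ).to("cuda")
    pipe.enable_attention_slicing()            # lower peak VRAM, slightly slower
    pipe.unet.enable_gradient_checkpointing()  # solution 2: recompute activations
```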

Overfitting

Problem: Model only generates images similar to training data, lacks diversity

Solutions:

  1. Reduce training steps by 30-50%
  2. Lower learning_rate to 5e-6
  3. Add more variety to training dataset
  4. Stop training earlier (monitor test generations)
  5. Use larger, more diverse dataset (>100 images)

Example Use Cases

Brand Style Fine-tuning

Scenario: Company wants to generate marketing images in consistent brand style

Configuration:

Training:
  Images: 200 brand images
  Captions: Detailed descriptions of each image
  Learning Rate: 1e-5
  Batch Size: 4
  Training Steps: 1500

Inference:
  Num Inference Steps: 70
  Guidance Scale: 10.0
  Prompts: Include brand style descriptor

Why this works: Enough images for style learning, higher guidance ensures consistency

Expected Results: Consistent brand aesthetic, 80-90% prompt adherence

Product Mockup Generation

Scenario: E-commerce site needs product visualization images

Configuration:

Training:
  Images: 100-300 product photos
  Captions: "professional product photo of [product], white background, studio lighting"
  Learning Rate: 1e-5
  Training Steps: 1000

Inference:
  Num Inference Steps: 80
  Guidance Scale: 12.0
  Seed: Fixed for product variations

Why this works: Fine-tuned for product photography style, high guidance for accuracy

Expected Results: Professional-looking product images, consistent lighting/background

Concept Art Generation

Scenario: Game studio needs environment concept art

Configuration:

Inference Only (no fine-tuning):
  Num Inference Steps: 100
  Guidance Scale: 7.0
  Prompts: "concept art of [scene], [mood], cinematic lighting, highly detailed, trending on artstation"
  Seed: Generate 8-10 variations

Why this works: Base model excellent for artistic content, lower guidance allows creativity

Expected Results: Diverse artistic interpretations, high detail, cinematic quality

Custom Character Generation

Scenario: Content creator wants consistent character across images

Configuration:

Training:
  Images: 150 character images (different poses/angles)
  Captions: "[character name] [action/pose], [setting], [style]"
  Learning Rate: 1e-5
  Training Steps: 2000

Inference:
  Num Inference Steps: 60
  Guidance Scale: 8.0
  Prompts: Include character name token

Why this works: Character consistency through fine-tuning, varied poses prevent memorization

Expected Results: 85-95% character consistency, flexible posing/settings

Comparison with Alternatives

Stable Diffusion v1.5 vs DALL-E 2

Choose Stable Diffusion v1.5 when:

  • Need fine-tuning capability
  • Want local/self-hosted deployment
  • Require full control over generation
  • Need reproducibility with seeds
  • Budget-conscious (open source, no API costs)

Choose DALL-E 2 when:

  • Need higher resolution (>512x512)
  • Want simpler API interface
  • Don't need fine-tuning
  • Prefer cloud-based solution

Stable Diffusion v1.5 vs Stable Diffusion v2.x

Choose v1.5 when:

  • Need proven, mature model
  • Want extensive community resources
  • Fine-tuning with established techniques
  • Broader prompt compatibility

Choose v2.x when:

  • Need higher resolution support
  • Want improved prompt understanding
  • Can accept less community support
  • Need better consistency

Stable Diffusion v1.5 vs Midjourney

Choose Stable Diffusion when:

  • Need fine-tuning for custom styles
  • Want local deployment
  • Require programmatic access
  • Need exact reproducibility (seeds)
  • Budget matters (open source)

Choose Midjourney when:

  • Want highest artistic quality out-of-box
  • Don't need fine-tuning
  • Prefer Discord interface
  • Can accept less control

Pre-trained vs Fine-tuned

Use pre-trained when:

  • General image generation sufficient
  • Prompt engineering meets needs
  • Limited training data (<50 images)
  • Quick experimentation needed

Fine-tune when:

  • Need consistent style/subject
  • Have quality training dataset (100+ images)
  • Can invest time in training (few hours)
  • Require brand/product consistency
  • Want specialized domain knowledge
