
Text to Image

Generate photorealistic images from text descriptions

Text-to-image generation is the task of creating images from textual descriptions using diffusion models. These models learn to gradually denoise random noise into coherent images guided by text prompts, enabling creative applications ranging from art generation to product visualization and content creation.

Learn About Text-to-Image Generation

New to text-to-image generation? Visit our Text-to-Image Concepts Guide to learn about diffusion models, prompt engineering, and best practices for generating high-quality images.

Available Models

Stable Diffusion Models

Stable Diffusion runs the diffusion process in a compressed latent space rather than in pixel space, which makes generation efficient while retaining strong prompt understanding.

Common Configuration

Training Data Structure

Fine-tuning requires paired images and captions:

train_images/
├── image1.jpg
├── image2.jpg
└── ...

image_captions.csv
filename,caption
image1.jpg,"A detailed description"
image2.jpg,"Another description"
...
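Mismatched pairs (an image with no caption, or a caption whose image is missing) are a common source of silent fine-tuning problems, so it is worth validating the pairing up front. The sketch below assumes the layout shown above; the function name `check_dataset` is made up for illustration.

```python
import csv
from pathlib import Path

def check_dataset(image_dir: str, caption_csv: str) -> list[str]:
    """Return a list of problems: images without captions, captions
    without images, and empty captions."""
    images = {p.name for p in Path(image_dir).glob("*.jpg")}
    with open(caption_csv, newline="", encoding="utf-8") as f:
        captions = {row["filename"]: row["caption"] for row in csv.DictReader(f)}
    problems = [f"no caption for {name}" for name in sorted(images - captions.keys())]
    problems += [f"missing image for {name}" for name in sorted(captions.keys() - images)]
    problems += [f"empty caption for {name}" for name, cap in sorted(captions.items())
                 if name in images and not cap.strip()]
    return problems
```

An empty result means every image has a non-empty caption and every caption row points at an existing file.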

Key Generation Parameters

Num Inference Steps: Number of denoising steps

  • More steps: Higher quality, slower generation
  • Fewer steps: Faster, potentially less refined
  • Typical range: 20-100 steps
  • Default 50 is a good balance

Guidance Scale: Strength of text prompt influence

  • Higher values: Closer adherence to prompt, less creative
  • Lower values: More creative variation, less prompt adherence
  • Typical range: 5.0-15.0
  • Default 7.5 works for most cases

Seed: Random seed for reproducibility

  • Same seed + prompt + settings = identical image
  • Essential for iterating on prompts
  • Change seed for variations on same prompt

Learning Rate: Fine-tuning step size

  • Lower rates: More stable, slower learning
  • Higher rates: Faster but risk instability
  • Typical range: 1e-6 to 5e-5
  • Default 1e-5 recommended
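The generation parameters above map directly onto the Hugging Face diffusers API, one common way to run Stable Diffusion. The sketch below is illustrative, not this product's API: the model id is a placeholder for whichever checkpoint you use, and the heavy imports are deferred into the function so the defaults helper works without diffusers installed.

```python
def default_params(seed: int = 42) -> dict:
    """Collect the generation parameters discussed above, using the
    documented defaults (50 steps, guidance_scale 7.5)."""
    return {"num_inference_steps": 50, "guidance_scale": 7.5, "seed": seed}

def generate(prompt: str, **overrides):
    """Run one generation with Stable Diffusion via Hugging Face diffusers.
    Heavy imports happen here so the rest of the module stays importable."""
    import torch
    from diffusers import StableDiffusionPipeline

    params = default_params() | overrides
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",   # illustrative model id; substitute your checkpoint
        torch_dtype=torch.float16,
    ).to("cuda")
    # A seeded generator makes the run reproducible (same seed + prompt + settings).
    generator = torch.Generator("cuda").manual_seed(params["seed"])
    result = pipe(
        prompt,
        num_inference_steps=params["num_inference_steps"],
        guidance_scale=params["guidance_scale"],
        generator=generator,
    )
    return result.images[0]
```

Overriding a single parameter, e.g. `generate(prompt, guidance_scale=12.0)`, keeps the other defaults fixed, which matches the "change one parameter at a time" advice later in this guide.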

Fine-tuning vs Inference

Inference Only (Recommended for Most Users)

  • Use pre-trained model with custom prompts
  • No training data required
  • Immediate results
  • Great for general image generation

Fine-tuning

  • Customize model for specific style or subject
  • Requires 50-500 paired images and captions
  • Training takes hours to days
  • Best for consistent style or specialized subjects

Understanding Metrics

Training Loss: Measures reconstruction quality

  • Should decrease steadily during training
  • Sudden spikes indicate instability
  • Lower is better

Denoising Quality: Visual assessment of generated images

  • Coherence: Objects should be recognizable
  • Detail: Fine details should be clear
  • Prompt adherence: Image should match description

Choosing the Right Model

By Priority

Best Prompt Understanding

  1. Stable Diffusion v1.5 (excellent natural language comprehension)

Fastest Generation

  1. Use fewer inference steps (20-30)
  2. Reduce image resolution
  3. Lower guidance scale slightly

Highest Quality

  1. Use more inference steps (80-100)
  2. Fine-tune on domain-specific data
  3. Careful prompt engineering

By Use Case

Artistic Creation

  • Stable Diffusion v1.5 with guidance_scale=7.5-10
  • Experiment with different seeds
  • Use detailed, descriptive prompts

Product Visualization

  • Fine-tune on product images
  • Higher guidance_scale (10-12) for accuracy
  • Consistent seeds for product variations

Concept Art

  • Lower guidance_scale (5-7) for creativity
  • More inference steps (70-100)
  • Iterate on prompts with same seed

Content Creation

  • Default settings work well
  • Focus on prompt engineering
  • Generate multiple variations

Best Practices

Prompt Engineering

  1. Be specific: Include details about style, lighting, composition
  2. Use modifiers: Add quality terms like "highly detailed", "8k", "photorealistic"
  3. Describe style: Reference art styles, artists, or aesthetics
  4. Structure prompts: Subject + details + style + quality terms
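The "subject + details + style + quality terms" structure in point 4 can be wrapped in a small helper. This is a sketch; the function name and default quality terms are made up for illustration.

```python
def build_prompt(subject: str, details: str = "", style: str = "",
                 quality: tuple[str, ...] = ("highly detailed", "sharp focus")) -> str:
    """Assemble a prompt as subject + details + style + quality terms,
    skipping any parts left empty."""
    parts = [subject, details, style, ", ".join(quality)]
    return ", ".join(p for p in parts if p)
```

For example, `build_prompt("a red fox", "sitting in snow", "wildlife photography")` yields "a red fox, sitting in snow, wildlife photography, highly detailed, sharp focus".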

Generation Strategy

  1. Start with defaults: 50 steps, guidance_scale 7.5
  2. Iterate systematically: Change one parameter at a time
  3. Use consistent seeds: For comparing prompt variations
  4. Batch generation: Generate multiple seeds to find best result
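Point 4, batch generation across seeds, amounts to a small parameter sweep. In this sketch, `generate_fn` stands in for whatever generation call you use; only the seed varies between runs.

```python
def seed_sweep(prompt: str, seeds, generate_fn, **params):
    """Run the same prompt and parameters across several seeds and return
    (seed, result) pairs, so the best result can be picked afterwards."""
    return [(seed, generate_fn(prompt, seed=seed, **params)) for seed in seeds]
```

Keeping `params` fixed across the sweep means any difference between outputs is attributable to the seed alone.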

Fine-tuning Guidelines

  1. Dataset quality: High-quality images with detailed captions
  2. Caption consistency: Similar level of detail across captions
  3. Training duration: Start with 500-1000 steps, monitor loss
  4. Regular testing: Generate sample images during training
  5. Avoid overfitting: Stop if generated images lose diversity

Hardware Considerations

  • GPU recommended: 8GB+ VRAM for 512x512 images
  • Inference speed: ~2-5 seconds per image with GPU
  • Training requirements: 16GB+ VRAM for fine-tuning
  • CPU generation: Possible but 10-20x slower

Common Prompt Patterns

Photorealistic Images

"a photograph of [subject], [details], natural lighting,
high resolution, professional photography, sharp focus"

Artistic Style

"[subject] in the style of [artist/style], [mood],
[color palette], highly detailed digital art"

Product Rendering

"professional product photo of [product], [background],
studio lighting, high quality, commercial photography"

Concept Art

"concept art of [subject], [setting], [mood],
cinematic lighting, detailed, trending on artstation"

Common Pitfalls

Low Quality Outputs

Solution: Increase inference steps to 70-100, add quality terms to prompt, check guidance_scale is 7.5+

Prompt Not Followed

Solution: Increase guidance_scale to 10-15, make prompt more specific, try rephrasing prompt

Repetitive Results

Solution: Change seed, reduce guidance_scale slightly, add variety to prompts

Training Instability

Solution: Lower learning rate to 5e-6, reduce batch size, check caption quality

Out of Memory Errors

Solution: Reduce batch size, use gradient checkpointing, generate smaller images

Artifacts or Distortions

Solution: Increase inference steps, adjust guidance_scale, improve prompt clarity

Advanced Techniques

Negative Prompts

  • Specify what you don't want in the image
  • Useful for avoiding common artifacts
  • Example: "blurry, low quality, distorted, ugly"
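In the diffusers API (one common way to run Stable Diffusion), the negative prompt is passed as a separate `negative_prompt` argument alongside the positive prompt. The helper below merely bundles the two into call keyword arguments; its name and the default negative string (taken from the example above) are illustrative.

```python
DEFAULT_NEGATIVE = "blurry, low quality, distorted, ugly"  # example terms from above

def with_negative(prompt: str, negative: str = DEFAULT_NEGATIVE) -> dict:
    """Bundle positive and negative prompts into the keyword arguments a
    StableDiffusionPipeline call accepts (prompt=..., negative_prompt=...)."""
    return {"prompt": prompt, "negative_prompt": negative}
```

Usage would look like `pipe(**with_negative("a castle at dusk"))`, keeping the negative terms consistent across a session.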

Iterative Refinement

  1. Generate with default settings
  2. Identify what's wrong
  3. Adjust prompt or parameters
  4. Keep seed constant for comparison

Style Consistency

  • Fine-tune on consistent dataset
  • Use similar prompt structures
  • Document successful prompts
  • Maintain seed lists for good results

Multi-subject Composition

  • Be explicit about spatial relationships
  • Use positional terms (left, right, foreground, background)
  • Separate subjects with commas
  • Increase guidance_scale for complex scenes
