Text to Image
Generate photo-realistic images from text descriptions
Text-to-image generation is the task of creating images from textual descriptions using diffusion models. These models learn to gradually denoise random noise into coherent images guided by text prompts, enabling creative applications ranging from art generation to product visualization and content creation.
Learn About Text-to-Image Generation
New to text-to-image generation? Visit our Text-to-Image Concepts Guide to learn about diffusion models, prompt engineering, and best practices for generating high-quality images.
Available Models
Stable Diffusion Models
Stable Diffusion uses latent diffusion to efficiently generate high-quality images from text prompts with excellent prompt understanding.
- Stable Diffusion v1.5 - Most popular text-to-image model, 512x512 output, excellent prompt adherence
Common Configuration
Training Data Structure
Fine-tuning requires paired images and captions:
train_images/
├── image1.jpg
├── image2.jpg
└── ...
image_captions.csv
├── filename,caption
├── image1.jpg,"A detailed description"
├── image2.jpg,"Another description"
└── ...
Key Generation Parameters
Num Inference Steps: Number of denoising steps
- More steps: Higher quality, slower generation
- Fewer steps: Faster, potentially less refined
- Typical range: 20-100 steps
- Default of 50 is a good balance
Guidance Scale: Strength of text prompt influence
- Higher values: Closer adherence to prompt, less creative
- Lower values: More creative variation, less prompt adherence
- Typical range: 5.0-15.0
- Default 7.5 works for most cases
Seed: Random seed for reproducibility
- Same seed + prompt + settings = identical image
- Essential for iterating on prompts
- Change seed for variations on same prompt
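As a minimal sketch, the three parameters above map directly onto a generation call in the diffusers library (assumed installed; the model id `"runwayml/stable-diffusion-v1-5"` is one common hosting of Stable Diffusion v1.5, and the `generate` function name is our own):

```python
def generate(prompt, steps=50, guidance_scale=7.5, seed=42):
    """Generate one image; fixing `seed` makes the result reproducible."""
    import torch  # heavy dependencies imported lazily
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")
    # A seeded generator is what makes seed + prompt + settings reproducible
    generator = torch.Generator(device=pipe.device).manual_seed(seed)
    return pipe(
        prompt,
        num_inference_steps=steps,   # more steps: higher quality, slower
        guidance_scale=guidance_scale,  # higher: closer prompt adherence
        generator=generator,
    ).images[0]
```

Rerunning `generate` with the same arguments reproduces the image; change only `seed` to get variations on the same prompt.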
Learning Rate: Fine-tuning step size
- Lower rates: More stable, slower learning
- Higher rates: Faster but risk instability
- Typical range: 1e-6 to 5e-5
- Default of 1e-5 is recommended
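As a sketch of where the learning rate plugs in, assuming PyTorch is installed (`unet` here is a placeholder for the model being fine-tuned, and `make_optimizer` is our own name, not a library function):

```python
def make_optimizer(unet, learning_rate=1e-5):
    """Build a fine-tuning optimizer; 1e-5 matches the default above."""
    import torch  # lazy import so the sketch loads without the dependency
    # AdamW is a common choice for diffusion fine-tuning; lower the rate
    # toward 1e-6 if training loss spikes (see Training Instability below).
    return torch.optim.AdamW(unet.parameters(), lr=learning_rate)
```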
Fine-tuning vs Inference
Inference Only (Recommended for Most Users)
- Use pre-trained model with custom prompts
- No training data required
- Immediate results
- Great for general image generation
Fine-tuning
- Customize model for specific style or subject
- Requires 50-500 paired images and captions
- Training takes hours to days
- Best for consistent style or specialized subjects
Understanding Metrics
Training Loss: Measures reconstruction quality
- Should decrease steadily during training
- Sudden spikes indicate instability
- Lower is better
Denoising Quality: Visual assessment of generated images
- Coherence: Objects should be recognizable
- Detail: Fine details should be clear
- Prompt adherence: Image should match description
Choosing the Right Model
By Priority
Best Prompt Understanding
- Stable Diffusion v1.5 (excellent natural language comprehension)
Fastest Generation
- Use fewer inference steps (20-30)
- Reduce image resolution
- Lower guidance scale slightly
Highest Quality
- Use more inference steps (80-100)
- Fine-tune on domain-specific data
- Careful prompt engineering
By Use Case
Artistic Creation
- Stable Diffusion v1.5 with guidance_scale=7.5-10
- Experiment with different seeds
- Use detailed, descriptive prompts
Product Visualization
- Fine-tune on product images
- Higher guidance_scale (10-12) for accuracy
- Consistent seeds for product variations
Concept Art
- Lower guidance_scale (5-7) for creativity
- More inference steps (70-100)
- Iterate on prompts with same seed
Content Creation
- Default settings work well
- Focus on prompt engineering
- Generate multiple variations
Best Practices
Prompt Engineering
- Be specific: Include details about style, lighting, composition
- Use modifiers: Add quality terms like "highly detailed", "8k", "photorealistic"
- Describe style: Reference art styles, artists, or aesthetics
- Structure prompts: Subject + details + style + quality terms
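The Subject + details + style + quality structure can be captured in a small helper (the function name and default quality terms are our own, shown only to illustrate the pattern):

```python
def build_prompt(subject, details="", style="", quality="highly detailed, sharp focus"):
    """Assemble a prompt as subject + details + style + quality terms."""
    parts = [subject, details, style, quality]
    return ", ".join(p for p in parts if p)  # skip empty sections

print(build_prompt("a red fox", "sitting in snow", "watercolor painting"))
# a red fox, sitting in snow, watercolor painting, highly detailed, sharp focus
```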
Generation Strategy
- Start with defaults: 50 steps, guidance_scale 7.5
- Iterate systematically: Change one parameter at a time
- Use consistent seeds: For comparing prompt variations
- Batch generation: Generate multiple seeds to find best result
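Batch generation over seeds can be sketched as a simple sweep, assuming a diffusers pipeline has already been loaded as `pipe` (the `sweep_seeds` name is ours):

```python
def sweep_seeds(pipe, prompt, seeds=(0, 1, 2, 3)):
    """Generate one image per seed with a fixed prompt; compare and keep the winner."""
    import torch  # lazy import: only needed when actually generating
    images = {}
    for seed in seeds:
        generator = torch.Generator(device=pipe.device).manual_seed(seed)
        images[seed] = pipe(prompt, generator=generator).images[0]
    return images  # dict of seed -> image, so the best seed can be recorded
```

Recording which seed produced the best result lets you reuse it later for consistent variations.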
Fine-tuning Guidelines
- Dataset quality: High-quality images with detailed captions
- Caption consistency: Similar level of detail across captions
- Training duration: Start with 500-1000 steps, monitor loss
- Regular testing: Generate sample images during training
- Avoid overfitting: Stop if generated images lose diversity
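The `filename,caption` CSV layout shown under Training Data Structure can be loaded with a small standard-library helper (the `load_caption_pairs` name is our own, not part of any training framework):

```python
import csv
import io

def load_caption_pairs(csv_text):
    """Parse a filename,caption CSV into a list of (filename, caption) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["filename"], row["caption"]) for row in reader]

# Example matching the structure shown earlier:
sample = 'filename,caption\nimage1.jpg,"A detailed description"\nimage2.jpg,"Another description"\n'
pairs = load_caption_pairs(sample)
print(pairs[0])  # ('image1.jpg', 'A detailed description')
```

Checking the parsed pairs before training is a cheap way to catch caption-quality problems early.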
Hardware Considerations
- GPU recommended: 8GB+ VRAM for 512x512 images
- Inference speed: ~2-5 seconds per image with GPU
- Training requirements: 16GB+ VRAM for fine-tuning
- CPU generation: Possible but 10-20x slower
Common Prompt Patterns
Photorealistic Images
"a photograph of [subject], [details], natural lighting,
high resolution, professional photography, sharp focus"
Artistic Style
"[subject] in the style of [artist/style], [mood],
[color palette], highly detailed digital art"
Product Rendering
"professional product photo of [product], [background],
studio lighting, high quality, commercial photography"
Concept Art
"concept art of [subject], [setting], [mood],
cinematic lighting, detailed, trending on artstation"
Common Pitfalls
Low Quality Outputs
Solution: Increase inference steps to 70-100, add quality terms to prompt, check guidance_scale is 7.5+
Prompt Not Followed
Solution: Increase guidance_scale to 10-15, make prompt more specific, try rephrasing prompt
Repetitive Results
Solution: Change seed, reduce guidance_scale slightly, add variety to prompts
Training Instability
Solution: Lower learning rate to 5e-6, reduce batch size, check caption quality
Out of Memory Errors
Solution: Reduce batch size, use gradient checkpointing, generate smaller images
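A hedged sketch of common diffusers memory savers for inference (the API names `torch_dtype` and `enable_attention_slicing` come from the diffusers library; check that your installed version supports them, and note this assumes a CUDA GPU):

```python
def load_memory_efficient_pipeline(model_id="runwayml/stable-diffusion-v1-5"):
    """Load a pipeline with reduced VRAM usage."""
    import torch  # lazy imports: heavy dependencies
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # half precision roughly halves VRAM
    )
    pipe.enable_attention_slicing()  # trades a little speed for lower peak memory
    return pipe.to("cuda")
```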
Artifacts or Distortions
Solution: Increase inference steps, adjust guidance_scale, improve prompt clarity
Advanced Techniques
Negative Prompts
- Specify what you don't want in the image
- Useful for avoiding common artifacts
- Example: "blurry, low quality, distorted, ugly"
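In diffusers pipelines, negative prompts are passed via the `negative_prompt` argument; a minimal sketch, assuming `pipe` is an already-loaded pipeline (the function name is ours):

```python
def generate_with_negative(pipe, prompt,
                           negative="blurry, low quality, distorted, ugly"):
    """Generate an image while steering away from the listed artifacts."""
    # negative_prompt pushes the denoising process away from these terms
    return pipe(prompt, negative_prompt=negative).images[0]
```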
Iterative Refinement
- Generate with default settings
- Identify what's wrong
- Adjust prompt or parameters
- Keep seed constant for comparison
Style Consistency
- Fine-tune on consistent dataset
- Use similar prompt structures
- Document successful prompts
- Maintain seed lists for good results
Multi-subject Composition
- Be explicit about spatial relationships
- Use positional terms (left, right, foreground, background)
- Separate subjects with commas
- Increase guidance_scale for complex scenes