Text to Image
Generate photo-realistic images from text descriptions
Text-to-image generation is the task of creating images from textual descriptions using diffusion models. These models learn to gradually denoise random noise into coherent images guided by text prompts, enabling creative applications ranging from art generation to product visualization and content creation.
Learn About Text-to-Image Generation
New to text-to-image generation? Visit our Text-to-Image Concepts Guide to learn about diffusion models, prompt engineering, and best practices for generating high-quality images.
Available Models
Stable Diffusion Models
Stable Diffusion uses latent diffusion to efficiently generate high-quality images from text prompts with excellent prompt understanding.
- Stable Diffusion v1.5 - Most popular text-to-image model, 512x512 output, excellent prompt adherence
Common Configuration
Training Data Structure
Fine-tuning requires paired images and captions:
train_images/
├── image1.jpg
├── image2.jpg
└── ...
image_captions.csv
├── filename,caption
├── image1.jpg,"A detailed description"
├── image2.jpg,"Another description"
└── ...
Key Generation Parameters
Num Inference Steps: Number of denoising steps
- More steps: Higher quality, slower generation
- Fewer steps: Faster, potentially less refined
- Typical range: 20-100 steps
- Default of 50 is a good balance
Guidance Scale: Strength of text prompt influence
- Higher values: Closer adherence to prompt, less creative
- Lower values: More creative variation, less prompt adherence
- Typical range: 5.0-15.0
- Default 7.5 works for most cases
Seed: Random seed for reproducibility
- Same seed + prompt + settings = identical image
- Essential for iterating on prompts
- Change seed for variations on same prompt
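As a minimal sketch, the three parameters above map directly onto a generation call in the diffusers library (assumed installed; the model id `"runwayml/stable-diffusion-v1-5"` is one common hosting of Stable Diffusion v1.5, and the `generate` function name is our own):

```python
def generate(prompt, steps=50, guidance_scale=7.5, seed=42):
    """Generate one image; fixing `seed` makes the result reproducible."""
    import torch  # heavy dependencies imported lazily
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")
    # A seeded generator is what makes seed + prompt + settings reproducible
    generator = torch.Generator(device=pipe.device).manual_seed(seed)
    return pipe(
        prompt,
        num_inference_steps=steps,   # more steps: higher quality, slower
        guidance_scale=guidance_scale,  # higher: closer prompt adherence
        generator=generator,
    ).images[0]
```

Rerunning `generate` with the same arguments reproduces the image; change only `seed` to get variations on the same prompt.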
Learning Rate: Fine-tuning step size
- Lower rates: More stable, slower learning
- Higher rates: Faster but risk instability
- Typical range: 1e-6 to 5e-5
- Default of 1e-5 is recommended
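As a sketch of where the learning rate plugs in, assuming PyTorch is installed (`unet` here is a placeholder for the model being fine-tuned, and `make_optimizer` is our own name, not a library function):

```python
def make_optimizer(unet, learning_rate=1e-5):
    """Build a fine-tuning optimizer; 1e-5 matches the default above."""
    import torch  # lazy import so the sketch loads without the dependency
    # AdamW is a common choice for diffusion fine-tuning; lower the rate
    # toward 1e-6 if training loss spikes (see Training Instability below).
    return torch.optim.AdamW(unet.parameters(), lr=learning_rate)
```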
Fine-tuning vs Inference
Inference Only (Recommended for Most Users)
- Use pre-trained model with custom prompts
- No training data required
- Immediate results
- Great for general image generation
Fine-tuning
- Customize model for specific style or subject
- Requires 50-500 paired images and captions
- Training takes hours to days
- Best for consistent style or specialized subjects
Understanding Metrics
Training Loss: Measures reconstruction quality
- Should decrease steadily during training
- Sudden spikes indicate instability
- Lower is better
Denoising Quality: Visual assessment of generated images
- Coherence: Objects should be recognizable
- Detail: Fine details should be clear
- Prompt adherence: Image should match description
Choosing the Right Model
By Priority
Best Prompt Understanding
- Stable Diffusion v1.5 (excellent natural language comprehension)
Fastest Generation
- Use fewer inference steps (20-30)
- Reduce image resolution
- Lower guidance scale slightly
Highest Quality
- Use more inference steps (80-100)
- Fine-tune on domain-specific data
- Careful prompt engineering
By Use Case
Artistic Creation
- Stable Diffusion v1.5 with guidance_scale=7.5-10
- Experiment with different seeds
- Use detailed, descriptive prompts
Product Visualization
- Fine-tune on product images
- Higher guidance_scale (10-12) for accuracy
- Consistent seeds for product variations
Concept Art
- Lower guidance_scale (5-7) for creativity
- More inference steps (70-100)
- Iterate on prompts with same seed
Content Creation
- Default settings work well
- Focus on prompt engineering
- Generate multiple variations
Best Practices
Prompt Engineering
- Be specific: Include details about style, lighting, composition
- Use modifiers: Add quality terms like "highly detailed", "8k", "photorealistic"
- Describe style: Reference art styles, artists, or aesthetics
- Structure prompts: Subject + details + style + quality terms
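The Subject + details + style + quality structure can be captured in a small helper (the function name and default quality terms are our own, shown only to illustrate the pattern):

```python
def build_prompt(subject, details="", style="", quality="highly detailed, sharp focus"):
    """Assemble a prompt as subject + details + style + quality terms."""
    parts = [subject, details, style, quality]
    return ", ".join(p for p in parts if p)  # skip empty sections

print(build_prompt("a red fox", "sitting in snow", "watercolor painting"))
# a red fox, sitting in snow, watercolor painting, highly detailed, sharp focus
```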
Generation Strategy
- Start with defaults: 50 steps, guidance_scale 7.5
- Iterate systematically: Change one parameter at a time
- Use consistent seeds: For comparing prompt variations
- Batch generation: Generate multiple seeds to find best result
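Batch generation over seeds can be sketched as a simple sweep, assuming a diffusers pipeline has already been loaded as `pipe` (the `sweep_seeds` name is ours):

```python
def sweep_seeds(pipe, prompt, seeds=(0, 1, 2, 3)):
    """Generate one image per seed with a fixed prompt; compare and keep the winner."""
    import torch  # lazy import: only needed when actually generating
    images = {}
    for seed in seeds:
        generator = torch.Generator(device=pipe.device).manual_seed(seed)
        images[seed] = pipe(prompt, generator=generator).images[0]
    return images  # dict of seed -> image, so the best seed can be recorded
```

Recording which seed produced the best result lets you reuse it later for consistent variations.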
Fine-tuning Guidelines
- Dataset quality: High-quality images with detailed captions
- Caption consistency: Similar level of detail across captions
- Training duration: Start with 500-1000 steps, monitor loss
- Regular testing: Generate sample images during training
- Avoid overfitting: Stop if generated images lose diversity
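The `filename,caption` CSV layout shown under Training Data Structure can be loaded with a small standard-library helper (the `load_caption_pairs` name is our own, not part of any training framework):

```python
import csv
import io

def load_caption_pairs(csv_text):
    """Parse a filename,caption CSV into a list of (filename, caption) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["filename"], row["caption"]) for row in reader]

# Example matching the structure shown earlier:
sample = 'filename,caption\nimage1.jpg,"A detailed description"\nimage2.jpg,"Another description"\n'
pairs = load_caption_pairs(sample)
print(pairs[0])  # ('image1.jpg', 'A detailed description')
```

Checking the parsed pairs before training is a cheap way to catch caption-quality problems early.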
Hardware Considerations
- GPU recommended: 8GB+ VRAM for 512x512 images
- Inference speed: ~2-5 seconds per image with GPU
- Training requirements: 16GB+ VRAM for fine-tuning
- CPU generation: Possible but 10-20x slower
Common Prompt Patterns
Photorealistic Images
"a photograph of [subject], [details], natural lighting,
high resolution, professional photography, sharp focus"
Artistic Style
"[subject] in the style of [artist/style], [mood],
[color palette], highly detailed digital art"
Product Rendering
"professional product photo of [product], [background],
studio lighting, high quality, commercial photography"
Concept Art
"concept art of [subject], [setting], [mood],
cinematic lighting, detailed, trending on artstation"
Common Pitfalls
Low Quality Outputs
Solution: Increase inference steps to 70-100, add quality terms to prompt, check guidance_scale is 7.5+
Prompt Not Followed
Solution: Increase guidance_scale to 10-15, make prompt more specific, try rephrasing prompt
Repetitive Results
Solution: Change seed, reduce guidance_scale slightly, add variety to prompts
Training Instability
Solution: Lower learning rate to 5e-6, reduce batch size, check caption quality
Out of Memory Errors
Solution: Reduce batch size, use gradient checkpointing, generate smaller images
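A hedged sketch of common diffusers memory savers for inference (the API names `torch_dtype` and `enable_attention_slicing` come from the diffusers library; check that your installed version supports them, and note this assumes a CUDA GPU):

```python
def load_memory_efficient_pipeline(model_id="runwayml/stable-diffusion-v1-5"):
    """Load a pipeline with reduced VRAM usage."""
    import torch  # lazy imports: heavy dependencies
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # half precision roughly halves VRAM
    )
    pipe.enable_attention_slicing()  # trades a little speed for lower peak memory
    return pipe.to("cuda")
```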
Artifacts or Distortions
Solution: Increase inference steps, adjust guidance_scale, improve prompt clarity
Advanced Techniques
Negative Prompts
- Specify what you don't want in the image
- Useful for avoiding common artifacts
- Example: "blurry, low quality, distorted, ugly"
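In diffusers pipelines, negative prompts are passed via the `negative_prompt` argument; a minimal sketch, assuming `pipe` is an already-loaded pipeline (the function name is ours):

```python
def generate_with_negative(pipe, prompt,
                           negative="blurry, low quality, distorted, ugly"):
    """Generate an image while steering away from the listed artifacts."""
    # negative_prompt pushes the denoising process away from these terms
    return pipe(prompt, negative_prompt=negative).images[0]
```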
Iterative Refinement
- Generate with default settings
- Identify what's wrong
- Adjust prompt or parameters
- Keep seed constant for comparison
Style Consistency
- Fine-tune on consistent dataset
- Use similar prompt structures
- Document successful prompts
- Maintain seed lists for good results
Multi-subject Composition
- Be explicit about spatial relationships
- Use positional terms (left, right, foreground, background)
- Separate subjects with commas
- Increase guidance_scale for complex scenes