
ViT Base

Vision Transformer Base model for image classification tasks

ViT (Vision Transformer) Base is a transformer-based architecture that treats image classification as a sequence modeling problem. It splits images into patches, projects them to embeddings, and processes them through standard transformer layers. With 86 million parameters, ViT Base offers an excellent balance between accuracy and computational requirements.

When to Use ViT Base

ViT Base excels in scenarios where you have:

  • Moderate to large datasets (1,000+ images per class)
  • Sufficient computational resources for transformer training
  • A need for high accuracy without the overhead of ViT Large
  • Complex visual patterns that benefit from global attention mechanisms

Choose ViT Base when you need better accuracy than ResNet-50 and have enough data to effectively fine-tune transformer models.

Strengths

  • Superior accuracy: Outperforms CNN models of similar size on most benchmarks
  • Global receptive field: Attention mechanism captures long-range dependencies from the first layer
  • Scalability: Architecture scales well to larger datasets and model sizes
  • Transfer learning: Pre-trained on ImageNet-21k, excellent for fine-tuning
  • Patch-based processing: Inherently handles variable input sizes with minimal modifications

Weaknesses

  • Data hungry: Requires more training data than CNNs for optimal performance
  • Computational cost: Higher memory and compute requirements than ResNet models
  • Training time: Slower to train than equivalent-sized CNN architectures
  • Inductive bias: Lacks the built-in translation equivariance of convolutional networks
  • Small dataset performance: May underperform ResNets when data is limited

Architecture Overview

Vision Transformer Design

ViT Base processes images through these stages:

  1. Patch Embedding: Splits 224x224 images into 16x16 patches (196 patches total)
  2. Linear Projection: Each patch is flattened and projected to 768 dimensions
  3. Position Embeddings: Added to retain spatial information
  4. Transformer Encoder: 12 layers with multi-head self-attention (12 heads per layer)
  5. Classification Head: MLP head on the [CLS] token output

Key Specifications:

  • Hidden size: 768
  • Number of layers: 12
  • Attention heads: 12
  • Patch size: 16x16
  • Parameters: ~86M
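The specifications above can be sanity-checked with a little arithmetic. This sketch (plain Python, no framework required) derives the patch count and a rough parameter estimate from the numbers listed; it omits LayerNorms, the patch/position embeddings, and the classification head, which together add roughly another million parameters:

```python
def vit_base_dimensions(image_size=224, patch_size=16, hidden=768,
                        layers=12, mlp_ratio=4):
    """Derive ViT Base's key sizes from its published specification."""
    patches_per_side = image_size // patch_size   # 224 / 16 = 14
    num_patches = patches_per_side ** 2           # 14 * 14 = 196
    patch_dim = patch_size * patch_size * 3       # flattened RGB patch: 768

    # Rough per-layer count: Q, K, V, and output projections (4 weight
    # matrices plus biases) and a 2-layer MLP of width hidden * mlp_ratio.
    attn = 4 * (hidden * hidden + hidden)
    mlp = 2 * (hidden * hidden * mlp_ratio) + hidden * mlp_ratio + hidden
    total = layers * (attn + mlp)                 # ~85M, matching "~86M" above
    return num_patches, num_patches, patch_dim, total
```

Note the coincidence that a flattened 16x16x3 patch has exactly 768 values, the same as the hidden size, so the linear projection is a square 768x768 matrix.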

Parameters

Training Configuration

Training Images

  • Type: Folder
  • Description: Directory containing training images organized in class subfolders
  • Format: Each subfolder name represents a class label
  • Required: Yes
  • Example structure:
    train_images/
    ├── dogs/
    ├── cats/
    └── birds/
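A minimal sketch of how such a layout is typically turned into labels (the function names here are hypothetical illustrations, not part of this tool): class subfolder names are sorted and mapped to integer indices, so the mapping is deterministic across runs.

```python
from pathlib import Path

def label_map_from_names(class_names):
    """Map class names to integer labels; sorting keeps the mapping stable."""
    return {name: idx for idx, name in enumerate(sorted(class_names))}

def label_map_from_folder(root):
    """Read class names from the subfolders of a training directory."""
    return label_map_from_names(p.name for p in Path(root).iterdir() if p.is_dir())
```

With the structure shown above, the three subfolders map to birds=0, cats=1, dogs=2.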

Batch Size (Default: 4)

  • Range: 1-32 (depending on GPU memory)
  • Recommendation:
    • 4-8 for 8GB GPU
    • 16-32 for 16GB+ GPU
    • Reduce if out-of-memory errors occur
  • Impact: Larger batches stabilize training but require more memory

Epochs (Default: 1)

  • Range: 1-20
  • Recommendation:
    • 1-3 epochs for large datasets (>10k images)
    • 3-10 epochs for medium datasets (1k-10k images)
    • 10-20 epochs for small datasets (<1k images)
  • Impact: More epochs improve accuracy but risk overfitting

Learning Rate (Default: 5e-5)

  • Range: 1e-6 to 5e-4
  • Recommendation:
    • 5e-5 for standard fine-tuning
    • 1e-5 for small datasets or few classes
    • 1e-4 for large datasets with many classes
  • Impact: Critical parameter: too high causes instability, too low slows convergence

Eval Steps (Default: 1)

  • Description: Number of epochs between evaluations during training
  • Recommendation: Set to 1 to evaluate after each epoch
  • Impact: More frequent evaluation helps monitor training progress
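The parameters above interact: batch size determines how many optimizer steps one pass over the dataset takes, which is worth knowing when reading training logs. A small worked example (plain arithmetic, no assumptions beyond the numbers on this page):

```python
import math

def steps_per_epoch(num_images, batch_size):
    """Number of optimizer steps in one full pass over the dataset."""
    return math.ceil(num_images / batch_size)

# E.g. the medical-imaging example later on this page: 5,000 images at
# batch size 8 -> 625 steps per epoch; at the default batch size 4 -> 1,250.
```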

Configuration Tips

Dataset Size Recommendations

Small Datasets (<1,000 images)

  • Not recommended - Use ResNet-18 or ResNet-50 instead
  • If you must use ViT: learning_rate=1e-5, epochs=20, heavy augmentation
  • Expect lower accuracy than CNNs due to limited data

Medium Datasets (1,000-10,000 images)

  • Good choice with proper configuration
  • learning_rate=5e-5, epochs=5-10, batch_size=8
  • Use standard augmentation (horizontal flip, rotation, color jitter)
  • Monitor validation metrics to prevent overfitting

Large Datasets (>10,000 images)

  • Excellent choice - ViT Base excels with abundant data
  • learning_rate=5e-5 to 1e-4, epochs=3-5, batch_size=16-32
  • Standard or light augmentation sufficient
  • Expect superior accuracy to CNNs of similar size

Fine-tuning Best Practices

  1. Start Conservative: Begin with default learning rate (5e-5) and 1-3 epochs
  2. Monitor Loss: Training loss should decrease steadily; plateaus indicate convergence
  3. Check Validation: If validation accuracy lags training, reduce epochs or add regularization
  4. Gradual Increases: If model converges too quickly, carefully increase learning rate by 2x
  5. Batch Size: Use largest batch size that fits in memory for stable gradients
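As a starting point, the practices above combine with this page's defaults into something like the following (a sketch, not a prescribed configuration; adjust per the dataset-size guidance):

```python
# Conservative first run, built from the defaults documented on this page.
config = {
    "batch_size": 8,        # largest size that fits in GPU memory
    "epochs": 3,            # start low; raise only if loss is still falling
    "learning_rate": 5e-5,  # the default; increase by at most ~2x, cautiously
    "eval_steps": 1,        # evaluate frequently to catch overfitting early
}
```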

Hardware Requirements

Minimum Configuration

  • GPU: 8GB VRAM (NVIDIA GTX 1070 or better)
  • RAM: 16GB system memory
  • Storage: 500MB for model weights + dataset size

Recommended Configuration

  • GPU: 16GB VRAM (NVIDIA RTX 4080, A4000, or similar)
  • RAM: 32GB system memory
  • Storage: SSD for faster data loading

CPU Training

  • Possible but not recommended
  • 10-50x slower than GPU training
  • Only viable for very small datasets (<500 images)

Common Issues and Solutions

Out of Memory Errors

Problem: CUDA out of memory during training

Solutions:

  1. Reduce batch_size to 2 or 4
  2. Use gradient accumulation if available
  3. Reduce image resolution (though this may hurt accuracy)
  4. Close other GPU-intensive applications
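Gradient accumulation (solution 2) trades time for memory: gradients from several small micro-batches are combined before a single optimizer step, so the effective batch size is micro_batch x accum_steps while peak memory stays at the micro-batch level. A framework-free sketch of the idea (real trainers do this on loss tensors, typically by scaling each loss by 1/accum_steps before backpropagating):

```python
def accumulated_gradient(micro_batch_grads, accum_steps):
    """Average per-micro-batch mean gradients, as one large batch would.

    micro_batch_grads: the mean gradient of each micro-batch
    (scalars here for clarity; in practice these are tensors).
    """
    assert len(micro_batch_grads) == accum_steps
    return sum(micro_batch_grads) / accum_steps

# Two micro-batches of 4 images behave like one batch of 8: the averaged
# gradient equals the mean gradient over all 8 examples.
```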

Overfitting

Problem: Training accuracy high but validation accuracy low

Solutions:

  1. Reduce epochs (try half of current value)
  2. Add data augmentation
  3. Collect more training data
  4. Use a smaller model (ResNet-50) if data is limited
  5. Apply dropout or other regularization

Slow Training

Problem: Training takes too long per epoch

Solutions:

  1. Increase batch_size (if memory allows)
  2. Use mixed precision training
  3. Ensure data is stored on an SSD rather than an HDD
  4. Verify GPU utilization is high (use nvidia-smi)
  5. Consider using a smaller model for rapid iteration

Poor Accuracy

Problem: Model accuracy is below expectations

Solutions:

  1. Train for more epochs (try doubling current value)
  2. Increase learning rate cautiously (try 1e-4)
  3. Check for class imbalance in dataset
  4. Verify image quality and labeling correctness
  5. Ensure sufficient data per class (aim for 100+ images minimum)
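Checking for class imbalance (solution 3) is quick. This hypothetical helper counts examples per class and reports the ratio between the largest and smallest class:

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the most to the least common class; 1.0 is perfectly balanced."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# As a rough rule of thumb (not a hard threshold), a ratio above ~3 often
# warrants re-sampling or class weighting.
```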

Loss Not Decreasing

Problem: Training loss stays flat or increases

Solutions:

  1. Increase learning rate (try 1e-4 or 2e-4)
  2. Check data loading - verify images are loading correctly
  3. Verify labels match folder structure
  4. Try simpler model (ResNet-18) to rule out data issues
  5. Ensure images are normalized properly
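On solution 5: ViT checkpoints commonly expect inputs scaled to [0, 1] and then standardized with per-channel mean and std of 0.5 (these values are an assumption, drawn from common ViT preprocessing, not from this tool; check your preprocessor's config). A quick self-check for a single pixel:

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Scale an 8-bit pixel to [0, 1], then standardize. Mid-gray maps to ~0."""
    return (value / 255.0 - mean) / std
```

If mid-gray inputs do not land near zero after your preprocessing pipeline, the normalization constants are likely wrong for the checkpoint.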

Example Use Cases

Medical Image Classification

Scenario: Classifying X-rays into normal/abnormal categories

Configuration:

Model: ViT Base
Batch Size: 8
Epochs: 10
Learning Rate: 3e-5
Images: 5,000 X-rays (2,500 per class)

Why ViT Base: High accuracy requirements, sufficient medical imaging data, global context important for diagnosis

Expected Results: 92-95% accuracy with proper data quality and balanced classes

Product Categorization

Scenario: E-commerce product classification into 50 categories

Configuration:

Model: ViT Base
Batch Size: 16
Epochs: 5
Learning Rate: 5e-5
Images: 15,000 products (300 per category)

Why ViT Base: Many categories benefit from transformer's attention mechanism, sufficient data per class

Expected Results: 85-90% accuracy depending on category similarity and image quality

Wildlife Species Identification

Scenario: Identifying animal species from camera trap images

Configuration:

Model: ViT Base
Batch Size: 4
Epochs: 15
Learning Rate: 2e-5
Images: 2,000 images across 20 species

Why ViT Base: Complex patterns, varying backgrounds, need high accuracy for conservation work

Expected Results: 80-88% accuracy; consider more data or ResNet-50 if accuracy insufficient

Comparison with Alternatives

ViT Base vs ResNet-50

Choose ViT Base when:

  • You have >1,000 images per class
  • Accuracy is more important than speed
  • You have GPU resources available
  • Dataset has complex, non-local patterns

Choose ResNet-50 when:

  • Dataset is small (<1,000 total images)
  • Training time is critical
  • Inference speed matters
  • Computational resources are limited

ViT Base vs ViT Large

Choose ViT Base when:

  • Dataset is moderate size (1k-50k images)
  • GPU memory is limited (8-16GB)
  • Training time is a concern
  • Accuracy requirements are reasonable

Choose ViT Large when:

  • Large dataset (>50k images)
  • Maximum accuracy needed
  • Ample GPU resources (24GB+ VRAM)
  • Inference latency is acceptable

ViT Base vs EfficientNet-B0

Choose ViT Base when:

  • Accuracy is priority over efficiency
  • Sufficient training data available
  • Modern GPU hardware in use

Choose EfficientNet-B0 when:

  • Parameter efficiency is important
  • Deployment size constraints exist
  • Training with limited data
  • Need balance of accuracy and speed
