Documentation (English)

Multi-layer Perceptron (MLP)

Neural network with fully connected layers for classification on complex, non-linear patterns.

When to Use

  • Complex non-linear patterns in data
  • Large datasets with sufficient training samples
  • Modern alternative to traditional ML algorithms
  • You can sacrifice interpretability for accuracy
  • You have computational resources for training

Strengths

  • Handles very complex patterns and interactions
  • Flexible architecture (can adjust layers and neurons)
  • Proven effective in production systems
  • Works well with diverse feature types
  • Can learn hierarchical representations

Weaknesses

  • Needs more data than traditional methods
  • Longer training time
  • Requires careful hyperparameter tuning
  • Black box - difficult to interpret decisions
  • Prone to overfitting without regularization
  • Sensitive to feature scaling

Model Parameters

Hidden Layer Sizes

Default: (100,)

Architecture of the neural network - number of neurons in each hidden layer.

Format: Tuple of integers, e.g., (100, 50) means 2 layers with 100 and 50 neurons

Examples:

  • (100,) - Single hidden layer with 100 neurons (default)
  • (50,) - Single smaller layer (faster, simpler patterns)
  • (100, 50) - Two layers: 100 then 50 neurons
  • (200, 100, 50) - Three layers with decreasing size
  • (128, 64, 32) - Deep network for complex patterns

Guidelines:

  • Start with 1 layer for simple problems
  • Use 2-3 layers for complex patterns
  • First layer typically largest
  • More layers = more capacity but slower training
  • Typical sizes: 50-200 neurons per layer
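
As a concrete illustration of the guidelines above, a minimal sketch using scikit-learn's MLPClassifier (the synthetic dataset is made up for the example):

```python
# Sketch: two-hidden-layer MLP on synthetic data (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, scaled as MLP requires
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X = StandardScaler().fit_transform(X)

# (100, 50): first layer largest, second layer smaller, per the guidelines
clf = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=300, random_state=42)
clf.fit(X, y)

# n_layers_ counts input + hidden + output layers: 1 + 2 + 1 = 4
print(clf.n_layers_)
```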

Activation Function

Default: relu

Non-linear function applied after each layer.

Options:

  • relu - Rectified Linear Unit (default, most common)
    • Fast computation
    • Works well for most problems
    • Can suffer from "dying ReLU" problem
  • tanh - Hyperbolic tangent
    • Smooth, centered around 0
    • Good for smaller networks
    • Can saturate (vanishing gradients)
  • logistic - Sigmoid function
    • Output between 0 and 1
    • Can saturate easily
    • Slower than ReLU

Recommendation: Start with relu, try tanh if training is unstable
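
To compare the three activations in practice, a small sketch (assumes scikit-learn; the two-moons dataset stands in for a non-linear pattern, and scores will vary with the seed):

```python
# Sketch: same network, three activation functions (assumes scikit-learn).
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X = StandardScaler().fit_transform(X)  # scaling matters for all activations

for activation in ["relu", "tanh", "logistic"]:
    clf = MLPClassifier(hidden_layer_sizes=(50,), activation=activation,
                        max_iter=500, random_state=0)
    clf.fit(X, y)
    print(activation, round(clf.score(X, y), 3))  # training accuracy
```

relu is usually the fastest to train; tanh and logistic can be competitive on small networks like this one.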

Solver

Default: adam

Optimization algorithm for training.

Options:

  • adam - Adaptive Moment Estimation
    • Best for large datasets (>1000 samples)
    • Adaptive learning rates
    • Fast convergence
    • Default choice for most cases
  • sgd - Stochastic Gradient Descent
    • Good with learning rate schedule
    • More stable but slower
    • Better for some noisy data
  • lbfgs - Limited-memory BFGS
    • Good for small datasets (<1000 samples)
    • Quasi-Newton method
    • Fast on small data but doesn't scale

Guidelines:

  • Large data (>1k): Use adam
  • Small data (<1k): Try lbfgs
  • Noisy data: Try sgd with momentum
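
These guidelines can be captured in a tiny helper. `pick_solver` is a hypothetical function invented for this sketch, not part of scikit-learn:

```python
def pick_solver(n_samples: int, noisy: bool = False) -> str:
    """Heuristic solver choice following the guidelines above (hypothetical)."""
    if noisy:
        return "sgd"  # pair with momentum and a learning rate schedule
    return "lbfgs" if n_samples < 1000 else "adam"

print(pick_solver(500))    # -> lbfgs
print(pick_solver(50000))  # -> adam
```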

Alpha

Default: 0.0001

L2 regularization parameter to prevent overfitting.

How it works: Penalizes large weights

Values:

  • 0.00001 - Very light (almost none)
  • 0.0001 - Light regularization (default)
  • 0.001 - Moderate regularization
  • 0.01 - Strong regularization (prevents overfitting)

When to adjust:

  • Overfitting (high train, low validation accuracy) → Increase alpha
  • Underfitting (low train and validation accuracy) → Decrease alpha
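
One way to see the L2 penalty at work is to sweep alpha and print the total weight norm alongside training accuracy; larger alpha tends to push weights toward zero. A sketch assuming scikit-learn (exact numbers will vary):

```python
# Sketch: effect of the alpha (L2) parameter on learned weights.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=20, random_state=1)
X = StandardScaler().fit_transform(X)

for alpha in [1e-5, 1e-4, 1e-3, 1e-2]:
    clf = MLPClassifier(hidden_layer_sizes=(50,), alpha=alpha,
                        max_iter=300, random_state=1).fit(X, y)
    # coefs_ holds one weight matrix per layer transition
    w_norm = np.sqrt(sum(np.sum(c ** 2) for c in clf.coefs_))
    print(f"alpha={alpha:g}  train acc={clf.score(X, y):.3f}  ||w||={w_norm:.1f}")
```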

Learning Rate

Default: constant

How the learning rate changes during training.

Options:

  • constant - Fixed learning rate throughout (default)
    • Simple and reliable
    • Note: adam ignores this setting and adapts rates internally
  • invscaling - Gradually decreasing
    • effective_learning_rate = learning_rate_init / t^power_t
    • Only used with the sgd solver
    • Helps convergence
  • adaptive - Adapts based on training progress
    • Divides the learning rate by 5 when the loss stops improving
    • Only used with the sgd solver
    • Can help escape plateaus

Recommendation: Use constant with adam solver

Learning Rate Init

Default: 0.001

Initial learning rate for weight updates (only for sgd/adam).

Values:

  • 0.001 - Standard (default)
  • 0.01 - Faster learning (may be unstable)
  • 0.0001 - Slower, more stable learning

Guidelines:

  • Start with default 0.001
  • If training is unstable → decrease to 0.0001
  • If training is too slow → increase to 0.01
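
The two learning-rate settings work together; a hedged sketch with the sgd solver and an inverse-scaling schedule (assumes scikit-learn; power_t=0.5 is the library default, shown explicitly here):

```python
# Sketch: sgd with a decaying learning rate (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, random_state=3)
X = StandardScaler().fit_transform(X)

clf = MLPClassifier(solver="sgd",
                    learning_rate="invscaling",   # decay schedule (sgd only)
                    learning_rate_init=0.01,      # starting rate
                    power_t=0.5,                  # decay exponent
                    max_iter=300, random_state=3).fit(X, y)
print(round(clf.score(X, y), 3))  # training accuracy
```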

Max Iterations

Default: 200

Maximum number of training epochs.

Values:

  • 200 - Default, usually sufficient
  • 100 - Faster training, may underfit
  • 500+ - More thorough training, risk of overfitting
  • 1000+ - Very long training; combine with early stopping

Guidelines:

  • Small/simple data: 100-200 iterations
  • Complex patterns: 300-500 iterations
  • Always use with early_stopping to prevent overfitting

Early Stopping

Default: false

Stop training when validation score stops improving.

Options:

  • false - Train for full max_iter (default)
  • true - Stop if the validation score does not improve for 10 consecutive epochs (n_iter_no_change)

Benefits when enabled:

  • Prevents overfitting
  • Saves training time
  • Automatically finds optimal stopping point

Recommendation: Enable for datasets with >1000 samples
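
A sketch of early stopping in action (assumes scikit-learn); the fitted model's n_iter_ attribute reports the epoch at which training actually stopped:

```python
# Sketch: early stopping on a mid-sized synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X = StandardScaler().fit_transform(X)

clf = MLPClassifier(hidden_layer_sizes=(100,),
                    early_stopping=True,       # hold out a validation split
                    validation_fraction=0.1,   # 10% of training data
                    n_iter_no_change=10,       # patience in epochs
                    max_iter=500, random_state=0).fit(X, y)

# Often stops well before max_iter
print(clf.n_iter_, "of", clf.max_iter)
```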

Validation Fraction

Default: 0.1 (when early_stopping is enabled)

Fraction of training data used for validation during early stopping.

Values:

  • 0.1 - 10% for validation (default)
  • 0.2 - 20% for validation (more reliable, less training data)
  • 0.05 - 5% for validation (more training data, less reliable)

Random State

Default: 42

Seed for weight initialization and random operations.

Purpose: Ensures reproducible results across runs

Recommendation: Keep at 42 or set to your preferred seed for consistency


Configuration Tips

For Small Datasets (<1k samples)

hidden_layer_sizes: (50,)
solver: lbfgs
alpha: 0.01
max_iter: 200

For Medium Datasets (1k-10k samples)

hidden_layer_sizes: (100, 50)
solver: adam
alpha: 0.0001
early_stopping: true
max_iter: 300

For Large Datasets (>10k samples)

hidden_layer_sizes: (200, 100)
solver: adam
alpha: 0.0001
early_stopping: true
max_iter: 500
learning_rate_init: 0.001

For Complex Patterns

hidden_layer_sizes: (128, 64, 32)
solver: adam
activation: relu
alpha: 0.001
early_stopping: true
max_iter: 500
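
For reference, the "medium dataset" preset above written out as constructor arguments (assumes scikit-learn; random_state=42 follows the Random State default above):

```python
# Sketch: the medium-dataset preset as an MLPClassifier instance.
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    solver="adam",
    alpha=0.0001,
    early_stopping=True,
    max_iter=300,
    random_state=42,
)
print(clf.get_params()["hidden_layer_sizes"])
```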

Common Issues and Solutions

Training is very slow:

  • Reduce hidden_layer_sizes (fewer neurons/layers)
  • Reduce max_iter
  • Train on a subsample of the data first

Underfitting (low accuracy on train and validation):

  • Increase hidden_layer_sizes (more neurons/layers)
  • Decrease alpha (less regularization)
  • Increase max_iter
  • Try different activation function

Overfitting (high train accuracy, low validation accuracy):

  • Increase alpha (more regularization)
  • Enable early_stopping
  • Reduce hidden_layer_sizes
  • Add more training data

Training is unstable (accuracy jumps around):

  • Decrease learning_rate_init
  • Try different solver (e.g., sgd instead of adam)
  • Use learning_rate='adaptive'
  • Check feature scaling (MLP requires scaled features!)

Important Notes

  1. Feature Scaling is Critical

    • MLP is very sensitive to feature scales
    • Always scale/normalize features before training
    • Use StandardScaler or MinMaxScaler
  2. Computational Requirements

    • More neurons/layers = longer training
    • GPU acceleration not available in scikit-learn
    • Consider simpler models first
  3. When to Use vs. Alternatives

    • For tabular data, try XGBoost/LightGBM first
    • MLP shines when you have lots of data
    • Traditional ML often works better on small tabular datasets
  4. Interpretability Trade-off

    • MLP is a black box
    • Use simpler models if interpretability matters
    • Consider Logistic Regression or Decision Trees for explainability
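
Note 1 (feature scaling) is easiest to get right with a pipeline, which fits the scaler on training data only so the test split stays unseen. A sketch assuming scikit-learn and its bundled breast-cancer dataset:

```python
# Sketch: scaler + MLP in one pipeline (assumes scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The pipeline scales inside fit/predict, so leakage is impossible
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42),
)
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 3))  # held-out accuracy
```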
