Multi-layer Perceptron (MLP)
Neural network with fully connected layers for complex, non-linear pattern recognition and classification tasks.
When to Use
- Complex non-linear patterns in data
- Large datasets with sufficient training samples
- As a modern alternative to traditional ML algorithms
- When you can sacrifice interpretability for accuracy
- When you have computational resources for training
Strengths
- Handles very complex patterns and interactions
- Flexible architecture (can adjust layers and neurons)
- Proven effective in production systems
- Works well with diverse feature types
- Can learn hierarchical representations
Weaknesses
- Needs more data than traditional methods
- Longer training time
- Requires careful hyperparameter tuning
- Black box - difficult to interpret decisions
- Prone to overfitting without regularization
- Sensitive to feature scaling
Model Parameters
Hidden Layer Sizes
Default: (100,)
Architecture of the neural network - number of neurons in each hidden layer.
Format: Tuple of integers, e.g., (100, 50) means 2 layers with 100 and 50 neurons
Examples:
- (100,) - Single hidden layer with 100 neurons (default)
- (50,) - Single smaller layer (faster, simpler patterns)
- (100, 50) - Two layers: 100 then 50 neurons
- (200, 100, 50) - Three layers with decreasing size
- (128, 64, 32) - Deep network for complex patterns
Guidelines:
- Start with 1 layer for simple problems
- Use 2-3 layers for complex patterns
- First layer typically largest
- More layers = more capacity but slower training
- Typical sizes: 50-200 neurons per layer
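As a minimal sketch, the architecture is passed to scikit-learn's MLPClassifier as a tuple, one entry per hidden layer:

```python
from sklearn.neural_network import MLPClassifier

# Single hidden layer with 100 neurons (the default architecture)
clf_small = MLPClassifier(hidden_layer_sizes=(100,), random_state=42)

# Two hidden layers: 100 neurons feeding into 50
clf_deep = MLPClassifier(hidden_layer_sizes=(100, 50), random_state=42)

print(clf_small.hidden_layer_sizes)  # (100,)
print(clf_deep.hidden_layer_sizes)   # (100, 50)
```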
Activation Function
Default: relu
Non-linear function applied after each layer.
Options:
- relu - Rectified Linear Unit (default, most common)
- Fast computation
- Works well for most problems
- Can suffer from "dying ReLU" problem
- tanh - Hyperbolic tangent
- Smooth, centered around 0
- Good for smaller networks
- Can saturate (vanishing gradients)
- logistic - Sigmoid function
- Output between 0 and 1
- Can saturate easily
- Slower than ReLU
Recommendation: Start with relu, try tanh if training is unstable
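To make the differences concrete, here is a small NumPy sketch of the three activations applied element-wise to a few pre-activation values (this mirrors what MLPClassifier applies after each hidden layer):

```python
import numpy as np

# A few pre-activation values: negative, zero, positive
z = np.array([-2.0, 0.0, 2.0])

relu = np.maximum(0, z)          # clips negatives to 0 → [0. 0. 2.]
tanh = np.tanh(z)                # smooth, centered on 0, range (-1, 1)
logistic = 1 / (1 + np.exp(-z))  # range (0, 1), saturates at the extremes

print(relu)
print(tanh)      # approx [-0.96, 0.0, 0.96]
print(logistic)  # approx [0.12, 0.5, 0.88]
```

Note how relu simply zeroes negative inputs (the source of the "dying ReLU" problem), while tanh and logistic flatten out for large |z| (the source of vanishing gradients).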
Solver
Default: adam
Optimization algorithm for training.
Options:
- adam - Adaptive Moment Estimation
- Best for large datasets (>1000 samples)
- Adaptive learning rates
- Fast convergence
- Default choice for most cases
- sgd - Stochastic Gradient Descent
- Good with learning rate schedule
- More stable but slower
- Better for some noisy data
- lbfgs - Limited-memory BFGS
- Good for small datasets (<1000 samples)
- Quasi-Newton method
- Fast on small data but doesn't scale
Guidelines:
- Large data (>1k): Use adam
- Small data (<1k): Try lbfgs
- Noisy data: Try sgd with momentum
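The guidelines above can be sketched as a small helper. Note that `pick_solver` is an illustrative function written for this guide, not part of scikit-learn:

```python
def pick_solver(n_samples: int, noisy: bool = False) -> str:
    """Illustrative heuristic following the solver guidelines above."""
    if n_samples < 1000:
        return "lbfgs"   # quasi-Newton: fast on small data, doesn't scale
    if noisy:
        return "sgd"     # pair with momentum / a learning-rate schedule
    return "adam"        # adaptive learning rates, good default at scale

print(pick_solver(500))                  # lbfgs
print(pick_solver(50_000))               # adam
print(pick_solver(50_000, noisy=True))   # sgd
```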
Alpha
Default: 0.0001
L2 regularization parameter to prevent overfitting.
How it works: Penalizes large weights
Values:
- 0.0001 - Light regularization (default)
- 0.001 - Moderate regularization
- 0.01 - Strong regularization (prevents overfitting)
- 0.00001 - Very light (almost none)
When to adjust:
- Overfitting (high train, low validation accuracy) → Increase alpha
- Underfitting (low train and validation accuracy) → Decrease alpha
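A hedged sketch of how you might inspect the train/validation gap across alpha values on a synthetic dataset (the dataset and alpha grid here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=42)

for alpha in (0.00001, 0.0001, 0.001, 0.01):
    clf = MLPClassifier(alpha=alpha, max_iter=200, random_state=42)
    clf.fit(X_tr, y_tr)
    # A large positive gap suggests overfitting → increase alpha
    gap = clf.score(X_tr, y_tr) - clf.score(X_va, y_va)
    print(f"alpha={alpha}: train/validation gap = {gap:.3f}")
```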
Learning Rate
Default: constant
How the learning rate changes during training.
Options:
- constant - Fixed learning rate throughout (default)
- Simple and reliable
- Works with adam solver
- invscaling - Gradually decreasing
- learning_rate = learning_rate_init / (t^power_t)
- Good for sgd solver
- Helps convergence
- adaptive - Adapts based on training progress
- Reduces learning rate when validation stops improving
- Only for sgd solver
- Can help escape plateaus
Recommendation: Use constant with adam solver
Learning Rate Init
Default: 0.001
Initial learning rate for weight updates (only for sgd/adam).
Values:
- 0.001 - Standard (default)
- 0.01 - Faster learning (may be unstable)
- 0.0001 - Slower, more stable learning
Guidelines:
- Start with default 0.001
- If training is unstable → decrease to 0.0001
- If training is too slow → increase to 0.01
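One way to judge stability is to inspect the per-epoch training loss that scikit-learn records in `loss_curve_` (available for the sgd and adam solvers). A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, random_state=42)

for lr in (0.0001, 0.001, 0.01):
    clf = MLPClassifier(learning_rate_init=lr, max_iter=50, random_state=42)
    clf.fit(X, y)
    # loss_curve_ holds one training-loss value per epoch;
    # a jumpy curve suggests the learning rate is too high
    print(f"lr={lr}: final loss after {clf.n_iter_} epochs "
          f"= {clf.loss_curve_[-1]:.3f}")
```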
Max Iterations
Default: 200
Maximum number of training epochs.
Values:
- 200 - Default, usually sufficient
- 100 - Faster training, may underfit
- 500+ - More thorough training, risk of overfitting
- 1000+ - Deep learning, use with early stopping
Guidelines:
- Small/simple data: 100-200 iterations
- Complex patterns: 300-500 iterations
- Always use with early_stopping to prevent overfitting
Early Stopping
Default: false
Stop training when validation score stops improving.
Options:
- false - Train for full max_iter (default)
- true - Stop early if no improvement for 10 consecutive epochs
Benefits when enabled:
- Prevents overfitting
- Saves training time
- Automatically finds optimal stopping point
Recommendation: Enable for datasets with >1000 samples
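A short sketch of early stopping in action on synthetic data; with `early_stopping=True`, scikit-learn holds out a validation split internally and stops once the validation score stops improving, so training usually finishes well before `max_iter`:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, random_state=42)

# Holds out validation_fraction of the training data and stops once the
# validation score fails to improve for 10 consecutive epochs
clf = MLPClassifier(early_stopping=True, max_iter=500, random_state=42)
clf.fit(X, y)
print(f"stopped after {clf.n_iter_} of 500 epochs")
```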
Validation Fraction
Default: 0.1 (when early_stopping is enabled)
Fraction of training data used for validation during early stopping.
Values:
- 0.1 - 10% for validation (default)
- 0.2 - 20% for validation (more reliable, less training data)
- 0.05 - 5% for validation (more training data, less reliable)
Random State
Default: 42
Seed for weight initialization and random operations.
Purpose: Ensures reproducible results across runs
Recommendation: Keep at 42 or set to your preferred seed for consistency
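Putting the parameters above together, here is a hedged end-to-end sketch on a synthetic dataset, wrapped in a Pipeline so features are scaled first (MLP is sensitive to feature scale):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

model = make_pipeline(
    StandardScaler(),  # scale features before the network sees them
    MLPClassifier(
        hidden_layer_sizes=(100, 50),
        activation="relu",
        solver="adam",
        alpha=0.0001,
        early_stopping=True,
        max_iter=300,
        random_state=42,
    ),
)
model.fit(X_tr, y_tr)
print(f"test accuracy: {model.score(X_te, y_te):.3f}")
```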
Configuration Tips
For Small Datasets (<1k samples)
hidden_layer_sizes: (50,)
solver: lbfgs
alpha: 0.01
max_iter: 200
For Medium Datasets (1k-10k samples)
hidden_layer_sizes: (100, 50)
solver: adam
alpha: 0.0001
early_stopping: true
max_iter: 300
For Large Datasets (>10k samples)
hidden_layer_sizes: (200, 100)
solver: adam
alpha: 0.0001
early_stopping: true
max_iter: 500
learning_rate_init: 0.001
For Complex Patterns
hidden_layer_sizes: (128, 64, 32)
solver: adam
activation: relu
alpha: 0.001
early_stopping: true
max_iter: 500
Common Issues and Solutions
Training is very slow:
- Reduce hidden_layer_sizes (fewer neurons/layers)
- Reduce max_iter
- Train on a smaller sample of the data
Underfitting (low accuracy on train and validation):
- Increase hidden_layer_sizes (more neurons/layers)
- Decrease alpha (less regularization)
- Increase max_iter
- Try different activation function
Overfitting (high train accuracy, low validation accuracy):
- Increase alpha (more regularization)
- Enable early_stopping
- Reduce hidden_layer_sizes
- Add more training data
Training is unstable (accuracy jumps around):
- Decrease learning_rate_init
- Try different solver (e.g., sgd instead of adam)
- Use learning_rate='adaptive'
- Check feature scaling (MLP requires scaled features!)
Important Notes
Feature Scaling is Critical
- MLP is very sensitive to feature scales
- Always scale/normalize features before training
- Use StandardScaler or MinMaxScaler
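A quick sketch of what StandardScaler does, using two features on very different scales (the feature values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on wildly different scales (e.g. age vs. income)
X = np.array([[25, 40_000.0],
              [35, 90_000.0],
              [45, 60_000.0]])

# StandardScaler centers each column to mean 0 and scales to std 1
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # approx [0, 0]
print(X_scaled.std(axis=0))   # approx [1, 1]
```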
Computational Requirements
- More neurons/layers = longer training
- GPU acceleration not available in scikit-learn
- Consider simpler models first
When to Use vs. Alternatives
- For tabular data, try XGBoost/LightGBM first
- MLP shines when you have lots of data
- Traditional ML often works better on small tabular datasets
Interpretability Trade-off
- MLP is a black box
- Use simpler models if interpretability matters
- Consider Logistic Regression or Decision Trees for explainability