Multi-layer Perceptron (MLP)
Neural network with fully connected layers for complex, non-linear pattern recognition and classification tasks.
When to Use
- Complex non-linear patterns in data
- Large datasets with sufficient training samples
- As a modern alternative to traditional ML algorithms
- When you can sacrifice interpretability for accuracy
- When you have computational resources for training
Strengths
- Handles very complex patterns and interactions
- Flexible architecture (can adjust layers and neurons)
- Proven effective in production systems
- Works well with diverse feature types
- Can learn hierarchical representations
Weaknesses
- Needs more data than traditional methods
- Longer training time
- Requires careful hyperparameter tuning
- Black box - difficult to interpret decisions
- Prone to overfitting without regularization
- Sensitive to feature scaling
Model Parameters
Hidden Layer Sizes
Default: (100,)
Architecture of the neural network - number of neurons in each hidden layer.
Format: Tuple of integers, e.g., (100, 50) means 2 layers with 100 and 50 neurons
Examples:
- (100,) - Single hidden layer with 100 neurons (default)
- (50,) - Single smaller layer (faster, simpler patterns)
- (100, 50) - Two layers: 100 then 50 neurons
- (200, 100, 50) - Three layers with decreasing size
- (128, 64, 32) - Deep network for complex patterns
Guidelines:
- Start with 1 layer for simple problems
- Use 2-3 layers for complex patterns
- First layer typically largest
- More layers = more capacity but slower training
- Typical sizes: 50-200 neurons per layer
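As a minimal sketch, the architecture is passed to scikit-learn's MLPClassifier as a tuple, one entry per hidden layer:

```python
from sklearn.neural_network import MLPClassifier

# Single hidden layer with 100 neurons (the default architecture)
clf_small = MLPClassifier(hidden_layer_sizes=(100,), random_state=42)

# Two hidden layers: 100 neurons feeding into 50
clf_deep = MLPClassifier(hidden_layer_sizes=(100, 50), random_state=42)

print(clf_small.hidden_layer_sizes)  # (100,)
print(clf_deep.hidden_layer_sizes)   # (100, 50)
```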
Activation Function
Default: relu
Non-linear function applied after each layer.
Options:
- relu - Rectified Linear Unit (default, most common)
- Fast computation
- Works well for most problems
- Can suffer from "dying ReLU" problem
- tanh - Hyperbolic tangent
- Smooth, centered around 0
- Good for smaller networks
- Can saturate (vanishing gradients)
- logistic - Sigmoid function
- Output between 0 and 1
- Can saturate easily
- Slower than ReLU
Recommendation: Start with relu, try tanh if training is unstable
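To make the differences concrete, here is a small NumPy sketch of the three activations applied element-wise to a few pre-activation values (this mirrors what MLPClassifier applies after each hidden layer):

```python
import numpy as np

# A few pre-activation values: negative, zero, positive
z = np.array([-2.0, 0.0, 2.0])

relu = np.maximum(0, z)          # clips negatives to 0 → [0. 0. 2.]
tanh = np.tanh(z)                # smooth, centered on 0, range (-1, 1)
logistic = 1 / (1 + np.exp(-z))  # range (0, 1), saturates at the extremes

print(relu)
print(tanh)      # approx [-0.96, 0.0, 0.96]
print(logistic)  # approx [0.12, 0.5, 0.88]
```

Note how relu simply zeroes negative inputs (the source of the "dying ReLU" problem), while tanh and logistic flatten out for large |z| (the source of vanishing gradients).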
Solver
Default: adam
Optimization algorithm for training.
Options:
- adam - Adaptive Moment Estimation
- Best for large datasets (>1000 samples)
- Adaptive learning rates
- Fast convergence
- Default choice for most cases
- sgd - Stochastic Gradient Descent
- Good with learning rate schedule
- More stable but slower
- Better for some noisy data
- lbfgs - Limited-memory BFGS
- Good for small datasets (<1000 samples)
- Quasi-Newton method
- Fast on small data but doesn't scale
Guidelines:
- Large data (>1k): Use adam
- Small data (<1k): Try lbfgs
- Noisy data: Try sgd with momentum
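The guidelines above can be sketched as a small helper. Note that `pick_solver` is an illustrative function written for this guide, not part of scikit-learn:

```python
def pick_solver(n_samples: int, noisy: bool = False) -> str:
    """Illustrative heuristic following the solver guidelines above."""
    if n_samples < 1000:
        return "lbfgs"   # quasi-Newton: fast on small data, doesn't scale
    if noisy:
        return "sgd"     # pair with momentum / a learning-rate schedule
    return "adam"        # adaptive learning rates, good default at scale

print(pick_solver(500))                  # lbfgs
print(pick_solver(50_000))               # adam
print(pick_solver(50_000, noisy=True))   # sgd
```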
Alpha
Default: 0.0001
L2 regularization parameter to prevent overfitting.
How it works: Penalizes large weights
Values:
- 0.0001 - Light regularization (default)
- 0.001 - Moderate regularization
- 0.01 - Strong regularization (prevents overfitting)
- 0.00001 - Very light (almost none)
When to adjust:
- Overfitting (high train, low validation accuracy) → Increase alpha
- Underfitting (low train and validation accuracy) → Decrease alpha
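A hedged sketch of how you might inspect the train/validation gap across alpha values on a synthetic dataset (the dataset and alpha grid here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=42)

for alpha in (0.00001, 0.0001, 0.001, 0.01):
    clf = MLPClassifier(alpha=alpha, max_iter=200, random_state=42)
    clf.fit(X_tr, y_tr)
    # A large positive gap suggests overfitting → increase alpha
    gap = clf.score(X_tr, y_tr) - clf.score(X_va, y_va)
    print(f"alpha={alpha}: train/validation gap = {gap:.3f}")
```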
Learning Rate
Default: constant
How the learning rate changes during training.
Options:
- constant - Fixed learning rate throughout (default)
- Simple and reliable
- Works with adam solver
- invscaling - Gradually decreasing
- learning_rate = learning_rate_init / (t^power_t)
- Good for sgd solver
- Helps convergence
- adaptive - Adapts based on training progress
- Reduces learning rate when validation stops improving
- Only for sgd solver
- Can help escape plateaus
Recommendation: Use constant with adam solver
Learning Rate Init
Default: 0.001
Initial learning rate for weight updates (only for sgd/adam).
Values:
- 0.001 - Standard (default)
- 0.01 - Faster learning (may be unstable)
- 0.0001 - Slower, more stable learning
Guidelines:
- Start with default 0.001
- If training is unstable → decrease to 0.0001
- If training is too slow → increase to 0.01
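One way to judge stability is to inspect the per-epoch training loss that scikit-learn records in `loss_curve_` (available for the sgd and adam solvers). A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, random_state=42)

for lr in (0.0001, 0.001, 0.01):
    clf = MLPClassifier(learning_rate_init=lr, max_iter=50, random_state=42)
    clf.fit(X, y)
    # loss_curve_ holds one training-loss value per epoch;
    # a jumpy curve suggests the learning rate is too high
    print(f"lr={lr}: final loss after {clf.n_iter_} epochs "
          f"= {clf.loss_curve_[-1]:.3f}")
```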
Max Iterations
Default: 200
Maximum number of training epochs.
Values:
- 200 - Default, usually sufficient
- 100 - Faster training, may underfit
- 500+ - More thorough training, risk of overfitting
- 1000+ - Deep learning, use with early stopping
Guidelines:
- Small/simple data: 100-200 iterations
- Complex patterns: 300-500 iterations
- Always use with early_stopping to prevent overfitting
Early Stopping
Default: false
Stop training when validation score stops improving.
Options:
- false - Train for full max_iter (default)
- true - Stop early if no improvement for 10 consecutive epochs
Benefits when enabled:
- Prevents overfitting
- Saves training time
- Automatically finds optimal stopping point
Recommendation: Enable for datasets with >1000 samples
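A short sketch of early stopping in action on synthetic data; with `early_stopping=True`, scikit-learn holds out a validation split internally and stops once the validation score stops improving, so training usually finishes well before `max_iter`:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, random_state=42)

# Holds out validation_fraction of the training data and stops once the
# validation score fails to improve for 10 consecutive epochs
clf = MLPClassifier(early_stopping=True, max_iter=500, random_state=42)
clf.fit(X, y)
print(f"stopped after {clf.n_iter_} of 500 epochs")
```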
Validation Fraction
Default: 0.1 (when early_stopping is enabled)
Fraction of training data used for validation during early stopping.
Values:
- 0.1 - 10% for validation (default)
- 0.2 - 20% for validation (more reliable, less training data)
- 0.05 - 5% for validation (more training data, less reliable)
Random State
Default: 42
Seed for weight initialization and random operations.
Purpose: Ensures reproducible results across runs
Recommendation: Keep at 42 or set to your preferred seed for consistency
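Putting the parameters above together, here is a hedged end-to-end sketch on a synthetic dataset, wrapped in a Pipeline so features are scaled first (MLP is sensitive to feature scale):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

model = make_pipeline(
    StandardScaler(),  # scale features before the network sees them
    MLPClassifier(
        hidden_layer_sizes=(100, 50),
        activation="relu",
        solver="adam",
        alpha=0.0001,
        early_stopping=True,
        max_iter=300,
        random_state=42,
    ),
)
model.fit(X_tr, y_tr)
print(f"test accuracy: {model.score(X_te, y_te):.3f}")
```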
Configuration Tips
For Small Datasets (<1k samples)
hidden_layer_sizes: (50,)
solver: lbfgs
alpha: 0.01
max_iter: 200
For Medium Datasets (1k-10k samples)
hidden_layer_sizes: (100, 50)
solver: adam
alpha: 0.0001
early_stopping: true
max_iter: 300
For Large Datasets (>10k samples)
hidden_layer_sizes: (200, 100)
solver: adam
alpha: 0.0001
early_stopping: true
max_iter: 500
learning_rate_init: 0.001
For Complex Patterns
hidden_layer_sizes: (128, 64, 32)
solver: adam
activation: relu
alpha: 0.001
early_stopping: true
max_iter: 500
Common Issues and Solutions
Training is very slow:
- Reduce hidden_layer_sizes (fewer neurons/layers)
- Reduce max_iter
- Train on a smaller sample of the data
Underfitting (low accuracy on train and validation):
- Increase hidden_layer_sizes (more neurons/layers)
- Decrease alpha (less regularization)
- Increase max_iter
- Try different activation function
Overfitting (high train accuracy, low validation accuracy):
- Increase alpha (more regularization)
- Enable early_stopping
- Reduce hidden_layer_sizes
- Add more training data
Training is unstable (accuracy jumps around):
- Decrease learning_rate_init
- Try different solver (e.g., sgd instead of adam)
- Use learning_rate='adaptive'
- Check feature scaling (MLP requires scaled features!)
Important Notes
Feature Scaling is Critical
- MLP is very sensitive to feature scales
- Always scale/normalize features before training
- Use StandardScaler or MinMaxScaler
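A quick sketch of what StandardScaler does, using two features on very different scales (the feature values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on wildly different scales (e.g. age vs. income)
X = np.array([[25, 40_000.0],
              [35, 90_000.0],
              [45, 60_000.0]])

# StandardScaler centers each column to mean 0 and scales to std 1
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # approx [0, 0]
print(X_scaled.std(axis=0))   # approx [1, 1]
```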
Computational Requirements
- More neurons/layers = longer training
- GPU acceleration not available in scikit-learn
- Consider simpler models first
When to Use vs. Alternatives
- For tabular data, try XGBoost/LightGBM first
- MLP shines when you have lots of data
- Traditional ML often works better on small tabular datasets
Interpretability Trade-off
- MLP is a black box
- Use simpler models if interpretability matters
- Consider Logistic Regression or Decision Trees for explainability