Time Series
Predicting future values based on sequential temporal data
Time series forecasting predicts future values based on past observations ordered in time. Unlike standard supervised learning, the sequence matters—you can't shuffle rows or treat observations as independent.
Time series forecasting uses historical data (blue) to predict future values (orange). The model learns patterns from the past and projects them forward, with uncertainty typically increasing further into the future.
What Makes Time Series Different
Temporal dependency: Future values depend on past values. Today's sales relate to yesterday's, last week's, and seasonal patterns.
Ordering matters: You can't randomly split or shuffle the data. Time flows forward. Training happens on earlier periods, testing on later ones.
Multiple components: Most series combine trend (long-term direction), seasonality (repeating patterns), cycles (irregular longer-term patterns), and noise (randomness).
Frequency: Data collected daily, hourly, monthly, etc. The frequency determines what patterns you can detect.
Types of Time Series
Univariate: Only the target variable—predict tomorrow's sales using past sales.
Multivariate: Additional time-dependent features—prices, promotions, weather, competitor activity.
Regular: Evenly spaced timestamps (every hour, every day).
Irregular: Missing timestamps, uneven spacing.
Key Concepts
Stationarity
A stationary series has constant mean, constant variance, and stable relationships between values at different lags. Most forecasting models assume stationarity.
Why it matters: Models learn patterns that stay consistent over time. Non-stationary series have shifting patterns that confuse models.
How to detect:
- Visual: Plot the series. Rising/falling trend = non-stationary. Growing variance = non-stationary.
- Autocorrelation: An ACF that decays slowly and stays high at large lags suggests non-stationarity
- Tests: Augmented Dickey-Fuller (ADF) tests for a unit root; KPSS tests the null of stationarity (see the sketch after the fixes below)
How to fix:
- Differencing: Subtract the previous value to remove trend
- Seasonal differencing: Subtract the value from the same point in the previous season (lag 7 for daily data with a weekly cycle, lag 12 for monthly data with a yearly cycle)
- Log transform: Stabilize variance when it grows with level
- Box-Cox transform: General variance stabilization
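A minimal sketch of the detect-and-fix loop with statsmodels, run on a synthetic daily series; the thresholds and the lag-7 seasonal difference are illustrative, not a recipe.

```python
# Sketch: detect non-stationarity with ADF/KPSS, then difference.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, kpss

# Synthetic daily series with a trend and a weekly pattern
idx = pd.date_range("2023-01-01", periods=365, freq="D")
rng = np.random.default_rng(0)
y = pd.Series(
    0.1 * np.arange(365)                          # trend
    + 5 * np.sin(2 * np.pi * np.arange(365) / 7)  # weekly seasonality
    + rng.normal(0, 1, 365),
    index=idx,
)

adf_p = adfuller(y, autolag="AIC")[1]              # H0: unit root (non-stationary); small p suggests stationary
kpss_p = kpss(y, regression="c", nlags="auto")[1]  # H0: stationary; small p suggests non-stationary
print(f"ADF p-value: {adf_p:.3f}, KPSS p-value: {kpss_p:.3f}")

# First difference removes the trend, a lag-7 difference removes the weekly pattern
y_stationary = y.diff().diff(7).dropna()
```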
Decomposition
Break the series into components:
Additive: Y = Trend + Seasonal + Noise (seasonal magnitude stays constant)
Multiplicative: Y = Trend × Seasonal × Noise (seasonal magnitude scales with level)
Methods:
- Classical decomposition: Moving averages extract the trend, then per-season averages estimate the seasonal component
- STL (Seasonal-Trend Loess): More flexible, handles changing seasonality
Decomposition separates a time series into its constituent parts. The original series (top) is broken down into trend (long-term direction), seasonal (repeating patterns), and residual (noise). Understanding these components helps choose appropriate modeling approaches and diagnose problems.
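A short sketch using statsmodels' STL, assuming a pandas Series y with a DatetimeIndex (such as the synthetic series in the earlier sketch).

```python
# Sketch: STL decomposition; period=7 assumes weekly seasonality in daily data.
from statsmodels.tsa.seasonal import STL

res = STL(y, period=7, robust=True).fit()
trend, seasonal, resid = res.trend, res.seasonal, res.resid
res.plot()   # draws observed, trend, seasonal, and residual panels
```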
Autocorrelation
Correlation between the series and lagged versions of itself.
ACF (Autocorrelation Function): Shows correlation at each lag. Peaks indicate repeating patterns or dependencies.
PACF (Partial Autocorrelation Function): Direct correlation after removing indirect effects of shorter lags. Helps determine autoregressive order.
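A quick way to inspect both, continuing with the same series y.

```python
# Sketch: ACF/PACF plots with statsmodels.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(y, lags=30)    # spikes at lags 7, 14, ... hint at weekly seasonality
plot_pacf(y, lags=30)   # significant early lags suggest the AR order p
plt.show()
```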
Classical Statistical Models
Autoregressive (AR)
Current value depends on a weighted sum of recent past values:
y(t) = c + φ1·y(t-1) + ... + φp·y(t-p) + ε(t)
Order p determined by the PACF (it cuts off after lag p).
Moving Average (MA)
Current value depends on past forecast errors:
y(t) = c + ε(t) + θ1·ε(t-1) + ... + θq·ε(t-q)
Order q determined by the ACF (it cuts off after lag q).
ARIMA
Combines AR, differencing (I for Integrated), and MA:
ARIMA(p, d, q):
- p: AR order (how many past values)
- d: Differencing order (how many times to difference)
- q: MA order (how many past errors)
SARIMA: Adds seasonal components for repeating patterns.
SARIMAX: Adds external regressors (prices, promotions).
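A minimal sketch with statsmodels' SARIMAX; the orders here are illustrative, not tuned.

```python
# Sketch: seasonal ARIMA on the daily series y from the earlier sketches.
from statsmodels.tsa.statespace.sarimax import SARIMAX

# (p, d, q) = non-seasonal AR, differencing, MA orders
# (P, D, Q, s) = seasonal counterparts; s = 7 for a weekly cycle in daily data
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7))
# pass exog=... here to add external regressors (the SARIMAX case)
fit = model.fit(disp=False)

point_forecast = fit.forecast(steps=14)           # 14 days ahead
conf_int = fit.get_forecast(steps=14).conf_int()  # intervals widen with horizon
```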
Exponential Smoothing
Updates estimates gradually, weighting recent observations more:
Simple: Level only (no trend or seasonality)
Holt's Method: Level + trend
Holt-Winters: Level + trend + seasonality
Flexible and intuitive. No stationarity required.
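A minimal Holt-Winters sketch with statsmodels; additive trend and seasonality, and seasonal_periods=7 assumes daily data with a weekly cycle.

```python
# Sketch: Holt-Winters exponential smoothing on the series y.
from statsmodels.tsa.holtwinters import ExponentialSmoothing

hw = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=7).fit()
hw_forecast = hw.forecast(14)   # level, trend, and seasonal states projected 14 steps ahead
```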
Machine Learning Approaches
ML models don't inherently understand time. You must engineer features.
Feature Engineering
Lag features: Yesterday's value, last week's value, etc.
Rolling statistics: 7-day mean, 7-day std, rolling max/min
Calendar features: Day of week, month, quarter, holiday indicators
Fourier features: Sine/cosine to encode smooth cyclical patterns
External variables: Prices, promotions, weather, events
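A short pandas sketch that turns a series into a supervised-learning table; the column names are illustrative.

```python
# Sketch: lag, rolling, and calendar features from the series y.
import pandas as pd

df = y.to_frame("sales")

# Lag features: values from 1 day and 1 week ago
for lag in (1, 7):
    df[f"lag_{lag}"] = df["sales"].shift(lag)

# Rolling statistics computed on past values only (shift before rolling)
df["roll_mean_7"] = df["sales"].shift(1).rolling(7).mean()
df["roll_std_7"] = df["sales"].shift(1).rolling(7).std()

# Calendar features
df["dayofweek"] = df.index.dayofweek
df["month"] = df.index.month

df = df.dropna()   # drop rows where lags/rolling windows are undefined
```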
Tree-Based Models
Random Forest, XGBoost, LightGBM work well when:
- Many external features drive the target
- Nonlinear interactions matter
- You need to model thousands of series efficiently
Strengths: Fast, scalable, handle interactions naturally
Weaknesses: Don't understand sequences without explicit lag features
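A short sketch training a gradient-boosted tree on the engineered table from the previous sketch, using scikit-learn's HistGradientBoostingRegressor as a stand-in for XGBoost/LightGBM.

```python
# Sketch: chronological split, then a boosted-tree forecaster on lag/calendar features.
from sklearn.ensemble import HistGradientBoostingRegressor

X, target = df.drop(columns="sales"), df["sales"]
split = int(len(df) * 0.8)                 # no shuffling: earlier rows train, later rows test
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = target.iloc[:split], target.iloc[split:]

model = HistGradientBoostingRegressor(max_iter=300)
model.fit(X_train, y_train)
preds = model.predict(X_test)
```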
Neural Networks
Feed-forward NNs: Similar to tree models, need engineered features
RNNs/LSTMs: Process sequences directly, maintain internal memory for long dependencies
Seq2Seq: Encoder-decoder for multi-step forecasts
Temporal CNNs: Convolutional filters across time, fast and parallel
Transformers: Attention mechanisms focus on relevant past time steps. Powerful but data-hungry.
When to use: Complex patterns, long sequences, many interacting variables, large datasets
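A minimal sketch of an RNN-style forecaster in PyTorch; the layer sizes and the one-step-ahead setup are illustrative.

```python
# Sketch: LSTM that maps a window of past values to the next value.
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, hidden_size: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, window_length, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # predict one step ahead from the last hidden state

# Training would slide a fixed-length window over the (scaled) series to build
# (window, next_value) pairs, then minimize MSE with an optimizer such as Adam.
```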
Evaluation
Never shuffle data. Train on earlier periods, test on later periods.
Metrics
MAE (Mean Absolute Error): Average error in original units. Easy to interpret, stable.
RMSE (Root Mean Squared Error): Penalizes large errors more. Use when big mistakes are costly.
MAPE (Mean Absolute Percentage Error): Error as percentage. Business-friendly. Unstable near zero.
sMAPE (Symmetric MAPE): Divides by average of actual and forecast. More stable than MAPE near zero.
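The four metrics as plain NumPy functions, for reference; y_true and y_pred are arrays of actuals and forecasts.

```python
# Sketch: forecast error metrics.
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100   # blows up when y_true is near zero

def smape(y_true, y_pred):
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    return np.mean(np.abs(y_true - y_pred) / denom) * 100
```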
Validation
Rolling window: Train on Jan-Jun, forecast July. Then train on Jan-Jul, forecast August. Repeat.
Expanding window: Keep the start fixed and add each new period to the training set.
Both schemes simulate real deployment, where models are retrained as new data arrives.
Rolling window validation splits data chronologically into multiple train-test pairs. Each iteration moves forward in time, training on past data and testing on the immediate future. This approach provides realistic error estimates that reflect how the model will perform in production.
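One way to implement this is scikit-learn's TimeSeriesSplit, which produces chronological train/test folds; the sketch below reuses X, target, model, and mae from the earlier sketches.

```python
# Sketch: expanding-window cross-validation.
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5, test_size=30)   # five folds, 30-step test windows
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model.fit(X.iloc[train_idx], target.iloc[train_idx])
    preds = model.predict(X.iloc[test_idx])
    print(f"fold {fold}: MAE = {mae(target.iloc[test_idx].to_numpy(), preds):.2f}")
```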
Horizon
Accuracy decreases with the forecast horizon: predicting one step ahead is easier than predicting 30 steps ahead. Evaluate errors at each horizon separately.
Common Pitfalls
Data leakage: Using future information. Happens with:
- Rolling windows that include future values
- Scaling with statistics computed on the full dataset (including the test period)
- Wrong lag shifts
Solution: All transformations must use only past data. Recompute statistics chronologically.
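A short sketch of the scaling and rolling-window bugs and their fixes, reusing X_train, X_test, and df from the earlier sketches.

```python
# Sketch: two common leakage bugs and their corrections.
from sklearn.preprocessing import StandardScaler

# Wrong: the scaler sees the test period
# scaler = StandardScaler().fit(X)
# Right: fit on the training period only, then apply to both
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Wrong: today's value leaks into today's "past" average
# df["roll_7"] = df["sales"].rolling(7).mean()
# Right: shift first so the window ends yesterday
df["roll_7"] = df["sales"].shift(1).rolling(7).mean()
```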
Ignoring seasonality: Weekly, monthly, yearly patterns matter. Model them explicitly or remove them.
Treating time as a feature: Don't feed the raw timestamp directly into tree models. Extract meaningful features (day of week, month, etc.).
Not monitoring drift: Patterns change. Customer behavior shifts. Retrain regularly and monitor forecast accuracy.
Choosing an Approach
Simple, stable patterns: Exponential smoothing or ARIMA
Strong external drivers: Tree-based ML (XGBoost, LightGBM)
Complex sequences, long dependencies: RNNs, LSTMs, Transformers
Multiple interacting seasonalities: SARIMA or neural networks
Large scale (thousands of series): Tree models or simple methods that parallelize
Need interpretability: Classical models or tree models with feature importance
Start simple. Add complexity only if simpler methods fail. ARIMA often beats fancy neural networks on small datasets.