Time Series
Predicting future values based on sequential temporal data
Time series forecasting predicts future values based on past observations ordered in time. Unlike standard supervised learning, the sequence matters—you can't shuffle rows or treat observations as independent.
Time series forecasting uses historical data (blue) to predict future values (orange). The model learns patterns from the past and projects them forward, with uncertainty typically increasing further into the future.
What Makes Time Series Different
Temporal dependency: Future values depend on past values. Today's sales relate to yesterday's, last week's, and seasonal patterns.
Ordering matters: You can't randomly split or shuffle the data. Time flows forward. Training happens on earlier periods, testing on later ones.
Multiple components: Most series combine trend (long-term direction), seasonality (repeating patterns), cycles (irregular longer-term patterns), and noise (randomness).
Frequency: Data collected daily, hourly, monthly, etc. The frequency determines what patterns you can detect.
Types of Time Series
Univariate: Only the target variable—predict tomorrow's sales using past sales.
Multivariate: Additional time-dependent features—prices, promotions, weather, competitor activity.
Regular: Evenly spaced timestamps (every hour, every day).
Irregular: Missing timestamps, uneven spacing.
Key Concepts
Stationarity
A stationary series has constant mean, constant variance, and stable relationships between values at different lags. Most forecasting models assume stationarity.
Why it matters: Models learn patterns that stay consistent over time. Non-stationary series have shifting patterns that confuse models.
How to detect:
- Visual: Plot the series. Rising/falling trend = non-stationary. Growing variance = non-stationary.
- Autocorrelation: An ACF that decays slowly and stays high at large lags suggests non-stationarity
- Tests: Augmented Dickey-Fuller (ADF) tests for a unit root; KPSS tests the null of stationarity (see the sketch after the fixes below)
How to fix:
- Differencing: Subtract the previous value to remove trend
- Seasonal differencing: Subtract the value from the same point in the previous season (lag 7 for daily data with a weekly cycle, lag 12 for monthly data with a yearly cycle)
- Log transform: Stabilize variance when it grows with level
- Box-Cox transform: General variance stabilization
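A minimal sketch of the detect-and-fix loop with statsmodels, run on a synthetic daily series; the thresholds and the lag-7 seasonal difference are illustrative, not a recipe.

```python
# Sketch: detect non-stationarity with ADF/KPSS, then difference.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, kpss

# Synthetic daily series with a trend and a weekly pattern
idx = pd.date_range("2023-01-01", periods=365, freq="D")
rng = np.random.default_rng(0)
y = pd.Series(
    0.1 * np.arange(365)                          # trend
    + 5 * np.sin(2 * np.pi * np.arange(365) / 7)  # weekly seasonality
    + rng.normal(0, 1, 365),
    index=idx,
)

adf_p = adfuller(y, autolag="AIC")[1]              # H0: unit root (non-stationary); small p suggests stationary
kpss_p = kpss(y, regression="c", nlags="auto")[1]  # H0: stationary; small p suggests non-stationary
print(f"ADF p-value: {adf_p:.3f}, KPSS p-value: {kpss_p:.3f}")

# First difference removes the trend, a lag-7 difference removes the weekly pattern
y_stationary = y.diff().diff(7).dropna()
```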
Decomposition
Break the series into components:
Additive: Y = Trend + Seasonal + Noise (seasonal magnitude stays constant)
Multiplicative: Y = Trend × Seasonal × Noise (seasonal magnitude scales with level)
Methods:
- Classical decomposition: Moving averages extract the trend, then per-season averages estimate the seasonal component
- STL (Seasonal-Trend Loess): More flexible, handles changing seasonality
Decomposition separates a time series into its constituent parts. The original series (top) is broken down into trend (long-term direction), seasonal (repeating patterns), and residual (noise). Understanding these components helps choose appropriate modeling approaches and diagnose problems.
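A short sketch using statsmodels' STL, assuming a pandas Series y with a DatetimeIndex (such as the synthetic series in the earlier sketch).

```python
# Sketch: STL decomposition; period=7 assumes weekly seasonality in daily data.
from statsmodels.tsa.seasonal import STL

res = STL(y, period=7, robust=True).fit()
trend, seasonal, resid = res.trend, res.seasonal, res.resid
res.plot()   # draws observed, trend, seasonal, and residual panels
```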
Autocorrelation
Correlation between the series and lagged versions of itself.
ACF (Autocorrelation Function): Shows correlation at each lag. Peaks indicate repeating patterns or dependencies.
PACF (Partial Autocorrelation Function): Direct correlation after removing indirect effects of shorter lags. Helps determine autoregressive order.
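A quick way to inspect both, continuing with the same series y.

```python
# Sketch: ACF/PACF plots with statsmodels.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(y, lags=30)    # spikes at lags 7, 14, ... hint at weekly seasonality
plot_pacf(y, lags=30)   # significant early lags suggest the AR order p
plt.show()
```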
Classical Statistical Models
Autoregressive (AR)
Current value depends on a weighted sum of recent past values:
y(t) = c + φ1·y(t-1) + ... + φp·y(t-p) + ε(t)
Order p determined by the PACF (it cuts off after lag p).
Moving Average (MA)
Current value depends on past forecast errors:
y(t) = c + ε(t) + θ1·ε(t-1) + ... + θq·ε(t-q)
Order q determined by the ACF (it cuts off after lag q).
ARIMA
Combines AR, differencing (I for Integrated), and MA:
ARIMA(p, d, q):
- p: AR order (how many past values)
- d: Differencing order (how many times to difference)
- q: MA order (how many past errors)
SARIMA: Adds seasonal components for repeating patterns.
SARIMAX: Adds external regressors (prices, promotions).
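A minimal sketch with statsmodels' SARIMAX; the orders here are illustrative, not tuned.

```python
# Sketch: seasonal ARIMA on the daily series y from the earlier sketches.
from statsmodels.tsa.statespace.sarimax import SARIMAX

# (p, d, q) = non-seasonal AR, differencing, MA orders
# (P, D, Q, s) = seasonal counterparts; s = 7 for a weekly cycle in daily data
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7))
# pass exog=... here to add external regressors (the SARIMAX case)
fit = model.fit(disp=False)

point_forecast = fit.forecast(steps=14)           # 14 days ahead
conf_int = fit.get_forecast(steps=14).conf_int()  # intervals widen with horizon
```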
Exponential Smoothing
Updates estimates gradually, weighting recent observations more:
Simple: Level only (no trend or seasonality)
Holt's Method: Level + trend
Holt-Winters: Level + trend + seasonality
Flexible and intuitive. No stationarity required.
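A minimal Holt-Winters sketch with statsmodels; additive trend and seasonality, and seasonal_periods=7 assumes daily data with a weekly cycle.

```python
# Sketch: Holt-Winters exponential smoothing on the series y.
from statsmodels.tsa.holtwinters import ExponentialSmoothing

hw = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=7).fit()
hw_forecast = hw.forecast(14)   # level, trend, and seasonal states projected 14 steps ahead
```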
Machine Learning Approaches
ML models don't inherently understand time. You must engineer features.
Feature Engineering
Lag features: Yesterday's value, last week's value, etc.
Rolling statistics: 7-day mean, 7-day std, rolling max/min
Calendar features: Day of week, month, quarter, holiday indicators
Fourier features: Sine/cosine to encode smooth cyclical patterns
External variables: Prices, promotions, weather, events
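A short pandas sketch that turns a series into a supervised-learning table; the column names are illustrative.

```python
# Sketch: lag, rolling, and calendar features from the series y.
import pandas as pd

df = y.to_frame("sales")

# Lag features: values from 1 day and 1 week ago
for lag in (1, 7):
    df[f"lag_{lag}"] = df["sales"].shift(lag)

# Rolling statistics computed on past values only (shift before rolling)
df["roll_mean_7"] = df["sales"].shift(1).rolling(7).mean()
df["roll_std_7"] = df["sales"].shift(1).rolling(7).std()

# Calendar features
df["dayofweek"] = df.index.dayofweek
df["month"] = df.index.month

df = df.dropna()   # drop rows where lags/rolling windows are undefined
```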
Tree-Based Models
Random Forest, XGBoost, LightGBM work well when:
- Many external features drive the target
- Nonlinear interactions matter
- You need to model thousands of series efficiently
Strengths: Fast, scalable, handle interactions naturally
Weaknesses: Don't understand sequences without explicit lag features
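A short sketch training a gradient-boosted tree on the engineered table from the previous sketch, using scikit-learn's HistGradientBoostingRegressor as a stand-in for XGBoost/LightGBM.

```python
# Sketch: chronological split, then a boosted-tree forecaster on lag/calendar features.
from sklearn.ensemble import HistGradientBoostingRegressor

X, target = df.drop(columns="sales"), df["sales"]
split = int(len(df) * 0.8)                 # no shuffling: earlier rows train, later rows test
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = target.iloc[:split], target.iloc[split:]

model = HistGradientBoostingRegressor(max_iter=300)
model.fit(X_train, y_train)
preds = model.predict(X_test)
```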
Neural Networks
Feed-forward NNs: Similar to tree models, need engineered features
RNNs/LSTMs: Process sequences directly, maintain internal memory for long dependencies
Seq2Seq: Encoder-decoder for multi-step forecasts
Temporal CNNs: Convolutional filters across time, fast and parallel
Transformers: Attention mechanisms focus on relevant past time steps. Powerful but data-hungry.
When to use: Complex patterns, long sequences, many interacting variables, large datasets
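A minimal sketch of an RNN-style forecaster in PyTorch; the layer sizes and the one-step-ahead setup are illustrative.

```python
# Sketch: LSTM that maps a window of past values to the next value.
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, hidden_size: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, window_length, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # predict one step ahead from the last hidden state

# Training would slide a fixed-length window over the (scaled) series to build
# (window, next_value) pairs, then minimize MSE with an optimizer such as Adam.
```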
Evaluation
Never shuffle data. Train on earlier periods, test on later periods.
Metrics
MAE (Mean Absolute Error): Average error in original units. Easy to interpret, stable.
RMSE (Root Mean Squared Error): Penalizes large errors more. Use when big mistakes are costly.
MAPE (Mean Absolute Percentage Error): Error as percentage. Business-friendly. Unstable near zero.
sMAPE (Symmetric MAPE): Divides by average of actual and forecast. More stable than MAPE near zero.
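The four metrics as plain NumPy functions, for reference; y_true and y_pred are arrays of actuals and forecasts.

```python
# Sketch: forecast error metrics.
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100   # blows up when y_true is near zero

def smape(y_true, y_pred):
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    return np.mean(np.abs(y_true - y_pred) / denom) * 100
```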
Validation
Rolling window: Train on Jan-Jun, forecast July. Then train on Jan-Jul, forecast August. Repeat.
Expanding window: Keep the start fixed and add each new period to the training set.
Both schemes simulate real deployment, where models are retrained as new data arrives.
Rolling window validation splits data chronologically into multiple train-test pairs. Each iteration moves forward in time, training on past data and testing on the immediate future. This approach provides realistic error estimates that reflect how the model will perform in production.
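One way to implement this is scikit-learn's TimeSeriesSplit, which produces chronological train/test folds; the sketch below reuses X, target, model, and mae from the earlier sketches.

```python
# Sketch: expanding-window cross-validation.
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5, test_size=30)   # five folds, 30-step test windows
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model.fit(X.iloc[train_idx], target.iloc[train_idx])
    preds = model.predict(X.iloc[test_idx])
    print(f"fold {fold}: MAE = {mae(target.iloc[test_idx].to_numpy(), preds):.2f}")
```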
Horizon
Accuracy decreases with the forecast horizon: predicting one step ahead is easier than predicting 30 steps ahead. Evaluate errors at each horizon separately.
Common Pitfalls
Data leakage: Using future information. Happens with:
- Rolling windows that include future values
- Scaling with statistics computed on the full dataset (including the test period)
- Wrong lag shifts
Solution: All transformations must use only past data. Recompute statistics chronologically.
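A short sketch of the scaling and rolling-window bugs and their fixes, reusing X_train, X_test, and df from the earlier sketches.

```python
# Sketch: two common leakage bugs and their corrections.
from sklearn.preprocessing import StandardScaler

# Wrong: the scaler sees the test period
# scaler = StandardScaler().fit(X)
# Right: fit on the training period only, then apply to both
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Wrong: today's value leaks into today's "past" average
# df["roll_7"] = df["sales"].rolling(7).mean()
# Right: shift first so the window ends yesterday
df["roll_7"] = df["sales"].shift(1).rolling(7).mean()
```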
Ignoring seasonality: Weekly, monthly, yearly patterns matter. Model them explicitly or remove them.
Treating time as a feature: Don't feed the raw timestamp directly into tree models. Extract meaningful features (day of week, month, etc.).
Not monitoring drift: Patterns change. Customer behavior shifts. Retrain regularly and monitor forecast accuracy.
Choosing an Approach
Simple, stable patterns: Exponential smoothing or ARIMA
Strong external drivers: Tree-based ML (XGBoost, LightGBM)
Complex sequences, long dependencies: RNNs, LSTMs, Transformers
Multiple interacting seasonalities: SARIMA or neural networks
Large scale (thousands of series): Tree models or simple methods that parallelize
Need interpretability: Classical models or tree models with feature importance
Start simple. Add complexity only if simpler methods fail. ARIMA often beats fancy neural networks on small datasets.