Association Analysis

Discovering patterns and relationships in transaction data

Association analysis is an unsupervised learning task that discovers relationships between items that frequently occur together in transactions. Unlike classification or regression, there's no target variable—the goal is to uncover hidden patterns and associations in transactional data.

Training Association Models

To train association models, see the Association Model Training Guide, which documents the parameters of all five available algorithms: Apriori, FP-Growth, Eclat, Relim, and FPMax.

What Is Association Analysis

Association analysis identifies patterns where items tend to appear together more frequently than would be expected by chance. The most famous application is market basket analysis, which answers questions like "What products do customers buy together?"

Common use cases:

  • Product recommendation systems
  • Store layout optimization
  • Cross-selling and promotional bundling
  • Customer behavior analysis
  • Web page navigation patterns
  • Medical diagnosis (symptom combinations)

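Classic Example: In the often-retold "beer and diapers" story, a retailer found that customers who bought diapers often bought beer in the same transaction. The details tend to be embellished in the retelling, but the story illustrates the kind of unexpected, actionable pattern that can inform store layout and promotions.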

Key Concepts

Transactions and Items

Transaction: A collection of items that occur together in a single event. Examples:

  • Shopping cart: All items purchased together
  • Web session: Pages visited in one session
  • Medical record: Symptoms or treatments for one patient

Item: A discrete entity that can appear in transactions. Items are typically:

  • Products (SKUs) in retail
  • Web pages or features in clickstream data
  • Symptoms or medications in healthcare
  • Words in text documents

Itemsets

Itemset: A collection of one or more items.

  • 1-itemset: [Bread]
  • 2-itemset: [Bread, Butter]
  • 3-itemset: [Bread, Butter, Milk]

Frequent Itemset: An itemset whose support (the fraction of transactions that contain it) is at least the min_support threshold.

Association Rules

Rule Format: X -> Y (read as "if X then Y")

  • Antecedent (X): The "if" part—items on the left side
  • Consequent (Y): The "then" part—items on the right side
  • Example: [Bread, Butter] -> [Milk]
  • Meaning: Customers who buy bread and butter also tend to buy milk

Key Point: Rules are directional for measurement purposes, but don't imply causation. [Bread] -> [Butter] and [Butter] -> [Bread] are different rules with potentially different confidence values.
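
To make the directionality concrete, the following minimal Python sketch enumerates every antecedent/consequent split of a single itemset. The function name candidate_rules is hypothetical and chosen only for illustration; it does not belong to any particular library.

```python
from itertools import combinations

def candidate_rules(itemset):
    """List every (antecedent, consequent) split of an itemset.

    Both sides must be non-empty, so a k-itemset yields 2**k - 2 rules.
    """
    items = sorted(itemset)
    rules = []
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules

for antecedent, consequent in candidate_rules({"Bread", "Butter", "Milk"}):
    print(list(antecedent), "->", list(consequent))
```

Each of the six printed rules is scored separately, which is why [Bread] -> [Butter] and [Butter] -> [Bread] can end up with different confidence values.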

Understanding Association Metrics

Association rules are evaluated using several metrics that measure different aspects of the relationship.

Support

What it measures: How frequently an itemset appears in the data.

Formula: support(X) = (# transactions containing X) / (total # transactions)

Example:

  • 1000 transactions total
  • [Bread, Milk] appears in 150 transactions
  • support([Bread, Milk]) = 150/1000 = 0.15 = 15%

Why it matters: Support filters out rare patterns that might be noise. A pattern backed by only a handful of transactions may be spurious or too rare to act on, which is why a sensible threshold depends on the size of the dataset.

Typical thresholds:

  • Large datasets (>10k transactions): 0.001-0.01 (0.1%-1%)
  • Medium datasets: 0.01-0.05 (1%-5%)
  • Small datasets: 0.05-0.1 (5%-10%)
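
The support formula can be sketched in a few lines of Python. The data below is hypothetical and constructed only to reproduce the 150-out-of-1000 example; the function name support is an illustrative choice.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

# Hypothetical data: 1000 transactions, 150 of which contain Bread and Milk.
transactions = (
    [{"Bread", "Milk"}] * 150
    + [{"Bread"}] * 350
    + [{"Eggs"}] * 500
)

print(support({"Bread", "Milk"}, transactions))  # 0.15
print(support({"Bread"}, transactions))          # 0.5
```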

Confidence

What it measures: The reliability of the rule—how often Y appears when X appears.

Formula: confidence(X -> Y) = support(X ∪ Y) / support(X), where support(X ∪ Y) is the fraction of transactions that contain every item of both X and Y

Example:

  • support([Bread]) = 0.50 (50% of transactions)
  • support([Bread, Butter]) = 0.30 (30% of transactions)
  • confidence(Bread -> Butter) = 0.30 / 0.50 = 0.60 = 60%

Interpretation: 60% of customers who buy bread also buy butter.

Limitation: High confidence doesn't always mean strong association. If butter appears in 60% of all transactions anyway, this rule isn't particularly informative.
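
As a rough illustration, confidence can be computed directly from transaction counts. The ten transactions below are hypothetical and chosen so that Bread appears in 50% and Bread with Butter in 30% of them; the helper names count and confidence are assumptions made for this sketch.

```python
# Hypothetical toy data: 10 transactions, Bread in 5, Bread and Butter together in 3.
transactions = (
    [{"Bread", "Butter"}] * 3
    + [{"Bread"}] * 2
    + [{"Butter"}] * 1
    + [{"Milk"}] * 4
)

def count(itemset, transactions):
    """Number of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent, transactions):
    """confidence(X -> Y) = count(X and Y together) / count(X)."""
    x_count = count(antecedent, transactions)
    if x_count == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    both = count(set(antecedent) | set(consequent), transactions)
    return both / x_count

print(confidence({"Bread"}, {"Butter"}, transactions))  # 0.6
print(confidence({"Butter"}, {"Bread"}, transactions))  # 0.75
```

The two printed values differ because the denominator changes with the antecedent, which is the directionality noted under Association Rules.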

Lift

What it measures: How much more likely Y is to appear with X compared to Y's baseline frequency.

Formula: lift(X -> Y) = confidence(X -> Y) / support(Y)

Example:

  • confidence(Bread -> Butter) = 0.60
  • support(Butter) = 0.40 (40% of all transactions contain butter)
  • lift(Bread -> Butter) = 0.60 / 0.40 = 1.5

Interpretation:

  • lift = 1.0: X and Y are independent (no association)
  • lift > 1.0: Positive association (Y more likely with X)
    • 1.5 = 50% increase in likelihood
    • 2.0 = 100% increase (twice as likely)
  • lift < 1.0: Negative association (Y less likely with X)

Why lift is crucial: It accounts for item popularity. A rule with 90% confidence but lift of 1.0 means the consequent is just a popular item, not meaningfully associated with the antecedent.

Best for discovery: Lift is symmetric [lift(X -> Y) = lift(Y -> X)] and identifies true associations rather than popular items.
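
A minimal sketch of lift, using the same kind of hypothetical toy data (Bread in 5 of 10 transactions, Butter in 4, both together in 3). It works with raw counts, which is algebraically the same as confidence divided by support(Y) and avoids floating-point rounding in the intermediate ratios. The helper names are illustrative.

```python
# Hypothetical toy data: 10 transactions.
transactions = (
    [{"Bread", "Butter"}] * 3
    + [{"Bread"}] * 2
    + [{"Butter"}] * 1
    + [{"Milk"}] * 4
)

def count(itemset, transactions):
    """Number of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def lift(x, y, transactions):
    """lift(X -> Y) = confidence(X -> Y) / support(Y).

    Rewritten with counts: (count(X and Y) * N) / (count(X) * count(Y)).
    Assumes X and Y each occur at least once.
    """
    n = len(transactions)
    both = count(set(x) | set(y), transactions)
    return (both * n) / (count(x, transactions) * count(y, transactions))

print(lift({"Bread"}, {"Butter"}, transactions))  # 1.5
print(lift({"Butter"}, {"Bread"}, transactions))  # 1.5, same value in both directions
print(lift({"Bread"}, {"Milk"}, transactions))    # 0.0, never bought together here
```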

Other Metrics

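Leverage: The gap between how often X and Y actually occur together and how often they would if independent: leverage(X -> Y) = support(X ∪ Y) - support(X) × support(Y). Zero means independence; positive values indicate positive association.

Conviction: Compares how often the rule would fail under independence with how often it actually fails: conviction(X -> Y) = (1 - support(Y)) / (1 - confidence(X -> Y)). A value of 1 means independence, values > 1 indicate that Y depends on X, and infinity means the rule never fails (confidence = 1).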
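
A short sketch of both metrics, assuming the standard definitions leverage = support(X ∪ Y) - support(X) × support(Y) and conviction = (1 - support(Y)) / (1 - confidence(X -> Y)). The inputs reuse the Bread and Butter numbers from the earlier examples; the function names are illustrative.

```python
import math

def leverage(support_xy, support_x, support_y):
    """Observed co-occurrence minus the co-occurrence expected under independence."""
    return support_xy - support_x * support_y

def conviction(support_y, confidence_xy):
    """Ratio of expected to observed rule failures; infinite when the rule never fails."""
    if confidence_xy == 1.0:
        return math.inf
    return (1 - support_y) / (1 - confidence_xy)

# Bread -> Butter: support(Bread, Butter) = 0.30, support(Bread) = 0.50,
# support(Butter) = 0.40, confidence = 0.60.
print(round(leverage(0.30, 0.50, 0.40), 3))  # 0.1
print(round(conviction(0.40, 0.60), 3))      # 1.5
print(conviction(0.40, 1.0))                 # inf
```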

Market Basket Analysis

Market basket analysis is the most common application of association analysis, focused on retail transactions.

The Goal

Understand which products are purchased together to:

  • Recommend complementary products
  • Design promotional bundles (e.g., "buy bread, get 20% off butter")
  • Optimize store layout (place associated items near each other)
  • Plan inventory (stock complementary items together)
  • Create targeted marketing campaigns

Types of Associations

Complementary items: Products used together

  • [Toothbrush] -> [Toothpaste]
  • [Burger buns] -> [Ground beef]
  • [Pasta] -> [Pasta sauce]

Substitute items: Products rarely bought together (negative association)

  • [Coke] and [Pepsi] (lift < 1.0)
  • [iPhone] and [Android phone]

Unexpected associations: Surprising patterns that require investigation

  • [Diapers] -> [Beer] (lifestyle factors)
  • [Batteries] -> [Toys] (seasonal Christmas shopping)

When to Use Association Analysis

Good fit:

  • Transaction data with multiple items per transaction
  • Want to discover patterns without predefined target
  • Need recommendations or cross-selling strategies
  • Interested in understanding co-occurrence patterns
  • Have at least hundreds of transactions

Poor fit:

  • Single-item transactions (no co-occurrence to analyze)
  • Time-series prediction (use time series methods instead)
  • Classification into predefined categories (use classification)
  • Very few transactions (< 100) or very few items
  • Need to predict specific outcomes (use supervised learning)

Choosing Between Algorithms

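For a given min_support, Apriori, FP-Growth, Eclat, and Relim find the same frequent itemsets and therefore the same rules; they differ in strategy and performance. FPMax is the exception: it returns only the maximal frequent itemsets.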

Algorithm Comparison

Apriori:

  • Strategy: Breadth-first, generate and test candidates
  • Best for: Learning, understanding the fundamentals
  • Speed: Slower on large datasets
  • Memory: Can be high with low support
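
The generate-and-test strategy can be sketched as follows. This is a simplified, unoptimized illustration of the level-wise idea on hypothetical toy data, not the implementation used by any particular tool.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search: build (k+1)-candidates only from frequent k-itemsets."""
    n = len(transactions)
    items = set().union(*transactions)
    current = {frozenset([i]) for i in items}  # level 1 candidates
    frequent = {}
    k = 1
    while current:
        # Test step: count each candidate, keep those meeting min_support.
        level = {}
        for cand in current:
            hits = sum(1 for t in transactions if cand <= t)
            if hits / n >= min_support:
                level[cand] = hits / n
        frequent.update(level)
        # Generate step: join frequent k-itemsets into (k+1)-candidates and
        # prune any candidate that has an infrequent k-subset.
        k += 1
        current = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                current.add(union)
    return frequent

# Hypothetical toy data.
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Milk"},
    {"Butter", "Milk"},
    {"Bread", "Butter", "Milk", "Eggs"},
]

for itemset, sup in apriori(transactions, min_support=0.6).items():
    print(sorted(itemset), sup)
```

Lowering min_support lets more candidates survive each level, which is where the memory cost noted above comes from.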

FP-Growth:

  • Strategy: Tree-based, no candidate generation
  • Best for: Production use, most datasets
  • Speed: Fastest for most scenarios
  • Memory: Efficient

Eclat:

  • Strategy: Vertical format, set intersection
  • Best for: Sparse data (many items, few per transaction)
  • Speed: Fast for sparse datasets
  • Memory: Higher for dense data
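
The vertical format and set-intersection idea can be illustrated with a simplified sketch. Each item maps to the set of transaction indices that contain it (often called a tid-list), and the support of a larger itemset comes from intersecting those sets. The data is hypothetical, and the code omits the optimizations a real implementation would use.

```python
def eclat(transactions, min_support):
    """Depth-first search over tid-lists (vertical format)."""
    n = len(transactions)
    # Vertical format: item -> set of transaction indices containing it.
    tidlists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, prefix_tids, candidates):
        for i, (item, tids) in enumerate(candidates):
            # Support of prefix + item comes from a single set intersection.
            new_tids = tids if prefix_tids is None else prefix_tids & tids
            if len(new_tids) / n >= min_support:
                itemset = prefix | {item}
                frequent[itemset] = len(new_tids) / n
                # Only extend with items that come later, so each itemset is built once.
                extend(itemset, new_tids, candidates[i + 1:])

    extend(frozenset(), None, sorted(tidlists.items(), key=lambda kv: kv[0]))
    return frequent

# Hypothetical toy data.
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Milk"},
    {"Butter", "Milk"},
    {"Bread", "Butter", "Milk", "Eggs"},
]

for itemset, sup in eclat(transactions, min_support=0.6).items():
    print(sorted(itemset), sup)
```

With sparse data the tid-lists stay short, so the intersections are cheap; with dense data they approach the full transaction count, which matches the memory note above.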

Relim:

  • Strategy: Recursive elimination
  • Best for: Memory-constrained environments
  • Speed: Moderate
  • Memory: Most memory-efficient

FPMax:

  • Strategy: Finds only maximal itemsets
  • Best for: Compact output, overview of longest patterns
  • Speed: Faster than finding all itemsets
  • Memory: Lower output size
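
To show what "maximal" means, the sketch below filters an already-mined list of frequent itemsets down to those with no frequent proper superset. It illustrates the shape of FPMax's output only; it is a post-hoc filter on hypothetical data, not the FPMax algorithm itself.

```python
def maximal_itemsets(frequent):
    """Keep only itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]

# Hypothetical frequent itemsets, as another algorithm might return them.
frequent = [
    frozenset({"Bread"}),
    frozenset({"Butter"}),
    frozenset({"Milk"}),
    frozenset({"Jam"}),
    frozenset({"Bread", "Butter"}),
    frozenset({"Bread", "Milk"}),
    frozenset({"Butter", "Milk"}),
]

for itemset in maximal_itemsets(frequent):
    print(sorted(itemset))
```

Here the three pairs and the lone [Jam] survive, while [Bread], [Butter], and [Milk] are dropped because each is contained in a frequent pair. A maximal itemset is therefore not necessarily a long one.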

Quick Selection Guide

Start with FP-Growth for most use cases. Consider alternatives if:

  • Learning the concepts → Apriori
  • Sparse data (e.g., supermarket with 10k SKUs, baskets of 10 items) → Eclat
  • Memory constraints → Relim
  • Only want longest patterns → FPMax

Practical Considerations

Data Format

Association algorithms require transaction data in one of two formats:

Wide format:

  • Each column is an item
  • Each row is a transaction
  • Values are 1 (present) or 0 (absent)

Long format:

  • Each row is one item in a transaction
  • Transaction ID column groups items
  • More natural for real-world data

Long format is what most source systems produce (database tables, CSV exports), so it usually needs the least preparation.
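
Converting between the two formats is straightforward. The sketch below turns hypothetical long-format rows into a wide 0/1 table using only the standard library; the variable names are illustrative.

```python
from collections import defaultdict

# Hypothetical long-format rows: (transaction_id, item).
long_rows = [
    (1, "Bread"), (1, "Butter"),
    (2, "Bread"), (2, "Milk"),
    (3, "Butter"),
]

# Group items by transaction ID.
baskets = defaultdict(set)
for tid, item in long_rows:
    baskets[tid].add(item)

# Wide format: one row per transaction, one 0/1 column per item.
columns = sorted({item for _, item in long_rows})
wide = {
    tid: [1 if col in items else 0 for col in columns]
    for tid, items in sorted(baskets.items())
}

print(columns)
for tid, row in wide.items():
    print(tid, row)
```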

Parameter Tuning

Minimum Support:

  • Too low: Too many patterns, slow computation, noise
  • Too high: Miss interesting rare patterns
  • Start: 0.02 (2%), adjust based on results

Maximum Itemset Length:

  • 2: Pairwise associations only (easiest to interpret)
  • 3: Include three-way patterns (typical max)
  • 4+: Harder to interpret, exponentially more patterns
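
The growth in possible itemsets can be checked with a quick calculation. The catalog size of 100 items is a hypothetical figure; the counts are upper bounds on candidates before any support pruning.

```python
from math import comb

n_items = 100  # hypothetical catalog size

# Number of distinct itemsets of each length that could exist.
for k in (2, 3, 4):
    print(k, comb(n_items, k))
```

For 100 items this gives 4,950 possible pairs, 161,700 triples, and 3,921,225 four-item sets, which is why capping the length at 2 or 3 keeps results manageable.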

Rule Metric:

  • Lift: Best for discovery (accounts for popularity)
  • Confidence: For reliability requirements
  • Leverage/Conviction: Alternative strength measures

Filtering and Validation

Initial filtering:

  • Set min_support to reduce candidates
  • Use min_lift > 1.5 for strong associations
  • Set min_confidence > 0.5 for reliable rules

Post-processing:

  • Remove trivial rules (obvious associations)
  • Sort by lift to find strongest associations
  • Focus on actionable patterns
  • Validate with domain experts

Common pitfalls:

  • High confidence + low lift = popular item, not true association
  • Very low support = might be noise
  • Contradicts domain knowledge = investigate or discard
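
The filtering and sorting steps can be sketched as below. The rules and their metric values are hypothetical, and the thresholds are treated as inclusive lower bounds, which is an assumption of this sketch.

```python
# Hypothetical rules with precomputed metrics.
rules = [
    {"rule": "Bread -> Butter", "support": 0.30, "confidence": 0.60, "lift": 1.8},
    {"rule": "Eggs -> Milk", "support": 0.20, "confidence": 0.90, "lift": 1.0},
    {"rule": "Pasta -> Pasta sauce", "support": 0.05, "confidence": 0.70, "lift": 3.5},
    {"rule": "Caviar -> Champagne", "support": 0.001, "confidence": 0.80, "lift": 40.0},
]

def filter_rules(rules, min_support=0.02, min_confidence=0.5, min_lift=1.5):
    """Keep rules that pass all three thresholds, strongest lift first."""
    kept = [
        r for r in rules
        if r["support"] >= min_support
        and r["confidence"] >= min_confidence
        and r["lift"] >= min_lift
    ]
    return sorted(kept, key=lambda r: r["lift"], reverse=True)

for r in filter_rules(rules):
    print(r["rule"], r["lift"])
```

In this example "Eggs -> Milk" is dropped despite 90% confidence because its lift is 1.0, and "Caviar -> Champagne" is dropped despite its high lift because its support is too low.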

Evaluation

Association analysis has no single accuracy metric. Evaluate based on:

Internal quality:

  • Are patterns frequent enough to be actionable?
  • Do rules have strong lift values (> 1.5)?
  • Are confidence levels adequate?

Business value:

  • Are patterns surprising and useful?
  • Can insights drive actions (recommendations, promotions)?
  • Do they align with domain knowledge?

Experimental validation:

  • A/B test recommendations
  • Measure conversion rates
  • Track bundle sales performance

Example Workflow

1. Prepare Transaction Data

  • Format: Transaction ID + Items
  • Clean: Remove returns, test orders
  • Filter: Focus on specific product categories or time periods

2. Choose Algorithm

  • Most cases: FP-Growth
  • Learning: Apriori
  • Sparse data: Eclat

3. Set Initial Parameters

min_support = 0.02 (2%)
max_length = 3
rule_metric = lift
min_threshold = 1.5 (minimum value of rule_metric, so here a minimum lift of 1.5)

4. Mine Patterns

Run the algorithm and examine results:

  • How many itemsets found?
  • How many rules generated?
  • What's the distribution of lift values?
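
The three questions above can be answered with a small end-to-end sketch. It uses a brute-force miner as a stand-in for FP-Growth, hypothetical toy data, and toy-scale parameter values (a min_support of 0.02 would be meaningless on six rows). All function and variable names are illustrative.

```python
from itertools import combinations
from statistics import median

# Toy-scale stand-ins for the parameters from step 3.
params = {"min_support": 0.3, "max_length": 3, "min_lift": 1.5}

transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Eggs"},
    {"Milk", "Eggs"},
    {"Bread", "Milk"},
]

def mine_counts(transactions, min_support, max_length):
    """Brute force: count every itemset up to max_length that meets min_support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    counts = {}
    for k in range(1, max_length + 1):
        for combo in combinations(items, k):
            hits = sum(1 for t in transactions if set(combo) <= t)
            if hits / n >= min_support:
                counts[frozenset(combo)] = hits
    return counts

def rules_with_lift(counts, n):
    """Split each frequent itemset into X -> Y and compute lift from counts."""
    rules = []
    for itemset, c_xy in counts.items():
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                x = frozenset(antecedent)
                y = itemset - x
                # Subsets of a frequent itemset are frequent, so both lookups succeed.
                lift = (c_xy * n) / (counts[x] * counts[y])
                rules.append((tuple(sorted(x)), tuple(sorted(y)), lift))
    return rules

counts = mine_counts(transactions, params["min_support"], params["max_length"])
rules = rules_with_lift(counts, len(transactions))
strong = [r for r in rules if r[2] >= params["min_lift"]]
lifts = sorted(r[2] for r in rules)

print("itemsets found:", len(counts))
print("rules generated:", len(rules))
print("rules meeting min_lift:", len(strong))
print("lift min/median/max:", lifts[0], median(lifts), lifts[-1])
```

On this toy data the sketch reports 7 itemsets, 6 rules, and 4 rules at or above the lift threshold.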

5. Adjust and Refine

Too few patterns:

  • Lower min_support to 0.01
  • Reduce min_lift threshold
  • Increase max_length

Too many patterns:

  • Increase min_support to 0.05
  • Increase min_lift to 2.0
  • Enable advanced filtering (confidence + lift)
  • Focus on specific categories

6. Validate and Apply

  • Review top rules by lift
  • Validate with domain experts
  • Filter for actionable insights
  • Implement recommendations or strategies
  • Measure business impact

Relationship to Other Tasks

vs. Clustering:

  • Clustering: Groups similar transactions or customers
  • Association: Finds item co-occurrence patterns within transactions
  • Can combine: Cluster customers, then find associations within each cluster

vs. Collaborative Filtering:

  • Collaborative Filtering: Recommends based on user-item ratings (e.g., "users like you also liked...")
  • Association: Recommends based on item co-occurrence (e.g., "people who bought X also bought Y")
  • Use together: Blend both approaches for recommendations

vs. Sequential Pattern Mining:

  • Sequential: Finds patterns in ordered sequences (e.g., page A → page B → page C)
  • Association: Finds co-occurrence regardless of order
  • Choose sequential when: Order matters (clickstreams, customer journeys)

Common Applications

Retail:

  • Product recommendations
  • Bundle promotions
  • Store layout optimization
  • Inventory management

E-commerce:

  • "Frequently bought together" suggestions
  • Personalized product recommendations
  • Cart completion prompts

Healthcare:

  • Disease-symptom associations
  • Drug interaction patterns
  • Treatment protocol analysis

Web Analytics:

  • Page navigation patterns
  • Feature usage combinations
  • User behavior clustering

Finance:

  • Fraud detection (unusual transaction patterns)
  • Service bundle recommendations
  • Cross-selling financial products

Association analysis is exploratory and insight-driven. The goal is to discover actionable patterns that weren't obvious beforehand. Success depends on domain knowledge, proper filtering, and validation through real-world testing.

