FP-Growth
Fast algorithm that uses an FP-tree (Frequent Pattern tree) data structure for efficient pattern mining. It avoids candidate generation by compressing the database into a compact tree.
When to Use FP-Growth
- Best choice for most use cases
- Medium to large datasets (>10k transactions)
- Need faster performance than Apriori
- Production systems
Strengths
- Much faster than Apriori
- No candidate generation required
- Efficient memory use
- Scales well to large datasets
- Only two database scans
Weaknesses
- More complex to understand
- Less intuitive than Apriori
- Algorithm internals harder to explain
How it Works
- Scan database, count item frequencies
- Build FP-tree (compressed representation of database)
- Sort items by frequency in each transaction
- Share common prefixes in tree structure
- Mine frequent patterns from FP-tree recursively
- Extract conditional pattern bases
- Build conditional FP-trees
- Generate rules from patterns
Key Advantage: Only scans database twice (vs. multiple scans for Apriori). The FP-tree compresses the entire database into a compact tree structure, allowing pattern mining without repeated database access.
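The build-and-mine steps above can be sketched in pure Python. This is an illustration only: the names (`Node`, `build_tree`, `mine`) are ours, not any library's API, and a real implementation would add further optimizations (e.g. node-link chains in the header table).

```python
from collections import defaultdict

class Node:
    """One FP-tree node: an item, its count, parent link, and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(transactions, min_count):
    # Scan 1: count item frequencies, drop infrequent items
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_count}
    # Scan 2: insert each transaction with items sorted by frequency,
    # so transactions sharing frequent prefixes share tree paths
    root = Node(None, None)
    header = defaultdict(list)            # item -> all nodes for that item
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq),
                           key=lambda i: (-freq[i], i)):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return header, freq

def mine(header, freq, min_count, suffix=()):
    # Recursively extract conditional pattern bases and mine them
    patterns = {}
    for item in sorted(freq, key=freq.get):
        new_suffix = (item,) + suffix
        patterns[frozenset(new_suffix)] = freq[item]
        cond_base = []                    # prefix paths leading to `item`
        for node in header[item]:
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_base.extend([path] * node.count)
        patterns.update(mine(*build_tree(cond_base, min_count),
                             min_count, new_suffix))
    return patterns

transactions = [["Bread", "Milk"], ["Bread", "Butter"],
                ["Bread", "Milk", "Butter"], ["Milk", "Eggs"]]
patterns = mine(*build_tree(transactions, 2), 2)
# patterns maps each frequent itemset to its absolute support count
```

Note that `min_count` is the absolute form of the minimum support: min_support multiplied by the number of transactions.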
When to Choose Over Apriori
- Datasets >10k transactions
- Need faster results
- Low min_support values (finds rare patterns efficiently)
- Production systems where performance matters
- Large number of items or transactions
Parameters
All association algorithms share these common parameters:
Data Format
Input Format: 'long' or 'wide'
How your transaction data is structured:
Wide Format:
- Each column represents one item
- Each row is a transaction
- Values are 1 (item present) or 0 (item absent)
- Example:
TransactionID | Bread | Milk | Eggs | Butter
1             | 1     | 1    | 0    | 1
2             | 0     | 1    | 1    | 0
Long Format:
- Each row is one item in a transaction
- Requires Transaction ID column to group items
- More natural for real-world data
- Example:
TransactionID | Item
1             | Bread
1             | Milk
1             | Butter
2             | Milk
2             | Eggs
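As a sketch of how the two formats relate, long-format rows can be pivoted into a wide 0/1 matrix in a few lines of plain Python (in practice a dataframe library would do this; the variable names here are ours):

```python
# Long format: one (transaction ID, item) pair per row
long_rows = [(1, "Bread"), (1, "Milk"), (1, "Butter"),
             (2, "Milk"), (2, "Eggs")]

items = sorted({item for _, item in long_rows})      # wide-format columns
baskets = {}                                         # tid -> set of items
for tid, item in long_rows:
    baskets.setdefault(tid, set()).add(item)

# Wide format: one 0/1 row per transaction, columns follow `items`
wide = {tid: [1 if i in basket else 0 for i in items]
        for tid, basket in sorted(baskets.items())}
```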
Feature Configuration
Feature Columns (required)
- Wide format: List all item columns
- Long format: Select the single column containing item names
Transaction ID Column (required for long format) Column that identifies which transaction each item belongs to.
Contains Multiple Items (long format only) Check if a single row can contain multiple items (e.g., "Bread, Milk, Eggs").
Item Separator (if Contains Multiple Items is enabled) Character separating multiple items (default: comma).
- Example: "Bread, Milk, Eggs" uses "," as separator
Segmentation (Optional)
Segmentation Column Analyze different customer segments separately:
- Store locations (downtown vs. suburban)
- Customer types (premium vs. regular)
- Time periods (weekday vs. weekend)
Target Segment Value Filter to analyze only specific segment.
Model Parameters
Minimum Support (default: 0.02, required) Threshold for how frequently an itemset must appear.
- 0.02 = 2% of transactions
- Lower values: Find rare patterns, but slower and more results
- Higher values: Only common patterns, faster
- Recommendations:
- Large stores (>10k transactions): 0.001-0.01 (0.1%-1%)
- Medium stores: 0.01-0.05 (1%-5%)
- Small datasets: 0.05-0.1 (5%-10%)
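A quick sanity check when picking a threshold: a support fraction translates directly into a minimum absolute transaction count (a back-of-envelope sketch, using the documented default):

```python
n_transactions = 10_000
min_support = 0.02                       # the documented default (2%)
min_count = int(min_support * n_transactions)
# An itemset must appear in at least 200 of the 10,000 transactions
```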
Maximum Itemset Length (default: 3, required) Maximum number of items in a pattern.
- 2: Pairs only (A -> B)
- 3: Triples (A, B -> C)
- 4+: Complex patterns (slower, harder to interpret)
- Recommendations:
- Start with 2-3 for interpretability
- Increase only if needed
Rule Evaluation Metric (default: "lift", required) How to measure rule strength:
- lift: Strength of association (recommended)
- confidence: Reliability of rule
- leverage: Lift adjusted by item frequencies
- conviction: Dependency strength
Metric Threshold (default: 1.2, required) Minimum value for the selected metric to keep a rule.
- For lift: >1.0 (1.2 = 20% more likely)
- For confidence: 0.5-0.9 (50%-90% probability)
Advanced Filtering (Optional)
Enable Advanced Filtering Set both confidence and lift thresholds simultaneously for stricter rules.
Minimum Confidence (default: 0.6) Probability that Y is purchased given X is purchased.
- 0.6 = 60% of transactions with X also have Y
- Range: 0.1-1.0
Minimum Lift (default: 1.1) How much more likely Y is with X versus without X.
- 1.0 = No association (independent)
- 1.1 = 10% increase in likelihood
- 2.0 = 2x more likely
- Range: >0.0 (typically >1.0 for meaningful rules)
Understanding Association Metrics
Support
Definition: How frequently an itemset appears in the database.
Formula: support(X) = (transactions containing X) / (total transactions)
Example:
- 100 transactions total
- [Bread, Milk] appears in 20 transactions
- support([Bread, Milk]) = 20/100 = 0.2 = 20%
Interpretation:
- 0.01 (1%): Rare pattern
- 0.05 (5%): Moderate frequency
- 0.2 (20%): Very common pattern
Use: Filter out rare, potentially spurious patterns
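The worked example above, in code (support is a simple ratio):

```python
total_transactions = 100
containing_bread_milk = 20               # transactions with {Bread, Milk}
support = containing_bread_milk / total_transactions
# support is 0.2, i.e. 20% of transactions contain {Bread, Milk}
```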
Confidence
Definition: Probability of finding Y in transactions that contain X.
Formula: confidence(X -> Y) = support(X U Y) / support(X)
Example:
- support([Bread]) = 0.5 (50% of transactions)
- support([Bread, Butter]) = 0.3 (30% of transactions)
- confidence(Bread -> Butter) = 0.3 / 0.5 = 0.6 = 60%
Interpretation:
- 0.6 = 60% of customers who buy bread also buy butter
- Higher confidence = more reliable rule
Limitation: Can be misleading if Y is very common
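The worked example above, in code:

```python
support_bread = 0.5                      # 50% of transactions have Bread
support_bread_butter = 0.3               # 30% have both Bread and Butter
confidence = support_bread_butter / support_bread
# confidence is 0.6: 60% of bread buyers also buy butter
```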
Lift
Definition: How much more likely Y is with X versus without X.
Formula: lift(X -> Y) = confidence(X -> Y) / support(Y)
Example:
- confidence(Bread -> Butter) = 0.6
- support(Butter) = 0.4 (40% buy butter overall)
- lift(Bread -> Butter) = 0.6 / 0.4 = 1.5
Interpretation:
- lift = 1.0: No association (X and Y are independent)
- lift > 1.0: Positive association (Y more likely with X)
- 1.5 = 50% increase in likelihood
- 2.0 = 2x more likely (100% increase)
- lift < 1.0: Negative association (Y less likely with X)
Why Lift is Best for Discovery:
- Accounts for item popularity
- Detects true associations vs. coincidence
- Symmetric: lift(X -> Y) = lift(Y -> X)
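The worked example above, in code:

```python
confidence_bread_butter = 0.6            # confidence(Bread -> Butter)
support_butter = 0.4                     # 40% buy butter overall
lift = confidence_bread_butter / support_butter
# lift is about 1.5: buying bread makes butter 50% more likely
```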
Leverage
Definition: Difference between observed and expected co-occurrence.
Formula: leverage(X -> Y) = support(X U Y) - support(X) x support(Y)
Example:
- support([Bread, Butter]) = 0.3 (observed)
- support(Bread) x support(Butter) = 0.5 x 0.4 = 0.2 (expected if independent)
- leverage = 0.3 - 0.2 = 0.1
Interpretation:
- 0: No association
- Positive: Items appear together more than expected
- Negative: Items appear together less than expected
- Magnitude matters: Higher absolute value = stronger relationship
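The worked example above, in code:

```python
support_bread_butter = 0.3               # observed co-occurrence
support_bread, support_butter = 0.5, 0.4
expected = support_bread * support_butter    # 0.2 if independent
leverage = support_bread_butter - expected   # about 0.1
```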
Conviction
Definition: Dependency measure - how strongly Y depends on X.
Formula: conviction(X -> Y) = (1 - support(Y)) / (1 - confidence(X -> Y))
Example:
- support(Butter) = 0.4
- confidence(Bread -> Butter) = 0.6
- conviction = (1 - 0.4) / (1 - 0.6) = 0.6 / 0.4 = 1.5
Interpretation:
- 1.0: No association (independent)
- >1.0: Y depends on X
- infinity: Perfect dependency (always Y when X)
Use: Measures how much the rule deviates from independence
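The worked example above, in code:

```python
support_butter = 0.4
confidence_bread_butter = 0.6            # confidence(Bread -> Butter)
conviction = (1 - support_butter) / (1 - confidence_bread_butter)
# conviction is about 1.5: the rule fails 1.5x less often than
# it would if Bread and Butter were independent
```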
Configuration Tips
Best Practices for FP-Growth
Default Choice:
- FP-Growth should be your go-to algorithm
- Fastest for most real-world scenarios
- Handles large datasets efficiently
Optimal Settings:
- min_support = 0.01-0.02 for large datasets
- max_length = 3 (good balance of depth and speed)
- rule_metric = "lift" with threshold 1.5
Performance Advantages:
- Can handle min_support as low as 0.001 efficiently
- Scales well to millions of transactions
- Memory efficient compared to Apriori
When FP-Growth Excels:
- Large transaction databases
- Need to find rare patterns (low support)
- Production recommendation systems
- Real-time or near-real-time analysis
Common Issues and Solutions
Results Differ from Apriori
Symptom: FP-Growth finds different itemsets than Apriori
Explanation: Both algorithms should find the same frequent itemsets (above min_support threshold). If results differ:
- Check that parameters match exactly
- Verify data preprocessing is identical
- Ensure min_support threshold is the same
Note: Order of itemsets/rules may differ, but the set should be identical
Memory Issues
Symptom: Out of memory errors or slow performance
Solutions:
- Increase min_support
- Reduce max_length
- Process data in segments using segmentation
- Filter to fewer items before mining
Too Many Patterns
Symptom: Thousands of itemsets and rules generated
Solutions:
- Increase min_support (0.01 -> 0.02)
- Enable advanced filtering
- Set higher lift threshold (1.5+)
- Focus on max_length = 2
Slow Performance Despite FP-Growth
Symptom: FP-Growth takes longer than expected
Possible Causes:
- Very low min_support (<0.001)
- Very high max_length (>4)
- Extremely large dataset
- Dense transactions (many items per transaction)
Solutions:
- Increase min_support slightly
- Reduce max_length
- Pre-filter rare items
- Consider using Relim for memory-constrained environments