FP-Growth
Fast algorithm that uses an FP-tree (Frequent Pattern tree) data structure for efficient pattern mining. It avoids candidate generation by compressing the database into a compact tree.
When to Use FP-Growth
- Best choice for most use cases
- Medium to large datasets (>10k transactions)
- Need faster performance than Apriori
- Production systems
Strengths
- Much faster than Apriori
- No candidate generation required
- Efficient memory use
- Scales well to large datasets
- Only two database scans
Weaknesses
- More complex to understand
- Less intuitive than Apriori
- Algorithm internals harder to explain
How it Works
- Scan database, count item frequencies
- Build FP-tree (compressed representation of database)
- Sort items by frequency in each transaction
- Share common prefixes in tree structure
- Mine frequent patterns from FP-tree recursively
- Extract conditional pattern bases
- Build conditional FP-trees
- Generate rules from patterns
Key Advantage: Only scans database twice (vs. multiple scans for Apriori). The FP-tree compresses the entire database into a compact tree structure, allowing pattern mining without repeated database access.
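The build-and-mine steps above can be sketched in pure Python. This is an illustration only: the names (`Node`, `build_tree`, `mine`) are ours, not any library's API, and a real implementation would add further optimizations (e.g. node-link chains in the header table).

```python
from collections import defaultdict

class Node:
    """One FP-tree node: an item, its count, parent link, and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(transactions, min_count):
    # Scan 1: count item frequencies, drop infrequent items
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_count}
    # Scan 2: insert each transaction with items sorted by frequency,
    # so transactions sharing frequent prefixes share tree paths
    root = Node(None, None)
    header = defaultdict(list)            # item -> all nodes for that item
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq),
                           key=lambda i: (-freq[i], i)):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return header, freq

def mine(header, freq, min_count, suffix=()):
    # Recursively extract conditional pattern bases and mine them
    patterns = {}
    for item in sorted(freq, key=freq.get):
        new_suffix = (item,) + suffix
        patterns[frozenset(new_suffix)] = freq[item]
        cond_base = []                    # prefix paths leading to `item`
        for node in header[item]:
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_base.extend([path] * node.count)
        patterns.update(mine(*build_tree(cond_base, min_count),
                             min_count, new_suffix))
    return patterns

transactions = [["Bread", "Milk"], ["Bread", "Butter"],
                ["Bread", "Milk", "Butter"], ["Milk", "Eggs"]]
patterns = mine(*build_tree(transactions, 2), 2)
# patterns maps each frequent itemset to its absolute support count
```

Note that `min_count` is the absolute form of the minimum support: min_support multiplied by the number of transactions.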
When to Choose Over Apriori
- Datasets >10k transactions
- Need faster results
- Low min_support values (finds rare patterns efficiently)
- Production systems where performance matters
- Large number of items or transactions
Parameters
All association algorithms share these common parameters:
Data Format
Input Format: 'long' or 'wide'
How your transaction data is structured:
Wide Format:
- Each column represents one item
- Each row is a transaction
- Values are 1 (item present) or 0 (item absent)
- Example:
TransactionID | Bread | Milk | Eggs | Butter
1             | 1     | 1    | 0    | 1
2             | 0     | 1    | 1    | 0
Long Format:
- Each row is one item in a transaction
- Requires Transaction ID column to group items
- More natural for real-world data
- Example:
TransactionID | Item
1             | Bread
1             | Milk
1             | Butter
2             | Milk
2             | Eggs
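As a sketch of how the two formats relate, long-format rows can be pivoted into a wide 0/1 matrix in a few lines of plain Python (in practice a dataframe library would do this; the variable names here are ours):

```python
# Long format: one (transaction ID, item) pair per row
long_rows = [(1, "Bread"), (1, "Milk"), (1, "Butter"),
             (2, "Milk"), (2, "Eggs")]

items = sorted({item for _, item in long_rows})      # wide-format columns
baskets = {}                                         # tid -> set of items
for tid, item in long_rows:
    baskets.setdefault(tid, set()).add(item)

# Wide format: one 0/1 row per transaction, columns follow `items`
wide = {tid: [1 if i in basket else 0 for i in items]
        for tid, basket in sorted(baskets.items())}
```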
Feature Configuration
Feature Columns (required)
- Wide format: List all item columns
- Long format: Select the single column containing item names
Transaction ID Column (required for long format) Column that identifies which transaction each item belongs to.
Contains Multiple Items (long format only) Check if a single row can contain multiple items (e.g., "Bread, Milk, Eggs").
Item Separator (if Contains Multiple Items is enabled) Character separating multiple items (default: comma).
- Example: "Bread, Milk, Eggs" uses "," as separator
Segmentation (Optional)
Segmentation Column Analyze different customer segments separately:
- Store locations (downtown vs. suburban)
- Customer types (premium vs. regular)
- Time periods (weekday vs. weekend)
Target Segment Value Filter to analyze only specific segment.
Model Parameters
Minimum Support (default: 0.02, required) Threshold for how frequently an itemset must appear.
- 0.02 = 2% of transactions
- Lower values: Find rare patterns, but slower and more results
- Higher values: Only common patterns, faster
- Recommendations:
- Large stores (>10k transactions): 0.001-0.01 (0.1%-1%)
- Medium stores: 0.01-0.05 (1%-5%)
- Small datasets: 0.05-0.1 (5%-10%)
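A quick sanity check when picking a threshold: a support fraction translates directly into a minimum absolute transaction count (a back-of-envelope sketch, using the documented default):

```python
n_transactions = 10_000
min_support = 0.02                       # the documented default (2%)
min_count = int(min_support * n_transactions)
# An itemset must appear in at least 200 of the 10,000 transactions
```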
Maximum Itemset Length (default: 3, required) Maximum number of items in a pattern.
- 2: Pairs only (A -> B)
- 3: Triples (A, B -> C)
- 4+: Complex patterns (slower, harder to interpret)
- Recommendations:
- Start with 2-3 for interpretability
- Increase only if needed
Rule Evaluation Metric (default: "lift", required) How to measure rule strength:
- lift: Strength of association (recommended)
- confidence: Reliability of rule
- leverage: Lift adjusted by item frequencies
- conviction: Dependency strength
Metric Threshold (default: 1.2, required) Minimum value for the selected metric to keep a rule.
- For lift: >1.0 (1.2 = 20% more likely)
- For confidence: 0.5-0.9 (50%-90% probability)
Advanced Filtering (Optional)
Enable Advanced Filtering Set both confidence and lift thresholds simultaneously for stricter rules.
Minimum Confidence (default: 0.6) Probability that Y is purchased given X is purchased.
- 0.6 = 60% of transactions with X also have Y
- Range: 0.1-1.0
Minimum Lift (default: 1.1) How much more likely Y is with X versus without X.
- 1.0 = No association (independent)
- 1.1 = 10% increase in likelihood
- 2.0 = 2x more likely
- Range: >0.0 (typically >1.0 for meaningful rules)
Understanding Association Metrics
Support
Definition: How frequently an itemset appears in the database.
Formula: support(X) = (transactions containing X) / (total transactions)
Example:
- 100 transactions total
- [Bread, Milk] appears in 20 transactions
- support([Bread, Milk]) = 20/100 = 0.2 = 20%
Interpretation:
- 0.01 (1%): Rare pattern
- 0.05 (5%): Moderate frequency
- 0.2 (20%): Very common pattern
Use: Filter out rare, potentially spurious patterns
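The worked example above, in code (support is a simple ratio):

```python
total_transactions = 100
containing_bread_milk = 20               # transactions with {Bread, Milk}
support = containing_bread_milk / total_transactions
# support is 0.2, i.e. 20% of transactions contain {Bread, Milk}
```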
Confidence
Definition: Probability of finding Y in transactions that contain X.
Formula: confidence(X -> Y) = support(X U Y) / support(X)
Example:
- support([Bread]) = 0.5 (50% of transactions)
- support([Bread, Butter]) = 0.3 (30% of transactions)
- confidence(Bread -> Butter) = 0.3 / 0.5 = 0.6 = 60%
Interpretation:
- 0.6 = 60% of customers who buy bread also buy butter
- Higher confidence = more reliable rule
Limitation: Can be misleading if Y is very common
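The worked example above, in code:

```python
support_bread = 0.5                      # 50% of transactions have Bread
support_bread_butter = 0.3               # 30% have both Bread and Butter
confidence = support_bread_butter / support_bread
# confidence is 0.6: 60% of bread buyers also buy butter
```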
Lift
Definition: How much more likely Y is with X versus without X.
Formula: lift(X -> Y) = confidence(X -> Y) / support(Y)
Example:
- confidence(Bread -> Butter) = 0.6
- support(Butter) = 0.4 (40% buy butter overall)
- lift(Bread -> Butter) = 0.6 / 0.4 = 1.5
Interpretation:
- lift = 1.0: No association (X and Y are independent)
- lift > 1.0: Positive association (Y more likely with X)
- 1.5 = 50% increase in likelihood
- 2.0 = 2x more likely (100% increase)
- lift < 1.0: Negative association (Y less likely with X)
Why Lift is Best for Discovery:
- Accounts for item popularity
- Detects true associations vs. coincidence
- Symmetric: lift(X -> Y) = lift(Y -> X)
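The worked example above, in code:

```python
confidence_bread_butter = 0.6            # confidence(Bread -> Butter)
support_butter = 0.4                     # 40% buy butter overall
lift = confidence_bread_butter / support_butter
# lift is about 1.5: buying bread makes butter 50% more likely
```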
Leverage
Definition: Difference between observed and expected co-occurrence.
Formula: leverage(X -> Y) = support(X U Y) - support(X) x support(Y)
Example:
- support([Bread, Butter]) = 0.3 (observed)
- support(Bread) x support(Butter) = 0.5 x 0.4 = 0.2 (expected if independent)
- leverage = 0.3 - 0.2 = 0.1
Interpretation:
- 0: No association
- Positive: Items appear together more than expected
- Negative: Items appear together less than expected
- Magnitude matters: Higher absolute value = stronger relationship
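The worked example above, in code:

```python
support_bread_butter = 0.3               # observed co-occurrence
support_bread, support_butter = 0.5, 0.4
expected = support_bread * support_butter    # 0.2 if independent
leverage = support_bread_butter - expected   # about 0.1
```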
Conviction
Definition: Dependency measure - how strongly Y depends on X.
Formula: conviction(X -> Y) = (1 - support(Y)) / (1 - confidence(X -> Y))
Example:
- support(Butter) = 0.4
- confidence(Bread -> Butter) = 0.6
- conviction = (1 - 0.4) / (1 - 0.6) = 0.6 / 0.4 = 1.5
Interpretation:
- 1.0: No association (independent)
- >1.0: Y depends on X
- infinity: Perfect dependency (always Y when X)
Use: Measures how much the rule deviates from independence
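The worked example above, in code:

```python
support_butter = 0.4
confidence_bread_butter = 0.6            # confidence(Bread -> Butter)
conviction = (1 - support_butter) / (1 - confidence_bread_butter)
# conviction is about 1.5: the rule fails 1.5x less often than
# it would if Bread and Butter were independent
```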
Configuration Tips
Best Practices for FP-Growth
Default Choice:
- FP-Growth should be your go-to algorithm
- Fastest for most real-world scenarios
- Handles large datasets efficiently
Optimal Settings:
- min_support = 0.01-0.02 for large datasets
- max_length = 3 (good balance of depth and speed)
- rule_metric = "lift" with threshold 1.5
Performance Advantages:
- Can handle min_support as low as 0.001 efficiently
- Scales well to millions of transactions
- Memory efficient compared to Apriori
When FP-Growth Excels:
- Large transaction databases
- Need to find rare patterns (low support)
- Production recommendation systems
- Real-time or near-real-time analysis
Common Issues and Solutions
Results Differ from Apriori
Symptom: FP-Growth finds different itemsets than Apriori
Explanation: Both algorithms should find the same frequent itemsets (above min_support threshold). If results differ:
- Check that parameters match exactly
- Verify data preprocessing is identical
- Ensure min_support threshold is the same
Note: Order of itemsets/rules may differ, but the set should be identical
Memory Issues
Symptom: Out of memory errors or slow performance
Solutions:
- Increase min_support
- Reduce max_length
- Process data in segments using segmentation
- Filter to fewer items before mining
Too Many Patterns
Symptom: Thousands of itemsets and rules generated
Solutions:
- Increase min_support (0.01 -> 0.02)
- Enable advanced filtering
- Set higher lift threshold (1.5+)
- Focus on max_length = 2
Slow Performance Despite FP-Growth
Symptom: FP-Growth takes longer than expected
Possible Causes:
- Very low min_support (<0.001)
- Very high max_length (>4)
- Extremely large dataset
- Dense transactions (many items per transaction)
Solutions:
- Increase min_support slightly
- Reduce max_length
- Pre-filter rare items
- Consider using Relim for memory-constrained environments