Apriori Algorithm in ML
The Apriori algorithm, introduced by R. Agrawal and R. Srikant in 1994, identifies frequent itemsets in a dataset for association rule mining. It is named 'Apriori' because it uses prior knowledge of frequent itemset properties: an iterative, level-wise search in which frequent k-itemsets are used to generate candidate (k+1)-itemsets. The Apriori property, that every subset of a frequent itemset must itself be frequent, prunes the search space and keeps this approach efficient.
Operating on transactional databases, the algorithm extracts association rules that quantify relationships between items. It uses breadth-first search, together with a hash tree structure for counting candidate itemsets efficiently.
Predominantly utilized for market basket analysis, Apriori identifies product combinations that are frequently bought together, but its applicability extends to sectors like healthcare, for example pinpointing potential drug interactions for patients.
Important Terms for Implementing the Apriori Algorithm
- Confidence
- Measures how often item b appears in transactions that contain item a.
- Confidence(a -> b) = (number of transactions containing both a and b) / (number of transactions containing a)
- Support(a)
- Support(a) = (number of transactions in which a appears) / (total number of transactions)
- Frequent itemset
- An itemset whose support is greater than or equal to the threshold, i.e. the user-specified minimum support.
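To make these definitions concrete, here is a minimal Python sketch that computes support and confidence directly from a list of transactions (the data is the same as in the worked example later in this article):

```python
# A minimal sketch of the two definitions above.
# Transactions are represented as Python sets of items.
transactions = [
    {"a", "b"},
    {"a", "b", "c"},
    {"a", "b", "c", "e"},
    {"b", "c", "d"},
    {"a", "d"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(a, b):
    """How often itemset `b` appears in transactions that contain itemset `a`."""
    return support(a | b) / support(a)

print(support({"a"}))            # 4/5 = 0.8
print(confidence({"a"}, {"b"}))  # 3/4 = 0.75
```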
Steps of the Apriori Algorithm
1. Select the minimum support and confidence thresholds for our transactional database.
2. Find all itemsets whose support is greater than or equal to the minimum support value (the frequent itemsets), as sketched in the code below.
3. From these frequent itemsets, identify all rules whose confidence is greater than or equal to the minimum confidence value.
4. Arrange the rules in descending order of lift.
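Steps 1 and 2, the level-wise frequent-itemset search, can be sketched in plain Python. This is an illustrative, unoptimized sketch (a production implementation would use a hash tree for candidate counting); the function name and the use of an absolute support count are choices made here, not fixed by the algorithm:

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support_count):
    """Return every frequent itemset as a dict {frozenset: support count}."""
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items}   # level 1: candidate single items
    frequent = {}
    while level:
        # One pass over the database: count the support of each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        # Prune step: drop candidates below the minimum support count.
        survivors = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(survivors)
        # Join step: build (k+1)-item candidates from the frequent k-itemsets.
        level = {a | b for a, b in combinations(survivors, 2)
                 if len(a | b) == len(a) + 1}
    return frequent
```

Applied to the dataset in the next section with min_support_count=2, this returns exactly the frequent itemsets derived by hand below.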
Working of the Apriori Algorithm
Let's understand this with the help of an example.
We are given the following dataset; using the Apriori method, we must find the frequent itemsets and generate association rules:
| Transaction ID | ItemSet |
|---|---|
| T1 | a, b |
| T2 | a, b, c |
| T3 | a, b, c, e |
| T4 | b, c, d |
| T5 | a, d |
Minimum support count = 2 and minimum confidence = 60%.
Solution
1. Create a table that contains the support count (the frequency) of each individual item.
| Item Set | Support Count |
|---|---|
| a | 4 |
| b | 4 |
| c | 3 |
| d | 2 |
| e | 1 |
After removing itemsets with a support count below the minimum support, we get
| Item Set | Support Count |
|---|---|
| a | 4 |
| b | 4 |
| c | 3 |
| d | 2 |
2. Create a table that contains the support counts of all pairs formed from the frequent items in the final table of step 1.
| Item Set | Support Count |
|---|---|
| a, b | 3 |
| a, c | 2 |
| a, d | 1 |
| b, c | 3 |
| b, d | 1 |
| c, d | 1 |
After removing itemsets with a support count below the minimum support, we get
| Item Set | Support Count |
|---|---|
| a, b | 3 |
| a, c | 2 |
| b, c | 3 |
3. Create a table that contains the support counts of candidate triplets formed from the frequent pairs in the final table of step 2.
| Item Set | Support Count |
|---|---|
| a, b, c | 2 |
| b, c, d | 1 |
After removing itemsets with a support count below the minimum support, we get (note that {b, c, d} could have been pruned without counting, since its subsets {b, d} and {c, d} are already known to be infrequent)
| Item Set | Support Count |
|---|---|
| a, b, c | 2 |
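Running the sketch from the steps section on this dataset confirms the hand computation (this assumes the apriori_frequent_itemsets function defined earlier):

```python
transactions = [
    {"a", "b"}, {"a", "b", "c"}, {"a", "b", "c", "e"},
    {"b", "c", "d"}, {"a", "d"},
]
# Print the frequent itemsets, smallest first; the output matches the
# tables above: a:4, b:4, c:3, d:2, ab:3, ac:2, bc:3, abc:2.
for itemset, n in sorted(apriori_frequent_itemsets(transactions, 2).items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```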
4. Find the association rules for the subsets. Create a new table with all possible rules from the frequent itemset {a, b, c}.
| Rules | Support | Confidence |
|---|---|---|
| {a, b} -> c | 2 | 2/3 = 66.67% |
| {b, c} -> a | 2 | 2/3 = 66.67% |
| {a, c} -> b | 2 | 2/2 = 100% |
| a -> {b, c} | 2 | 2/4 = 50% |
| b -> {a, c} | 2 | 2/4 = 50% |
| c -> {a, b} | 2 | 2/3 = 66.67% |
After removing rules with confidence less than the minimum confidence (60%), we get
| Rules | Support | Confidence |
|---|---|---|
| {a, b} -> c | 2 | 2/3 = 66.67% |
| {b, c} -> a | 2 | 2/3 = 66.67% |
| {a, c} -> b | 2 | 2/2 = 100% |
| c -> {a, b} | 2 | 2/3 = 66.67% |
Now we can consider {a, b} -> c, {b, c} -> a, {a, c} -> b, and c -> {a, b} as strong association rules for the given problem.
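As a cross-check, the rule table can be reproduced with a short script. This is an illustrative sketch; count is a small helper defined here, not a library function:

```python
from itertools import combinations

transactions = [
    {"a", "b"}, {"a", "b", "c"}, {"a", "b", "c", "e"},
    {"b", "c", "d"}, {"a", "d"},
]

def count(itemset):
    """Support count: number of transactions containing `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

frequent = frozenset({"a", "b", "c"})   # the frequent 3-itemset from step 3
min_confidence = 0.6                    # 60%, as given in the problem

# Every non-empty proper subset of the itemset is a candidate antecedent.
for r in range(1, len(frequent)):
    for antecedent in map(frozenset, combinations(sorted(frequent), r)):
        consequent = frequent - antecedent
        conf = count(frequent) / count(antecedent)
        if conf >= min_confidence:
            print(f"{set(antecedent)} -> {set(consequent)}: "
                  f"support={count(frequent)}, confidence={conf:.2%}")
```

This prints exactly the four strong rules in the final table above.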
Advantages of the Apriori Algorithm
1. The algorithm is simple to comprehend.
2. Its join and prune steps are simple to implement on big datasets.
Disadvantages of Apriori Algorithm
1. In comparison to other frequent-itemset algorithms, the Apriori algorithm is slow.
2. Because it scans the database once for every candidate level, overall performance may suffer.
3. The Apriori algorithm has a worst-case time and space complexity of O(2^D), which is extremely high; D is the horizontal width (the number of distinct items) of the database. With D = 20 items, for example, there are already more than a million possible itemsets.
Python Implementation of Apriori Algorithm
1. transactions: a list of transactions, where each transaction is a list of items.
2. min_support: the minimum support as a float. We've used 0.005 in this case.
3. min_confidence: the minimum confidence value. We've taken 0.2 in this case; it can be altered to suit the needs of the business.
4. min_lift: the minimum value of lift for a rule to be reported.
5. min_length: the minimum number of items required in a rule.
6. max_length: the maximum number of items allowed in a rule.
Code:
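The article does not name a library, but the parameter list above matches the open-source apyori package (pip install apyori), so the sketch below assumes it. apyori's apriori() documents min_support, min_confidence, min_lift, and max_length as keyword arguments; min_length, listed above, appears in many tutorials but is not among apyori's documented keywords, so it is omitted here. The min_lift and max_length values are illustrative:

```python
# Sketch assuming the third-party apyori package: pip install apyori
from apyori import apriori

# The same transactions as in the worked example above.
transactions = [
    ["a", "b"],
    ["a", "b", "c"],
    ["a", "b", "c", "e"],
    ["b", "c", "d"],
    ["a", "d"],
]

# min_support and min_confidence follow the values quoted above;
# min_lift and max_length are illustrative choices.
rules = apriori(
    transactions,
    min_support=0.005,
    min_confidence=0.2,
    min_lift=1.0,
    max_length=3,
)

# apriori() yields RelationRecord results lazily; each record carries
# one or more ordered statistics (antecedent, consequent, metrics).
for record in rules:
    for stat in record.ordered_statistics:
        print(f"{set(stat.items_base)} -> {set(stat.items_add)}: "
              f"support={record.support:.3f}, "
              f"confidence={stat.confidence:.3f}, lift={stat.lift:.3f}")
```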
Applications of the Apriori Algorithm
1. In education: extracting association rules from data on admitted students based on their features and specialties.
2. In the medical field: for example, analysis of patient databases.
3. In forestry: analysis of the probability and intensity of forest fires from forest fire data.
4. Many firms employ this kind of association mining; commonly cited examples include Amazon's recommender system and Google's auto-complete feature.
Conclusion
- The Apriori algorithm mines frequent itemsets level by level, scanning the database once per itemset size.
- It significantly reduces the number of candidate itemsets via the Apriori property while maintaining decent performance. As a result, this kind of data mining aids consumers and businesses in making better decisions.