This article primarily demonstrates how to do market basket analysis using Python, along with a primer on association rules.
What is Market Basket Analysis?
Market Basket Analysis is a modelling technique based on the theory of association rules: if you buy a certain group of items, you are more (or less) likely to buy another group of items. For example, a customer who buys milk and eggs is more likely to buy bread than a customer who buys neither. Market basket analysis seeks to find relationships between the purchases customers make. This matters to supermarkets and online stores, which use it for product placement and product recommendations.
How does Market Basket Analysis work?
Market basket analysis is one of the key techniques used by large retailers to uncover associations between items. This helps in understanding customer behavior to a certain extent and enables retailers to make recommendations based on the past purchases of other customers.
Market basket analysis is primarily based on rules, association rules in particular. A good example: if a customer has already chosen peanut butter and jelly, how likely are they to buy bread as well? A statement of the form “if item A is bought, item B is likely to be bought too” is called an association rule. The rule can be represented as –
{peanut butter, jelly} => {bread}
Why Association Rules?
This is indeed an interesting question. When there are multiple techniques available, such as SVMs, random forests, clustering and so on, why are association rules preferred for market basket analysis? Some of the drawbacks that come with those techniques are –
- Tuning such algorithms can be quite hard.
- These algorithms tend to require quite a large amount of data to give good recommendations.
- They also require quite a bit of feature engineering.
This is where association rules have an edge over the other techniques –
- They are relatively fast.
- They work well on small quantities of data.
- They do not require much feature engineering.
Some of the important terms involved with market basket analysis –
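Support – This is the relative frequency of an item (or set of items) in the transaction data. Support for an item A can be calculated as –
support(A) = (number of transactions containing A) / (total number of transactions)
Confidence – It is the probability of seeing the consequent in a transaction given that it also contains the antecedent. For a rule A => C, A is the antecedent, whereas C is the consequent. Confidence is also a measure of the reliability of a rule.
confidence(A => C) = support(A and C together) / support(A)
Lift – Lift is a metric which measures how much more often the antecedent and consequent occur together than would be expected if they were independent.
lift(A => C) = confidence(A => C) / support(C)
A lift of 1 means the antecedent and consequent are independent; rules with a lift above 1 are the ones usually considered.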
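To make the three terms concrete, here is a minimal pure-Python sketch that computes them for the rule {peanut butter, jelly} => {bread}. The transactions are made up for illustration, and the helper names (support, confidence, lift) are my own rather than part of any library.

```python
# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"peanut butter", "jelly", "bread"},
    {"peanut butter", "jelly", "bread"},
    {"peanut butter", "jelly"},
    {"bread", "milk"},
    {"milk", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """confidence(A => C) / support(C); a value of 1 suggests independence."""
    return confidence(antecedent, consequent, transactions) / support(
        consequent, transactions
    )


A = {"peanut butter", "jelly"}
C = {"bread"}

print(support(A | C, transactions))    # 2 of 5 transactions contain all three items
print(confidence(A, C, transactions))  # about 0.67: 2 of the 3 baskets with A also have bread
print(lift(A, C, transactions))        # about 1.11: slightly above independence
```

In this toy data the lift is just above 1, so peanut butter and jelly buyers pick up bread a little more often than shoppers in general.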
Performing Market Basket Analysis Using Python
Since we need to perform association analysis, there is a very good Python package for the job. It is named MLxtend, and its syntax is similar to that of scikit-learn. Let’s install MLxtend –
- If you are using the pip package manager – pip install mlxtend
- Alternatively, if you are using the conda package manager – conda install mlxtend
For explicit directions on how to install mlxtend, feel free to visit this page.
Once you are done installing MLxtend, it’s time to proceed to the next step – reading in the data. Now, it’s important to get the data in the right format. Transaction data is usually in the below format –
The above data is a sample dataset from a convenience store and will not be used for analysis in this article; I have prepared a different set of data to work with the algorithm. Each row in the above data set represents a transaction. The first row, for example, is a transaction in which the items purchased were citrus fruit, semi-finished bread, margarine and ready soups. For the apriori algorithm to function, the transaction data must be converted to a one-hot encoded (sparse) matrix, with one row per transaction and one column per item, in the format shown below –
This transformation can be achieved by using the TransactionEncoder class from mlxtend.preprocessing. The code for performing the encoding is – TransactionEncoder().fit(dataset).transform(dataset). For working with the algorithm, I have already transformed a new set of data into this matrix format. Let’s start with the analysis –
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
Now let’s read in the data –
basket_sets = pd.read_csv('a.csv')
basket_sets.head()
The above snippet is basically transaction data from a hardware store. We have the invoice number and the items as columns and each row represents a transaction. The InvoiceNo column needs to be dropped since it does not add any value to the analysis.
basket_sets = basket_sets.drop('InvoiceNo', axis=1)
basket_sets.head()
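One thing worth checking at this point is that the matrix really holds 0/1 (or True/False) flags. If the columns hold purchase quantities instead, a support computed as a column average can come out above 1, which is not a valid relative frequency. A minimal sketch of the check and the fix, using a made-up two-column frame (the names quantities and basket_flags are hypothetical):

```python
import pandas as pd

# Hypothetical basket matrix holding quantities instead of 0/1 flags.
quantities = pd.DataFrame(
    {"POSTAGE": [1, 0, 3], "SPACEBOY LUNCH BOX": [0, 12, 0]}
)

# Averaging raw quantities can exceed 1, so it is not a usable support.
print(quantities.mean())

# Any positive quantity becomes True, meaning "this item is in the basket".
basket_flags = quantities > 0

# The column means of the flags are now proper fractions between 0 and 1.
print(basket_flags.mean())
```

The same comparison can be applied to a full basket matrix before passing it to apriori.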
Cool, the InvoiceNo column is removed. Now let’s find out which items have a support of at least 0.6.
apriori(basket_sets, min_support=0.6)

     support  itemsets
 0  0.728850  [0]
 1  0.659436  [25]
 2  0.904555  [35]
 3  0.624729  [46]
 4  0.735358  [61]
 5  0.785249  [64]
 6  0.685466  [65]
 7  2.611714  [81]
 8  1.015184  [146]
 9  0.867679  [225]
10  0.824295  [238]
11  0.629067  [239]
12  0.607375  [240]
13  0.737527  [308]
14  0.637744  [315]
15  0.646421  [317]
16  0.624729  [349]
17  1.171367  [366]
18  0.780911  [542]
19  0.911063  [595]
20  1.145336  [598]
21  0.806941  [601]
22  0.780911  [614]
23  0.655098  [626]
24  0.650759  [627]
25  1.388286  [631]
26  0.954447  [633]
27  1.056399  [639]
28  0.650759  [641]
29  1.219089  [697]
...      ...  ...
64  1.017354  [1027]
65  0.631236  [1035]
66  2.850325  [1057]
67  0.607375  [1086]
68  0.969631  [1090]
69  0.709328  [1119]
70  1.373102  [1121]
71  0.659436  [1168]
72  0.809111  [1179]
73  0.683297  [1234]
74  0.676790  [1235]
75  1.301518  [1239]
76  1.492408  [1240]
77  0.728850  [1243]
78  1.041215  [1245]
79  1.613883  [1246]
80  2.082430  [1248]
81  2.759219  [1267]
82  2.420824  [1268]
83  1.041215  [1302]
84  0.806941  [1321]
85  1.787419  [1326]
86  0.676790  [1339]
87  0.702820  [1350]
88  1.093275  [1372]
89  0.661605  [1523]
90  1.145336  [1535]
91  0.704989  [1536]
92  0.704989  [1540]
93  0.759219  [1551]
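Now there are 94 itemsets (rows 0 to 93) with a support of 0.6 and above. Note that several of the values are greater than 1, which a true support (a fraction of transactions) can never be; this suggests the matrix holds purchase quantities rather than 0/1 flags. The above table also has item indices and not the item names. To get the item names, the following command can be passed –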
apriori(basket_sets, min_support=0.6, use_colnames=True)

     support  itemsets
 0  0.728850  [10 COLOUR SPACEBOY PEN]
 1  0.659436  [36 PENCILS TUBE RED RETROSPOT]
 2  0.904555  [4 TRADITIONAL SPINNING TOPS]
 3  0.624729  [60 TEATIME FAIRY CAKE CASES]
 4  0.735358  [ALARM CLOCK BAKELIKE GREEN]
 5  0.785249  [ALARM CLOCK BAKELIKE PINK]
 6  0.685466  [ALARM CLOCK BAKELIKE RED]
 7  2.611714  [ASSORTED COLOUR BIRD ORNAMENT]
 8  1.015184  [BLUE HARMONICA IN BOX]
 9  0.867679  [CARTOON PENCIL SHARPENERS]
10  0.824295  [CHARLOTTE BAG APPLES DESIGN]
11  0.629067  [CHARLOTTE BAG DOLLY GIRL DESIGN]
12  0.607375  [CHARLOTTE BAG PINK POLKADOT]
13  0.737527  [CIRCUS PARADE LUNCH BOX]
14  0.637744  [CLOTHES PEGS RETROSPOT PACK 24]
15  0.646421  [COFFEE MUG APPLES DESIGN]
16  0.624729  [DINOSAUR KEYRINGS ASSORTED]
17  1.171367  [DOLLY GIRL LUNCH BOX]
18  0.780911  [GUMBALL COAT RACK]
19  0.911063  [ICE CREAM BUBBLES]
20  1.145336  [ICE CREAM SUNDAE LIP GLOSS]
21  0.806941  [INFLATABLE POLITICAL GLOBE]
22  0.780911  [JAM MAKING SET PRINTED]
23  0.655098  [JUMBO BAG APPLES]
24  0.650759  [JUMBO BAG DOILEY PATTERNS]
25  1.388286  [JUMBO BAG PINK POLKADOT]
26  0.954447  [JUMBO BAG RED RETROSPOT]
27  1.056399  [JUMBO BAG VINTAGE DOILY]
28  0.650759  [JUMBO BAG WOODLAND ANIMALS]
29  1.219089  [LUNCH BAG APPLE DESIGN]
...      ...  ...
64  1.017354  [RED RETROSPOT CHARLOTTE BAG]
65  0.631236  [RED RETROSPOT PICNIC BAG]
66  2.850325  [RED TOADSTOOL LED NIGHT LIGHT]
67  0.607375  [RETROSPOT PARTY BAG + STICKER SET]
68  0.969631  [REVOLVER WOODEN RULER]
69  0.709328  [ROUND SNACK BOXES SET OF 4 FRUITS]
70  1.373102  [ROUND SNACK BOXES SET OF4 WOODLAND]
71  0.659436  [SET OF 12 FAIRY CAKE BAKING CASES]
72  0.809111  [SET OF 20 KIDS COOKIE CUTTERS]
73  0.683297  [SET OF 60 I LOVE LONDON CAKE CASES]
74  0.676790  [SET OF 60 PANTRY DESIGN CAKE CASES]
75  1.301518  [SET OF 9 BLACK SKULL BALLOONS]
76  1.492408  [SET OF 9 HEART SHAPED BALLOONS]
77  0.728850  [SET/10 BLUE POLKADOT PARTY CANDLES]
78  1.041215  [SET/10 PINK POLKADOT PARTY CANDLES]
79  1.613883  [SET/10 RED POLKADOT PARTY CANDLES]
80  2.082430  [SET/20 RED RETROSPOT PAPER NAPKINS]
81  2.759219  [SET/6 RED SPOTTY PAPER CUPS]
82  2.420824  [SET/6 RED SPOTTY PAPER PLATES]
83  1.041215  [SMALL RED RETROSPOT WINDMILL]
84  0.806941  [SPACEBOY BIRTHDAY CARD]
85  1.787419  [SPACEBOY LUNCH BOX]
86  0.676790  [STARS GIFT TAPE]
87  0.702820  [STRAWBERRY LUNCH BOX WITH CUTLERY]
88  1.093275  [TEA PARTY BIRTHDAY CARD]
89  0.661605  [WOODLAND CHARLOTTE BAG]
90  1.145336  [WORLD WAR 2 GLIDERS ASSTD DESIGNS]
91  0.704989  [WRAP VINTAGE DOILY]
92  0.704989  [WRAP CHRISTMAS VILLAGE]
93  0.759219  [WRAP RED APPLES]
Hmm, that’s much better. Let’s try decreasing the support value, since a minimum support of 0.6 returns only single-item itemsets.
df = basket_sets
frequent_itemsets = apriori(df, min_support=0.06, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets

      support  itemsets  length
  0  0.728850  [10 COLOUR SPACEBOY PEN]  1
  1  0.260304  [12 COLOURED PARTY BALLOONS]  1
  2  0.427332  [12 PENCIL SMALL TUBE WOODLAND]  1
  3  0.323210  [12 PENCILS SMALL TUBE RED RETROSPOT]  1
  4  0.336226  [12 PENCILS SMALL TUBE SKULL]  1
  5  0.227766  [12 PENCILS TALL TUBE RED RETROSPOT]  1
  6  0.156182  [12 PENCILS TALL TUBE WOODLAND]  1
  7  0.069414  [18PC WOODEN CUTLERY SET DISPOSABLE]  1
  8  0.088937  [20 DOLLY PEGS RETROSPOT]  1
  9  0.208243  [3 PIECE SPACEBOY COOKIE CUTTER SET]  1
 10  0.078091  [36 DOILIES DOLLY GIRL]  1
 11  0.659436  [36 PENCILS TUBE RED RETROSPOT]  1
 12  0.119306  [36 PENCILS TUBE SKULLS]  1
 13  0.312364  [36 PENCILS TUBE WOODLAND]  1
 14  0.084599  [3D VINTAGE CHRISTMAS STICKERS]  1
 15  0.078091  [4 IVORY DINNER CANDLES SILVER FLOCK]  1
 16  0.904555  [4 TRADITIONAL SPINNING TOPS]  1
 17  0.104121  [5 HOOK HANGER RED MAGIC TOADSTOOL]  1
 18  0.236443  [6 GIFT TAGS 50'S CHRISTMAS]  1
 19  0.340564  [6 GIFT TAGS VINTAGE CHRISTMAS]  1
 20  0.203905  [6 RIBBONS RUSTIC CHARM]  1
 21  0.468547  [60 CAKE CASES DOLLY GIRL DESIGN]  1
 22  0.104121  [60 CAKE CASES VINTAGE CHRISTMAS]  1
 23  0.624729  [60 TEATIME FAIRY CAKE CASES]  1
 24  0.260304  [72 SWEETHEART FAIRY CAKE CASES]  1
 25  0.121475  [ABC TREASURE BOOK BOX]  1
 26  0.147505  [ALARM CLOCK BAKELIKE CHOCOLATE]  1
 27  0.735358  [ALARM CLOCK BAKELIKE GREEN]  1
 28  0.173536  [ALARM CLOCK BAKELIKE IVORY]  1
 29  0.208243  [ALARM CLOCK BAKELIKE ORANGE]  1
...       ...  ...  ...
662  0.086768  [PLASTERS IN TIN CIRCUS PARADE, PLASTERS IN TI...  2
663  0.125813  [PLASTERS IN TIN CIRCUS PARADE, POSTAGE]  2
664  0.088937  [PLASTERS IN TIN SPACEBOY, PLASTERS IN TIN WOO...  2
665  0.097614  [PLASTERS IN TIN SPACEBOY, POSTAGE]  2
666  0.117137  [PLASTERS IN TIN WOODLAND ANIMALS, POSTAGE]  2
667  0.140998  [POSTAGE, RABBIT NIGHT LIGHT]  2
668  0.069414  [POSTAGE, RED RETROSPOT CHARLOTTE BAG]  2
669  0.097614  [POSTAGE, RED RETROSPOT MINI CASES]  2
670  0.134490  [POSTAGE, RED TOADSTOOL LED NIGHT LIGHT]  2
671  0.091106  [POSTAGE, REGENCY CAKESTAND 3 TIER]  2
672  0.080260  [POSTAGE, ROUND SNACK BOXES SET OF 4 FRUITS]  2
673  0.125813  [POSTAGE, ROUND SNACK BOXES SET OF4 WOODLAND]  2
674  0.093275  [POSTAGE, SET/20 RED RETROSPOT PAPER NAPKINS]  2
675  0.099783  [POSTAGE, SET/6 RED SPOTTY PAPER CUPS]  2
676  0.091106  [POSTAGE, SET/6 RED SPOTTY PAPER PLATES]  2
677  0.082430  [POSTAGE, SPACEBOY LUNCH BOX]  2
678  0.097614  [POSTAGE, STRAWBERRY LUNCH BOX WITH CUTLERY]  2
679  0.075922  [POSTAGE, TEA PARTY BIRTHDAY CARD]  2
680  0.086768  [SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...  2
681  0.086768  [SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...  2
682  0.104121  [SET/6 RED SPOTTY PAPER CUPS, SET/6 RED SPOTTY...  2
683  0.060738  [ALARM CLOCK BAKELIKE GREEN, ALARM CLOCK BAKEL...  3
684  0.062907  [PLASTERS IN TIN CIRCUS PARADE, PLASTERS IN TI...  3
685  0.071584  [PLASTERS IN TIN CIRCUS PARADE, PLASTERS IN TI...  3
686  0.071584  [PLASTERS IN TIN SPACEBOY, PLASTERS IN TIN WOO...  3
687  0.071584  [POSTAGE, SET/20 RED RETROSPOT PAPER NAPKINS, ...  3
688  0.071584  [POSTAGE, SET/20 RED RETROSPOT PAPER NAPKINS, ...  3
689  0.086768  [POSTAGE, SET/6 RED SPOTTY PAPER CUPS, SET/6 R...  3
690  0.084599  [SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...  3
691  0.069414  [POSTAGE, SET/20 RED RETROSPOT PAPER NAPKINS, ...  4
Next, let’s keep only the item combinations that contain 2 or more items.
frequent_itemsets[ (frequent_itemsets['length'] >= 2) ]

      support  itemsets  length
643  0.062907  [ALARM CLOCK BAKELIKE GREEN, ALARM CLOCK BAKEL...  2
644  0.067245  [ALARM CLOCK BAKELIKE GREEN, ALARM CLOCK BAKEL...  2
645  0.071584  [ALARM CLOCK BAKELIKE GREEN, POSTAGE]  2
646  0.062907  [ALARM CLOCK BAKELIKE PINK, ALARM CLOCK BAKELI...  2
647  0.075922  [ALARM CLOCK BAKELIKE PINK, POSTAGE]  2
648  0.073753  [ALARM CLOCK BAKELIKE RED, POSTAGE]  2
649  0.062907  [DOLLY GIRL LUNCH BOX, POSTAGE]  2
650  0.060738  [DOLLY GIRL LUNCH BOX, SPACEBOY LUNCH BOX]  2
651  0.069414  [JUMBO BAG RED RETROSPOT, POSTAGE]  2
652  0.065076  [JUMBO BAG WOODLAND ANIMALS, POSTAGE]  2
653  0.088937  [LUNCH BAG APPLE DESIGN, POSTAGE]  2
654  0.104121  [LUNCH BAG RED RETROSPOT, POSTAGE]  2
655  0.078091  [LUNCH BAG SPACEBOY DESIGN, POSTAGE]  2
656  0.086768  [LUNCH BAG WOODLAND, POSTAGE]  2
657  0.097614  [LUNCH BOX WITH CUTLERY RETROSPOT, POSTAGE]  2
658  0.069414  [MINI PAINT SET VINTAGE, POSTAGE]  2
659  0.071584  [PACK OF 72 RETROSPOT CAKE CASES, POSTAGE]  2
660  0.060738  [PAPER BUNTING RETROSPOT, POSTAGE]  2
661  0.078091  [PLASTERS IN TIN CIRCUS PARADE, PLASTERS IN TI...  2
662  0.086768  [PLASTERS IN TIN CIRCUS PARADE, PLASTERS IN TI...  2
663  0.125813  [PLASTERS IN TIN CIRCUS PARADE, POSTAGE]  2
664  0.088937  [PLASTERS IN TIN SPACEBOY, PLASTERS IN TIN WOO...  2
665  0.097614  [PLASTERS IN TIN SPACEBOY, POSTAGE]  2
666  0.117137  [PLASTERS IN TIN WOODLAND ANIMALS, POSTAGE]  2
667  0.140998  [POSTAGE, RABBIT NIGHT LIGHT]  2
668  0.069414  [POSTAGE, RED RETROSPOT CHARLOTTE BAG]  2
669  0.097614  [POSTAGE, RED RETROSPOT MINI CASES]  2
670  0.134490  [POSTAGE, RED TOADSTOOL LED NIGHT LIGHT]  2
671  0.091106  [POSTAGE, REGENCY CAKESTAND 3 TIER]  2
672  0.080260  [POSTAGE, ROUND SNACK BOXES SET OF 4 FRUITS]  2
673  0.125813  [POSTAGE, ROUND SNACK BOXES SET OF4 WOODLAND]  2
674  0.093275  [POSTAGE, SET/20 RED RETROSPOT PAPER NAPKINS]  2
675  0.099783  [POSTAGE, SET/6 RED SPOTTY PAPER CUPS]  2
676  0.091106  [POSTAGE, SET/6 RED SPOTTY PAPER PLATES]  2
677  0.082430  [POSTAGE, SPACEBOY LUNCH BOX]  2
678  0.097614  [POSTAGE, STRAWBERRY LUNCH BOX WITH CUTLERY]  2
679  0.075922  [POSTAGE, TEA PARTY BIRTHDAY CARD]  2
680  0.086768  [SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...  2
681  0.086768  [SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...  2
682  0.104121  [SET/6 RED SPOTTY PAPER CUPS, SET/6 RED SPOTTY...  2
683  0.060738  [ALARM CLOCK BAKELIKE GREEN, ALARM CLOCK BAKEL...  3
684  0.062907  [PLASTERS IN TIN CIRCUS PARADE, PLASTERS IN TI...  3
685  0.071584  [PLASTERS IN TIN CIRCUS PARADE, PLASTERS IN TI...  3
686  0.071584  [PLASTERS IN TIN SPACEBOY, PLASTERS IN TIN WOO...  3
687  0.071584  [POSTAGE, SET/20 RED RETROSPOT PAPER NAPKINS, ...  3
688  0.071584  [POSTAGE, SET/20 RED RETROSPOT PAPER NAPKINS, ...  3
689  0.086768  [POSTAGE, SET/6 RED SPOTTY PAPER CUPS, SET/6 R...  3
690  0.084599  [SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...  3
691  0.069414  [POSTAGE, SET/20 RED RETROSPOT PAPER NAPKINS, ...  4
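To see why lowering the minimum support brings multi-item combinations into the result, here is a small pure-Python sketch on made-up transactions. It counts every candidate itemset by brute force, so it is not the apriori algorithm itself (which prunes candidates level by level), and the function name frequent_itemsets_bruteforce is hypothetical.

```python
from itertools import combinations

# Hypothetical transactions, for illustration only.
transactions = [
    {"cups", "plates", "napkins"},
    {"cups", "plates"},
    {"cups", "napkins"},
    {"cups"},
    {"plates"},
]


def frequent_itemsets_bruteforce(transactions, min_support, max_len=3):
    """Return {itemset: support} for every itemset meeting min_support.

    Brute force over all combinations, so only suitable for toy data.
    """
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, max_len + 1):
        for candidate in combinations(items, size):
            support = sum(set(candidate) <= t for t in transactions) / n
            if support >= min_support:
                result[candidate] = support
    return result


high = frequent_itemsets_bruteforce(transactions, min_support=0.5)
low = frequent_itemsets_bruteforce(transactions, min_support=0.15)

print(high)  # only single items are frequent enough
print(low)   # pairs and the triple now pass the threshold as well
```

A combination can never be more frequent than its least frequent member, so a high threshold tends to leave only single items, as seen with 0.6 above.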
Now it’s quite easy to generate association rules with MLxtend. The association_rules function takes two main arguments: metric, which defines whether the rules are scored by “confidence” or “lift”, and min_threshold, which sets the minimum level for that metric. Let’s try building some rules with confidence as the metric and a minimum threshold of 0.5.
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
rules.head()
As seen in the above output, the minimum confidence level starts at 0.5. Now let’s create rules with lift as the metric and a minimum threshold level of 1.
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()
As seen in the above output, the minimum lift level starts above 1. Since we have obtained the rules, it’s quite easy to filter out rules with a desired number for lift and confidence.
rules[ (rules['lift'] >= 5) & (rules['confidence'] >= 0.5)]
This method can be used to recommend products and is useful for understanding which product combinations have a high lift.
Conclusion
As you can see in this simple demonstration of market basket analysis using Python, it’s easy to form association rules with the MLxtend package. The data used in this article is relatively small, but it is enough to illustrate the main concepts of market basket analysis. Now it’s up to you to go ahead and start working on it.