Frequent Itemsets#

1. The market-basket model#

Describe a common form of many-to-many relationship between two kinds of objects.
- A large set of items. e.g.: things sold in a supermarket.
- A large set of baskets, each of which is a small set of items. e.g.: things one customer buys on one trip to the supermarket.
  - Number of baskets cannot fit into memory.

A set of items that appears in many baskets is said to be frequent.
Assume a value s: support threshold.
If I is a set of items.
- The support for I is the number of baskets in which I is a subset.
I is frequent if its support is s or higher.

Items: products; Baskets: sets of products.
- Given that many people buy beer and diapers together: run a sale on diapers and raise price of beer.
- Given that many people buy hotdog and mustards together: run a sale of hotdog and raise price of mustards.
Items = documents; baskets = sentences/phrases.
- Items that appear together too often could represent plagiarism.
Items = words, basket = documents.
- Unusual words appearing together in large number of documents indicating interesting relationship.

Walmart sells hundreds of thousands of items, and has billions of transactions (shopping basket/cart at checkout).
The Web has billions of words and many billions of pages.

If-then rules abou the contents of baskets.
{i_1, i_2, ..., i_k} -> j means: “If a basket contains all of i_1,…,i_k then it is likely to contain j.”
Confidence of this association rule is the probability of j given {i_1,…,i_k}.
- The fraction of the basket with {i_1,…,i_k} that also contain j.
Example:
- B1: m,c,b
- B2: m,p,j
- B3: m,b
- B4: c,j
- B5: m,p,b
- B6: m,c,b,j
- B7: c,b,j
- B8: b,c
An association rule: {m,b} -> c
- Basket contains m and b: B1, B3, B5, B6
- Basket contains m, b, and c: B1, B6
- C = 2 / 4 = 50%

Data is stored in flat files on disk.
Most likely basket-by-basket.
Expand baskets into pairs, triples, etc as you read the baskets.
- Use k nested loops to generate all sets of size k.
I/O cost: per passes (all baskets read).

For many frequent-itemset algorithms, main memory is the critical resource.
We need to keep count of things (occurrences of pairs/triples/…) when we read baskets.
The number of different things we can count is limited by main memory.
Swapping counts is going to be horrible.

Hardest problem is finding frequent pairs because they are the most common.
Read file once, counting in main memory the occurences of each pair.
For each basket of n items, there will be n(n-1)/2 pairs, generated by double-nested loops.
If n^2 exceeds main memory, we fails.

Count all pairs using a triangular matrix.
- Requires 4 bytes per pair for all possible pairs: 2n^2
Keep a table ot triples [i, j, c] with c is the count of pair {i,j}.
- Requires 12 bytes only for pairs with count > 0: 12p with p is the number of pairs that actually occur.

Limit the need for main memory.
Key idea: monotonicity
- If a set of items appears at least s times, so does every subset of this set.
Contrapositive: If an item i does not appear in s baskets, then no pair containing i can appear in s baskets.
A-Priori algorithm:
- Pass 1: read baskets and count the item occurrences. Only keep items that appear at least s times - frequent items.
- Pass 2: read baskets again and only count in main memory only those pairs whose both items were found to be frequent from Pass 1.
- Repeat the process with increasing number of items added to only sets found to be frequent.
- C1 = all items
- In general, L_k are members of C_k with support greater than or equal to s.
- C_(k+1) includes (k+1) sets, each k of which is in L_k.

Under two passes.
SON: Savasere-Omiecinski-Navathe
Adaptable to a distributed data model (mapreduce).
- Repeatedly read small subsets of the baskets into main memory and perform a-priori on these subsets, using a support that is equal to the main support divided by the total numbers of subsets.
- Aggregate all candidate itemsets and determine which are frequent in the entire set.