Frequent Itemsets

Last updated on 2024-07-26 | Edit this page

Estimated time 60 minutes

Overview

Questions

  • How does Linux come to be?

Objectives

  • Explain the historical development of Linux

Frequent Itemsets

1. The market-basket model

  • Describe a common form of many-to-many relationship between two kinds of objects.
    • A large set of items. e.g.: things sold in a supermarket.
    • A large set of baskets, each of which is a small set of items. e.g.: things one customer buys on one trip to the supermarket.
      • Number of baskets cannot fit into memory.

2. Definition of a frequent itemsets.

  • A set of items that appears in many baskets is said to be frequent.
  • Assume a value s: support threshold.
  • If I is a set of items.
    • The support for I is the number of baskets in which I is a subset.
  • I is frequent if its support is s or higher.

3. Example: frequent itemsets

  • items = {milk, coke, pepsi, beer, juice}
    • B1: m,c,b
    • B2: m,p,j
    • B3: m,b
    • B4: c,j
    • B5: m,p,b
    • B6: m,c,b,j
    • B7: c,b,j
    • B8: b,c
  • Support value s = 3 (three baskets)
  • Frequent itemsets:
    • {m}, {c}, {b}, {j}, {m,b}, {b,c}, {c,j}

4. Applications

  • Items: products; Baskets: sets of products.
    • Given that many people buy beer and diapers together: run a sale on diapers and raise price of beer.
    • Given that many people buy hotdog and mustards together: run a sale of hotdog and raise price of mustards.
  • Items = documents; baskets = sentences/phrases.
    • Items that appear together too often could represent plagiarism.
  • Items = words, basket = documents.
    • Unusual words appearing together in large number of documents indicating interesting relationship.

5. Scale of the problem

  • Walmart sells hundreds of thousands of items, and has billions of transactions (shopping basket/cart at checkout).
  • The Web has billions of words and many billions of pages.

6. Association Rules:

  • If-then rules abou the contents of baskets.
  • {i_1, i_2, ..., i_k} -> j means: “If a basket contains all of i_1,…,i_k then it is likely to contain j.”
  • Confidence of this association rule is the probability of j given {i_1,…,i_k}.
    • The fraction of the basket with {i_1,…,i_k} that also contain j.
  • Example:
    • B1: m,c,b
    • B2: m,p,j
    • B3: m,b
    • B4: c,j
    • B5: m,p,b
    • B6: m,c,b,j
    • B7: c,b,j
    • B8: b,c
  • An association rule: {m,b} -> c
    • Basket contains m and b: B1, B3, B5, B6
    • Basket contains m, b, and c: B1, B6
    • C = 2 / 4 = 50%

7. Finding association rules

  • Find all association rules with support >= s and confidence >=c
  • Hard part: funding the frequent itemsets.

8. Computation model

  • Data is stored in flat files on disk.
  • Most likely basket-by-basket.
  • Expand baskets into pairs, triples, etc as you read the baskets.
    • Use k nested loops to generate all sets of size k.
  • I/O cost: per passes (all baskets read).

9. Main-memory bottleneck

  • For many frequent-itemset algorithms, main memory is the critical resource.
  • We need to keep count of things (occurrences of pairs/triples/…) when we read baskets.
  • The number of different things we can count is limited by main memory.
  • Swapping counts is going to be horrible.

10. Naive algorithm

  • Hardest problem is finding frequent pairs because they are the most common.
  • Read file once, counting in main memory the occurences of each pair.
  • For each basket of n items, there will be n(n-1)/2 pairs, generated by double-nested loops.
  • If n^2 exceeds main memory, we fails.

11. Naive algorithm: how do we count.

  • Count all pairs using a triangular matrix.
    • Requires 4 bytes per pair for all possible pairs: 2n^2
  • Keep a table ot triples [i, j, c] with c is the count of pair {i,j}.
    • Requires 12 bytes only for pairs with count > 0: 12p with p is the number of pairs that actually occur.

12. A-Priori algorithm.

  • Limit the need for main memory.
  • Key idea: monotonicity
    • If a set of items appears at least s times, so does every subset of this set.
  • Contrapositive: If an item i does not appear in s baskets, then no pair containing i can appear in s baskets.
  • A-Priori algorithm:
    • Pass 1: read baskets and count the item occurrences. Only keep items that appear at least s times - frequent items.

    • Pass 2: read baskets again and only count in main memory only those pairs whose both items were found to be frequent from Pass 1.

    • Repeat the process with increasing number of items added to only sets found to be frequent.

    • C1 = all items

    • In general, L_k are members of C_k with support greater than or equal to s.

    • C_(k+1) includes (k+1) sets, each k of which is in L_k.

13. A-Priori at scale

  • Under two passes.
  • SON: Savasere-Omiecinski-Navathe
  • Adaptable to a distributed data model (mapreduce).
    • Repeatedly read small subsets of the baskets into main memory and perform a-priori on these subsets, using a support that is equal to the main support divided by the total numbers of subsets.
    • Aggregate all candidate itemsets and determine which are frequent in the entire set.

14. Hands-on: SON on Spark

15. Hands-on: Processing XML data

  • Download the NSF Awards Data between 2011 and 2021.
  • Review the XML Schema
  • What is an interesting question?

{% include links.md %}