Locality Sensitive Hashing#

1. Applications#

Set-similarity

Many data mining problems can be expressed as finding similar sets:

  • Pages with similar words, e.g., for classification by topic.

  • NetFlix users with similar tastes in movies, for recommendation systems.

  • Dual: movies with similar sets of fans.

  • Entity resolution.

Similar documents

Given a body of documents, e.g., the Web, find pairs of documents with a lot of text in common, such as:

  • Mirror sites, or approximate mirrors.

    • Application: Don’t want to show both in a search.

  • Plagiarism, including large quotations.

  • Similar news articles at many news sites.

    • Application: Cluster articles by same story: topic modeling, trend identification.

Collaborative filtering
  • Recmmend to users items that were liked by other users who have exhibited similar tastes (online recommendation - recommendation systems).

  • Combine set-similarities with other techniques, such as clustering.

  • Example: Amazon

    • Customers with similar set of purchased items.

    • Products with similar sets of purchasers.

  • Example: Netflix

    • Movies are similar if they were rented or highly rated by the same customers.

    • Customers are similar if they rented or rated highly same movies.

    • Additional filtering options:

      • Ignore low-rated customer/movie pairs.

      • Divide high/low into two buckets of similarity.

2. Three essential techniques for similar documents#

Overview
  • Shingling : convert documents, emails, etc., to sets.

  • Minhashing : convert large sets to short signatures, while preserving similarity.

  • Locality sensitive hashing : focus on pairs of signatures likely to be similar.

../_images/013.png
Shingles
  • A k-shingle (or k-gram) for a document is a sequence of k characters that appears in the document.

  • Example: k = 2; doc = abcab. Set of 2-shingles: {ab, bc, ca}.

  • Represent a doc by its set of k-shingles.

  • Documents that are intuitively similar will have many shingles in common.

  • Changing a word only affects k-shingles within distance k from the word.

  • Reordering paragraphs only affects the 2k shingles that cross paragraph boundaries.

  • Example: k=3, The dog which chased the cat versus The dog that chased the cat.

    • Only 3-shingles replaced are g_w, _wh, whi, hic, ich, ch_, and h_c.

Minhashing: Jaccard Similarity
  • The Jaccard similiarity of two sets is the size of their intersection divided by the size of their union.

../_images/023.png
  • Convert from sets to boolean matrices.

    • Rows: elements of the universal set. In other words, all elements in the union.

    • Columns: individual sets.

    • A cell value of 1 in row e and column S if and only if e is a member of S.

    • A cell value of 0 otherwise.

    • Column similarity is the Jaccard similarity of the sets of their rows that have the value of 1.

    • Typically sparse.

  • This gives you another way to calculate similarity: column similarity = Jaccard similarity.

../_images/034.png
  • Generally speaking, given two columns, rows maybe classified as:

    • a: 1 1

    • b: 1 0

    • c: 0 1

    • d: 0 0

  • Sim(C1, C2) = a/(a+b+c)

Minhashing: signature matrix
  • Imagine the rows permuted randomly.

  • Define minhash function h(C) = the number of the first (in the permuted order) row in which column C has 1.

  • Use several (e.g., 100) independent hash functions to create a signature for each column.

  • The signatures can be displayed in another matrix called the signature matrix

  • The signature matrix has

    • its columns represent the sets and

    • the rows represent the minhash values, in order for that column.

  • Use the permuted row index to map to the original values!

    • 1 is permuted to 5, therefore H(1) is 0: C1(5).

    • 2 is permuted to 1, therefore H(2) is 1: C1(1).

    • 3 is permuted to 2, therefore H(3) is 1: C1(2).

    • 4 is permuted to 7, therefore H(4) is 1: C1(7).

    • 5 is permuted to 6, therefore H(5) is 1: C1(6).

    • 6 is permuted to 4, therefore H(6) is 0: C1(4).

    • 7 is permuted to 3, therefore H(7) is 0: C1(3).

Index

C1

C2

C3

C4

H1

H1(C1)

H1(C2)

H1(C3)

H1(C4)

1

1

0

1

0

2

0

1

0

1

2

1

0

0

1

3

1

0

1

0

3

0

1

0

1

7

1

0

0

1

4

0

1

0

1

6

1

0

1

0

5

0

1

0

1

1

1

0

1

0

6

1

0

1

0

5

0

1

0

1

7

1

0

1

0

4

0

1

0

1

  • The first ones of H1(C1), H1(C2), H1(C3), and H1(C4) are therefore 2, 1, 2, 1. This is the signature row of H1.

  • Challenge: Perform the calculation of the signature rows for H2 and H3.

    • H2’s permutation: (4, 2, 1, 3, 6, 7, 5)

    • H3’s permutation: (1, 3, 7, 6, 2, 5, 4).

Solution

2

1

2

1

2

1

4

1

1

2

1

1

H2

H2(C1)

H2(C2)

H2(C3)

H2(C4)

H3

H3(C1)

H3(C2)

H3(C3)

H3(C4)

4

0

1

0

1

1

1

0

1

0

2

1

0

0

1

3

0

1

0

1

1

0

0

0

1

7

1

0

0

1

3

1

0

1

0

6

1

0

1

0

6

1

1

1

0

2

1

0

1

0

7

0

1

0

1

5

0

1

0

1

5

1

0

1

0

4

0

1

0

1

../_images/044.png
Minhashing: signature matrix transformation demo
  • An alternative representation of the boolean matrix with the cells for C1, C2, C3, C4 have more descriptive values other than 0 and 1 to help demonstrate the data movement.

  • This update is based on recommendation of S. Whoriskey (Fall 2022).

  • Use the permuted row index to map to the original values!

    • 1 is permuted to 5, therefore H(1) is 0: C1(5).

    • 2 is permuted to 1, therefore H(2) is 1: C1(1).

    • 3 is permuted to 2, therefore H(3) is 1: C1(2).

    • 4 is permuted to 7, therefore H(4) is 1: C1(7).

    • 5 is permuted to 6, therefore H(5) is 1: C1(6).

    • 6 is permuted to 4, therefore H(6) is 0: C1(4).

    • 7 is permuted to 3, therefore H(7) is 0: C1(3).

Index

C1

C2

C3

C4

H1 Permutation

H1 Index

H1(C1)

H1(C2)

H1(C3)

H1(C4)

1

11

21

31

41

2

1

15

25

35

45

2

12

22

32

42

3

2

11

21

31

41

3

13

23

33

43

7

3

12

22

32

42

4

14

24

34

44

6

4

17

27

37

47

5

15

25

35

45

1

5

16

26

36

46

6

16

26

36

46

5

6

14

24

34

44

7

17

27

37

47

4

7

13

23

33

43

  • The first ones of H1(C1), H1(C2), H1(C3), and H1(C4) are therefore 2, 1, 2, 1. This is the signature row of H1.

Minhashing: signature matrix and Jaccard similarity
  • The probability (over all permutations of the rows) that h(C1) = h(C2) is the same as Sim(C1).

  • Both are a/(a+b+c)!

  • The similarity of signatures is the fraction of the minhash functions in which they agree.

  • The expected similarity of two signatures equals the Jaccard similarity of the columns.

    • The longer the signatures, the smaller the expected error will be.

../_images/053.png
Minhashing: actual implementation
  • Can’t realistically permute billion of rows:

    • Too many permutation entries to store.

    • Random access on big data (big no no).

  • How to calculate hashes in sequential order?

    • Pick approximately 100 hash functions (100 permutations).

    • Go through the data set row by row.

    • For each row r, for each hash function i,

      • Maintain a variable M(i,c) which will maintain the smallest value value of \(h_{i}(r)\) for which column c has 1 in row r.

  • Example: two hash functions:

    • i: row index

    • H1: \(i\ mod\ 5\)

    • H2: \((2i + 1)\ mod\ 5\)

    • All signature matrix values start out as positive infinity.

      • If \(C_i==1\), calculate hash of \(i\) and replace signature matrix value if result is smaller.

i

C1

C2

C3

C4

Hash

H(C1)

H(C2)

H(C3)

H(C4)

1

1

0

1

0

\(1\ mod\ 5\)

1

\(\inf\)

1

\(\inf\)

\(3\ mod\ 5\)

3

\(\inf\)

3

\(\inf\)

2

1

0

0

1

\(2\ mod\ 5\)

1(2)

\(\inf\)

1

2

\(5\ mod\ 5\)

0(3)

\(\inf\)

3

0

3

0

1

0

1

\(3\ mod\ 5\)

1

2

1

2(3)

\(7\ mod\ 5\)

3

1

3

0(2)

4

0

1

0

1

\(4\ mod\ 5\)

1

2(4)

1

2(4)

\(9\ mod\ 5\)

3

1(4)

3

0(4)

5

0

1

0

1

\(5\ mod\ 5\)

1

0(2)

1

0(0)

\(11\ mod\ 5\)

3

1(1)

3

0(1)

6

1

0

1

0

\(6\ mod\ 5\)

1(1)

0

1(1)

0

\(13\ mod\ 5\)

3(3)

1

3(3)

0

7

1

0

1

0

\(7\ mod\ 5\)

1(2)

0

1(2)

0

\(14\ mod\ 5\)

3(4)

1

3(4)

0

../_images/062.png
Minharshing: exercise
  • Create three additional row permutations, update the signature matrix, and recalculate the signature similarity.

  • Does the signature similarity become closer to the column similarity?

Locality-sensitive-hashing (LSH)
  • Generate from the collection of signatures a list of candidate pairs: pairs of elements where similarity must be evaluated.

  • For signature matrices: hash columns to many buckets and make elements of the same bucket candidate pairs.

    • Pick a similarity threshold t, a fraction < 1.

    • We want a pair of columns c and d of the signature matrix M to be a candidate pair if and only if their signatures agree in at least fraction t of the rows.

LSH implementation
  • Big idea: hash columns of signature matrix M several times and arrange that only similar columns are likely to hash to the same bucket.

  • Reality: we don’t need to study the entire column.

../_images/073.png
  • Divide matrix M into b bands of r rows each.

  • For each band, hash its portion of each column to a hash table with k buckets, with k as large as possible.

  • Candidate column pairs are those that hash to the same bucket for at least one band.

  • Fine tune b and r.

  • We will not go into the math here …

3. Hands-on preparation#