Locality Sensitive Hashing

Overview

Shingling : convert documents, emails, etc., to sets.
Minhashing : convert large sets to short signatures, while preserving similarity.
Locality sensitive hashing : focus on pairs of signatures likely to be similar.

Shingles

A k-shingle (or k-gram) for a document is a sequence of k characters that appears in the document.
Example: k = 2; doc = abcab. Set of 2-shingles: {ab, bc, ca}.
Represent a doc by its set of k-shingles.
Documents that are intuitively similar will have many shingles in common.
Changing a word only affects k-shingles within distance k from the word.
Reordering paragraphs only affects the 2k shingles that cross paragraph boundaries.
Example: k=3, The dog which chased the cat versus The dog that chased the cat.
- Only 3-shingles replaced are g_w, _wh, whi, hic, ich, ch_, and h_c.

Minhashing: Jaccard Similarity

The Jaccard similiarity of two sets is the size of their intersection divided by the size of their union.

Convert from sets to boolean matrices.
- Rows: elements of the universal set. In other words, all elements in the union.
- Columns: individual sets.
- A cell value of 1 in row e and column S if and only if e is a member of S.
- A cell value of 0 otherwise.
- Column similarity is the Jaccard similarity of the sets of their rows that have the value of 1.
- Typically sparse.
This gives you another way to calculate similarity: column similarity = Jaccard similarity.

Generally speaking, given two columns, rows maybe classified as:
- a: 1 1
- b: 1 0
- c: 0 1
- d: 0 0
Sim(C₁, C₂) = a/(a+b+c)

Minhashing: signature matrix

Imagine the rows permuted randomly.
Define minhash function h(C) = the number of the first (in the permuted order) row in which column C has 1.
Use several (e.g., 100) independent hash functions to create a signature for each column.
The signatures can be displayed in another matrix called the signature matrix
The signature matrix has
- its columns represent the sets and
- the rows represent the minhash values, in order for that column.
Use the permuted row index to map to the original values!
- 1 is permuted to 5, therefore H(1) is 0: C1(5).
- 2 is permuted to 1, therefore H(2) is 1: C1(1).
- 3 is permuted to 2, therefore H(3) is 1: C1(2).
- 4 is permuted to 7, therefore H(4) is 1: C1(7).
- 5 is permuted to 6, therefore H(5) is 1: C1(6).
- 6 is permuted to 4, therefore H(6) is 0: C1(4).
- 7 is permuted to 3, therefore H(7) is 0: C1(3).

Index	C1	C2	C3	C4	H1	H1(C1)	H1(C2)	H1(C3)	H1(C4)
1	1	0	1	0	2	0	1	0	1
2	1	0	0	1	3	1	0	1	0
3	0	1	0	1	7	1	0	0	1
4	0	1	0	1	6	1	0	1	0
5	0	1	0	1	1	1	0	1	0
6	1	0	1	0	5	0	1	0	1
7	1	0	1	0	4	0	1	0	1

The first ones of H1(C1), H1(C2), H1(C3), and H1(C4) are therefore 2, 1, 2, 1. This is the signature row of H1.
Challenge: Perform the calculation of the signature rows for H2 and H3.
- H2’s permutation: (4, 2, 1, 3, 6, 7, 5)
- H3’s permutation: (1, 3, 7, 6, 2, 5, 4).

H2	H2(C1)	H2(C2)	H2(C3)	H2(C4)	H3	H3(C1)	H3(C2)	H3(C3)	H3(C4)
4	0	1	0	1	1	1	0	1	0
2	1	0	0	1	3	0	1	0	1
1	0	0	0	1	7	1	0	0	1
3	1	0	1	0	6	1	0	1	0
6	1	1	1	0	2	1	0	1	0
7	0	1	0	1	5	0	1	0	1
5	1	0	1	0	4	0	1	0	1

Minhashing: signature matrix transformation demo

An alternative representation of the boolean matrix with the cells for C1, C2, C3, C4 have more descriptive values other than 0 and 1 to help demonstrate the data movement.
This update is based on recommendation of S. Whoriskey (Fall 2022).
Use the permuted row index to map to the original values!
- 1 is permuted to 5, therefore H(1) is 0: C1(5).
- 2 is permuted to 1, therefore H(2) is 1: C1(1).
- 3 is permuted to 2, therefore H(3) is 1: C1(2).
- 4 is permuted to 7, therefore H(4) is 1: C1(7).
- 5 is permuted to 6, therefore H(5) is 1: C1(6).
- 6 is permuted to 4, therefore H(6) is 0: C1(4).
- 7 is permuted to 3, therefore H(7) is 0: C1(3).

Index	C1	C2	C3	C4	H1 Permutation	H1 Index	H1(C1)	H1(C2)	H1(C3)	H1(C4)
1	11	21	31	41	2	1	15	25	35	45
2	12	22	32	42	3	2	11	21	31	41
3	13	23	33	43	7	3	12	22	32	42
4	14	24	34	44	6	4	17	27	37	47
5	15	25	35	45	1	5	16	26	36	46
6	16	26	36	46	5	6	14	24	34	44
7	17	27	37	47	4	7	13	23	33	43

The first ones of H1(C1), H1(C2), H1(C3), and H1(C4) are therefore 2, 1, 2, 1. This is the signature row of H1.

Minhashing: actual implementation

Can’t realistically permute billion of rows:
- Too many permutation entries to store.
- Random access on big data (big no no).
How to calculate hashes in sequential order?
- Pick approximately 100 hash functions (100 permutations).
- Go through the data set row by row.
- For each row r, for each hash function i,
  - Maintain a variable M(i,c) which will maintain the smallest value value of \(h_{i}(r)\) for which column c has 1 in row r.
Example: two hash functions:
- i: row index
- H1: \(i\ mod\ 5\)
- H2: \((2i + 1)\ mod\ 5\)
- All signature matrix values start out as positive infinity.
  - If \(C_i==1\), calculate hash of \(i\) and replace signature matrix value if result is smaller.

i	C1	C2	C3	C4	Hash	H(C1)	H(C2)	H(C3)	H(C4)
1	1	0	1	0	\(1\ mod\ 5\)	1	\(\inf\)	1	\(\inf\)
					\(3\ mod\ 5\)	3	\(\inf\)	3	\(\inf\)

2	1	0	0	1	\(2\ mod\ 5\)	1(2)	\(\inf\)	1	2
					\(5\ mod\ 5\)	0(3)	\(\inf\)	3	0

3	0	1	0	1	\(3\ mod\ 5\)	1	2	1	2(3)
					\(7\ mod\ 5\)	3	1	3	0(2)

4	0	1	0	1	\(4\ mod\ 5\)	1	2(4)	1	2(4)
					\(9\ mod\ 5\)	3	1(4)	3	0(4)

5	0	1	0	1	\(5\ mod\ 5\)	1	0(2)	1	0(0)
					\(11\ mod\ 5\)	3	1(1)	3	0(1)

6	1	0	1	0	\(6\ mod\ 5\)	1(1)	0	1(1)	0
					\(13\ mod\ 5\)	3(3)	1	3(3)	0

7	1	0	1	0	\(7\ mod\ 5\)	1(2)	0	1(2)	0
					\(14\ mod\ 5\)	3(4)	1	3(4)	0

Minharshing: exercise

Create three additional row permutations, update the signature matrix, and recalculate the signature similarity.
Does the signature similarity become closer to the column similarity?

Locality-sensitive-hashing (LSH)

Generate from the collection of signatures a list of candidate pairs: pairs of elements where similarity must be evaluated.
For signature matrices: hash columns to many buckets and make elements of the same bucket candidate pairs.
- Pick a similarity threshold t, a fraction < 1.
- We want a pair of columns c and d of the signature matrix M to be a candidate pair if and only if their signatures agree in at least fraction t of the rows.

LSH implementation

Big idea: hash columns of signature matrix M several times and arrange that only similar columns are likely to hash to the same bucket.
Reality: we don’t need to study the entire column.

Divide matrix M into b bands of r rows each.
For each band, hash its portion of each column to a hash table with k buckets, with k as large as possible.
Candidate column pairs are those that hash to the same bucket for at least one band.
Fine tune b and r.
We will not go into the math here …

Big Data Engineering

Index	C1	C2	C3	C4	H1	H1(C1)	H1(C2)	H1(C3)	H1(C4)
1	1	0	1	0	2	0	1	0	1
2	1	0	0	1	3	1	0	1	0
3	0	1	0	1	7	1	0	0	1
4	0	1	0	1	6	1	0	1	0
5	0	1	0	1	1	1	0	1	0
6	1	0	1	0	5	0	1	0	1
7	1	0	1	0	4	0	1	0	1

H2	H2(C1)	H2(C2)	H2(C3)	H2(C4)	H3	H3(C1)	H3(C2)	H3(C3)	H3(C4)
4	0	1	0	1	1	1	0	1	0
2	1	0	0	1	3	0	1	0	1
1	0	0	0	1	7	1	0	0	1
3	1	0	1	0	6	1	0	1	0
6	1	1	1	0	2	1	0	1	0
7	0	1	0	1	5	0	1	0	1
5	1	0	1	0	4	0	1	0	1

Index	C1	C2	C3	C4	H1 Permutation	H1 Index	H1(C1)	H1(C2)	H1(C3)	H1(C4)
1	11	21	31	41	2	1	15	25	35	45
2	12	22	32	42	3	2	11	21	31	41
3	13	23	33	43	7	3	12	22	32	42
4	14	24	34	44	6	4	17	27	37	47
5	15	25	35	45	1	5	16	26	36	46
6	16	26	36	46	5	6	14	24	34	44
7	17	27	37	47	4	7	13	23	33	43

Index	C1	C2	C3	C4	H1	H1(C1)	H1(C2)	H1(C3)	H1(C4)
1	1	0	1	0	2	0	1	0	1
2	1	0	0	1	3	1	0	1	0
3	0	1	0	1	7	1	0	0	1
4	0	1	0	1	6	1	0	1	0
5	0	1	0	1	1	1	0	1	0
6	1	0	1	0	5	0	1	0	1
7	1	0	1	0	4	0	1	0	1

H2	H2(C1)	H2(C2)	H2(C3)	H2(C4)	H3	H3(C1)	H3(C2)	H3(C3)	H3(C4)
4	0	1	0	1	1	1	0	1	0
2	1	0	0	1	3	0	1	0	1
1	0	0	0	1	7	1	0	0	1
3	1	0	1	0	6	1	0	1	0
6	1	1	1	0	2	1	0	1	0
7	0	1	0	1	5	0	1	0	1
5	1	0	1	0	4	0	1	0	1

Index	C1	C2	C3	C4	H1 Permutation	H1 Index	H1(C1)	H1(C2)	H1(C3)	H1(C4)
1	11	21	31	41	2	1	15	25	35	45
2	12	22	32	42	3	2	11	21	31	41
3	13	23	33	43	7	3	12	22	32	42
4	14	24	34	44	6	4	17	27	37	47
5	15	25	35	45	1	5	16	26	36	46
6	16	26	36	46	5	6	14	24	34	44
7	17	27	37	47	4	7	13	23	33	43

Index	C1	C2	C3	C4	H1	H1(C1)	H1(C2)	H1(C3)	H1(C4)
1	1	0	1	0	2	0	1	0	1
2	1	0	0	1	3	1	0	1	0
3	0	1	0	1	7	1	0	0	1
4	0	1	0	1	6	1	0	1	0
5	0	1	0	1	1	1	0	1	0
6	1	0	1	0	5	0	1	0	1
7	1	0	1	0	4	0	1	0	1

H2	H2(C1)	H2(C2)	H2(C3)	H2(C4)	H3	H3(C1)	H3(C2)	H3(C3)	H3(C4)
4	0	1	0	1	1	1	0	1	0
2	1	0	0	1	3	0	1	0	1
1	0	0	0	1	7	1	0	0	1
3	1	0	1	0	6	1	0	1	0
6	1	1	1	0	2	1	0	1	0
7	0	1	0	1	5	0	1	0	1
5	1	0	1	0	4	0	1	0	1

Index	C1	C2	C3	C4	H1 Permutation	H1 Index	H1(C1)	H1(C2)	H1(C3)	H1(C4)
1	11	21	31	41	2	1	15	25	35	45
2	12	22	32	42	3	2	11	21	31	41
3	13	23	33	43	7	3	12	22	32	42
4	14	24	34	44	6	4	17	27	37	47
5	15	25	35	45	1	5	16	26	36	46
6	16	26	36	46	5	6	14	24	34	44
7	17	27	37	47	4	7	13	23	33	43