Locality Sensitive Hashing
Last updated on 2024-07-26 | Edit this page
Overview
Questions
- How does Linux come to be?
Objectives
- Explain the historical development of Linux
Locality Sensitive Hashing
1. Applications of set-similarity
Many data mining problems can be expressed as finding
similar
sets:
- Pages with similar words, e.g., for classification by topic.
- NetFlix users with similar tastes in movies, for recommendation systems.
- Dual: movies with similar sets of fans.
- Entity resolution.
2. Application: similar documents
Given a body of documents, e.g., the Web, find pairs of documents with a lot of text in common, such as:
- Mirror sites, or approximate mirrors.
- Application: Don’t want to show both in a search.
- Plagiarism, including large quotations.
- Similar news articles at many news sites.
- Application: Cluster articles by
same story
: topic modeling, trend identification.
- Application: Cluster articles by
3. Three essential techniques for similar documents
- Shingling : convert documents, emails, etc., to sets.
- Minhashing : convert large sets to short signatures, while preserving similarity.
- Locality sensitive hashing : focus on pairs of signatures likely to be similar.

4. Shingles
- A k-shingle (or k-gram) for a document is a sequence of k characters that appears in the document.
- Example: k = 2; doc = abcab. Set of 2-shingles: {ab, bc, ca}.
- Represent a doc by its set of k-shingles.
- Documents that are intuitively similar will have many shingles in common.
- Changing a word only affects k-shingles within distance
k
from the word. - Reordering paragraphs only affects the
2k
shingles that cross paragraph boundaries. - Example: k=3,
The dog which chased the cat
versusThe dog that chased the cat
.- Only 3-shingles replaced are
g_w
,_wh
,whi
,hic
,ich
,ch_
, andh_c
.
- Only 3-shingles replaced are
5. Minhashing: Jaccard Similarity
- The
Jaccard similiarity
of two sets is the size of their intersection divided by the size of their union.

- Convert from sets to boolean matrices.
- Rows: elements of the universal set. In other words, all elements in the union.
- Columns: individual sets.
- A cell value of
1
in row e and column S if and only if e is a member of S. - A cell value of
0
otherwise. - Column similarity is the Jaccard similarity of the sets of their
rows that have the value of
1
. - Typically sparse.
- This gives you another way to calculate similarity: column similarity = Jaccard similarity.

- Generally speaking, given two columns, rows maybe classified as:
- a: 1 1
- b: 1 0
- c: 0 1
- d: 0 0
- Sim(C1, C2) = a/(a+b+c)
6. Minhashing
- Imagine the rows permuted randomly.
- Define
minhash
function h(C) = the number of the first (in the permuted order) row in which column C has 1. - Use several (e.g., 100) independent hash functions to create a signature for each column.
- The signatures can be displayed in another matrix called the signature matrix
- The signature matrix has
- its columns represent the sets and
- the rows represent the
minhash
values, in order for that column.

6. Minhashing: surprising property
- The probability (over all permutations of the rows) that h(C1) = h(C2) is the same as Sim(C1).
- Both are a/(a+b+c)!
- The similarity of signatures is the fraction of the minhash functions in which they agree.
- The expected similarity of two signatures equals the Jaccard
similarity of the columns.
- The longer the signatures, the smaller the expected error will be.

8. Hands-on Minhashing: implementation
- Can’t realistically permute billion of rows:
- Too many permutation entries to store.
- Random access on big data (big no no).
- Too many permutation entries to store.
- How to calculate hashes in sequential order?
- Pick approximately 100 hash functions (100 permutations).
- Go through the data set row by row.
- For each row r, for each hash function i,
- Maintain a variable M(i,c) which will maintain the smallest value value of hi(r) for which column c has 1 in row r.

9. Minhashing:
- Create three additional row permutations, update the signature matrix, and recalculate the signature similarity.
- Does the signature similarity become closer to the column similarity?
10. Locality-sensitive-hashing (LSH)
- Generate from the collection of signatures a list of candidate pairs: pairs of elements where similarity must be evaluated.
- For signature matrices: hash columns to many buckets and make
elements of the same bucket candidate pairs.
- Pick a similarity threshold
t
, a fraction < 1. - We want a pair of columns
c
andd
of the signature matrix M to be acandidate pair
if and only if their signatures agree in at least fractiont
of the rows.
- Pick a similarity threshold
11. LSH
- Big idea: hash columns of signature matrix M several times and arrange that only similar columns are likely to hash to the same bucket.
- Reality: we don’t need to study the entire column.

- Divide matrix M into
b
bands ofr
rows each. - For each band, hash its portion of each column to a hash table with
k
buckets, withk
as large as possible. -
Candidate
column pairs are those that hash to the same bucket for at least one band. - Fine tune
b
andr
. - We will not go into the math here …
12. Hands on LSH
- Download the set of inaugural speeches from https://www.cs.wcupa.edu/lngo/data/inaugural_speeches.zip.
- Launch a Spark notebook called
spark-6.ipynb
and create the two initial setup cells.
{% include links.md %}