Clustering
Last updated on 2024-07-26
Estimated time 60 minutes
Overview
Questions
- What is clustering, and how do we measure similarity between data points?
Objectives
- Explain common clustering approaches, including hierarchical clustering and k-means
Clustering
1. General problem statement:
   - Given a set of data points, with a notion of distance between points, group the points into some number of clusters so that:
     - Members of a cluster are close/similar to each other.
     - Members of different clusters are dissimilar.
   - Usually:
     - Points are in a high-dimensional space (observations have many attributes).
     - Similarity is defined using a distance measure: Euclidean, cosine, Jaccard, edit distance, …
2. Clustering is a hard problem
- Clustering in two dimensions looks easy.
- Clustering small amounts of data looks easy.
- In most cases, looks are not deceiving.
- But:
  - Many applications involve not 2, but 10 or 10,000 dimensions.
  - High-dimensional spaces look different: almost all pairs of points end up roughly equally far apart, so low-dimensional intuition is misleading (see the sketch below).
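A quick, illustrative experiment makes this concrete (not from the lesson; the point count and dimensions are arbitrary choices):

```python
import numpy as np
from scipy.spatial.distance import pdist

# Illustration: for random points, the contrast between the nearest and
# farthest pair shrinks as the number of dimensions grows.
rng = np.random.default_rng(0)
for d in (2, 10, 10_000):
    points = rng.random((100, d))   # 100 random points in the unit cube [0, 1]^d
    dists = pdist(points)           # all pairwise Euclidean distances
    print(f"d={d:>6}: min/max pairwise distance ratio = "
          f"{dists.min() / dists.max():.3f}")
```

As d grows, the ratio climbs toward 1: every point is nearly as far from its closest neighbour as from its farthest, so "near" and "far" lose their contrast.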
3. Example:
- Clustering Sky Objects:
- A catalog of 2 billion sky objects represents objects by their radiation in 7 dimensions (frequency bands).
- Problem: cluster into similar objects, e.g., galaxies, stars, quasars, etc.
- Clustering music albums
- Music divides into categories, and customers prefer a few categories.
- Are categories simply genres?
- Similar albums have similar sets of customers, and vice versa.
- Clustering documents
- Group together documents on the same topic.
- Documents with similar sets of words may be about the same topic.
- Dual formulation: a topic is a group of words that co-occur in many documents.
4. Cosine, Jaccard, Euclidean
   - Different ways of representing documents or music albums lead to different distance measures (see the sketch below):
     - Document as a set of words: Jaccard distance.
     - Document as a point in the space of words (x_i = 1 if word i appears in the doc, 0 otherwise): Euclidean distance.
     - Document as a vector in the space of words (the vector from the origin to that point): cosine distance.
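As a concrete illustration, here is a minimal sketch computing all three distances for one pair of documents (the toy documents are made up, and cosine distance is taken here as 1 minus cosine similarity):

```python
import math

doc_a = "the cat sat on the mat"
doc_b = "the cat ate the rat"

# Representation 1: set of words -> Jaccard distance.
set_a, set_b = set(doc_a.split()), set(doc_b.split())
jaccard = 1 - len(set_a & set_b) / len(set_a | set_b)

# Representations 2 and 3: 0/1 coordinates over the combined vocabulary.
vocab = sorted(set_a | set_b)
x = [1 if w in set_a else 0 for w in vocab]
y = [1 if w in set_b else 0 for w in vocab]

# Point in the space of words -> Euclidean distance.
euclidean = math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Vector from the origin -> cosine distance (1 - cosine similarity).
dot = sum(xi * yi for xi, yi in zip(x, y))
norms = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
cosine = 1 - dot / norms

print(f"Jaccard:   {jaccard:.3f}")
print(f"Euclidean: {euclidean:.3f}")
print(f"Cosine:    {cosine:.3f}")
```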
5. Overview: methods of clustering
   - Hierarchical:
     - Agglomerative (bottom up): each point starts as its own cluster; repeatedly combine the two nearest clusters (see the sketch below).
     - Divisive (top down): start with one cluster and recursively split it.
   - Point assignment:
     - Maintain a set of clusters.
     - Points belong to the nearest cluster.
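A small sketch of the agglomerative (bottom-up) strategy using SciPy's hierarchical-clustering routines; the six points and the choices of average linkage and k = 2 are arbitrary illustrations:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points forming two visually obvious groups (made up for illustration).
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])

# Agglomerative (bottom up): start with every point as its own cluster and
# repeatedly merge the two nearest clusters; "average" linkage uses the mean
# inter-cluster distance as the notion of "nearest".
merges = linkage(points, method="average")

# Cut the merge tree to recover a flat assignment into k = 2 clusters.
labels = fcluster(merges, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]
```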
6. Point assignment: K-means clustering
   - Assumes Euclidean space/distance.
   - Pick k, the number of clusters.
   - Initialize clusters by picking one point per cluster.
   - Until convergence (the full loop is sketched below):
     - For each point, place it in the cluster whose current centroid is nearest.
     - After all points are assigned, update the locations of the centroids of the k clusters: a cluster centroid has its coordinates calculated as the averages of all its points' coordinates.
     - Reassign all points to their closest centroid.
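A compact NumPy sketch of this loop, assuming the simple initialisation above (one random data point per cluster); the function name and test data are my own:

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means following the steps above (illustrative, not optimized)."""
    rng = np.random.default_rng(seed)
    # Initialize clusters by picking one data point per cluster.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Place each point in the cluster whose current centroid is nearest.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid to the average of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs; k-means should recover them.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(points, k=2)
print(centroids)
```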
7. The big question
   - How to select k?
     - Try different values of k, looking at the change in the average distance to the centroid as k increases: the average falls rapidly until the right k is reached, then changes very little (see the sketch below).
   - Picking the initial k points:
     - Approach 1: sampling
       - Cluster a sample of the data using hierarchical clustering to obtain k clusters; the sample fits in main memory.
       - Pick a point from each cluster (e.g., the point closest to its centroid).
     - Approach 2: pick a dispersed set of points (also sketched below)
       - Pick the first point at random.
       - Pick the next point to be the one whose minimum distance from the already selected points is as large as possible.
       - Repeat until we have k points.
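A sketch of both ideas; the test data, function name, and blob layout are my own, and it relies on SciPy's `kmeans` returning the average distance to the nearest centroid as its second value:

```python
import numpy as np
from scipy.cluster.vq import kmeans

def dispersed_init(points, k, seed=0):
    """Approach 2: pick the first point at random, then repeatedly add the
    point whose minimum distance to the points chosen so far is largest."""
    rng = np.random.default_rng(seed)
    chosen = [points[rng.integers(len(points))]]
    while len(chosen) < k:
        dist_to_chosen = np.min(
            [np.linalg.norm(points - c, axis=1) for c in chosen], axis=0)
        chosen.append(points[dist_to_chosen.argmax()])
    return np.array(chosen)

# Three well-separated blobs, so the "right" k is 3 (made-up data).
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(c, 0.5, size=(50, 2))
                    for c in ([0, 0], [6, 0], [0, 6])])

# Elbow heuristic: the average distance to the nearest centroid drops
# sharply until k reaches 3, then changes little as k keeps growing.
for k in range(1, 7):
    centroids, avg_dist = kmeans(points, k)
    print(f"k={k}: average distance to centroid = {avg_dist:.2f}")

# Dispersed initial points land one in each blob.
print(dispersed_init(points, k=3))
```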