Clustering#

1. General problem statement:#

Given a set of data points, with a notion of distance between points, group the points into some number of clusters so that:
- Members of a cluster are close/similar to each other.
- Members of different clusters are dissimilar.
Usually
- Points are in high-dimensional space (observations have many attributes).
- Similarity is defined using a distance measure: Euclidean, Cosine, Jaccard, edit distance …

Clustering in two dimensions looks easy.
Clustering small amounts of data looks easy.
In most cases, looks are not deceiving.
But:
- Many applications involve not 2, but 10 or 10,000 dimensions.
- High-dimensional spaces look different.

Clustering Sky Objects:
- A catalog of 2 billion sky objects represents objects by their radiation in 7 dimensions (frequency bands)
- Problem: cluster into similar objects, e.g., galaxies, stars, quasars, etc.
Clustering music albums
- Music divides into categories, and customer prefer a few categories
- Are categories simply genres?
- Similar Albums have similar sets of customers, and vice-versa
Clustering documents
- Group together documents on the same topic.
- Documents with similar sets of words maybe about the same topic.
- Dual formulation: a topic is a group of words that co-occur in many documents.

Different ways of representing documents or music albums lead to different distance measures.
Document as set of words
- Jaccard distance
Document as point in space of words.
- x_i = 1 if i appears in doc.
- Euclidean distance
Document as vector in space of words.
- Vector from origin to …
- Cosine distance.

Hierarchical:
- Agglomerative (bottom up): each point is a cluster, repeatedly combining two nearest cluster.
- Divisive (top down): start with one cluster and recursively split it.
Point assignment:
- Maintain a set of clusters
- Points belong to nearest cluster

Assumes Euclidean space/distance
Pick k, the number of clusters.
Initialize clsuters by picking on point per cluster.
Until converge
- For each point, place it in the cluster whose current centroid it is nearest.
  - A cluster centroid has its coordinates calculated as the averages of all its points’ coordinates.
- After all points are assigned, update the locations of centroids of the k clusters.
- Reassign all points to their closest centroid.

How to select k?
Try different k, looking at the change in the average distance to centroid, as k increases.
Approach 1: sampling
- Cluster a sample of the data using hierarchical clustering, to obtain k clusters.
- Pick a point from each clsuter (e.g. point closest to centroid)
- Sample fits in main memory.
Approach 2: Pick dispersed set of points
- Pick first point at random
- Pick the next point to be the one whose minimum distance from the selected points is as large as possible.
- Repeat until we have k points.

{% include links.md %}