14

Clustering

instructor: Ross A. Lippert

http://www-math.mit.edu/~lippert/18.417/

Announcements:
  • Start chapter 10
  • New problem set out


What is clustering and why do people want it?

Basic idea:

Bird's eye view:

A quadrant II task

              specific   vague
important        I         II
unimportant     III        IV

Applications


Methods

Kinds of clustering

Heuristic styles


Doing clustering for biologists

Science demands repeatability and robustness. Simpler methods often have an advantage.

Hierarchical methods produce trees as a by-product


k-Means

k-means is simple and popular

   randomly generate k centers c_j
   while 1
     assign each x_i to the block B_j with j = argmin_j d(x_i,c_j)
     c'_j = average of the x_i in B_j
     if all c'_j == c_j then stop!
     else c_j = c'_j
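
The loop above can be sketched in Python as follows. This is a minimal sketch, not a reference implementation. Assumptions not in the slides: points are tuples of numbers, d is squared Euclidean distance, the initial centers are sampled from the data rather than generated freely, an empty block keeps its old center, and `max_iter` caps the loop. The names `lloyd_kmeans`, `sq_dist`, `mean` and `score` are illustrative, and `score` reads Var(B_j) as the sum of squared deviations within a block.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points (assumed choice of d)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def lloyd_kmeans(xs, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(xs, k)  # assumption: initial centers are data points
    blocks = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assign each x to the block of its nearest center.
        blocks = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda j: sq_dist(x, centers[j]))
            blocks[j].append(x)
        # Recompute each center; an empty block keeps its old center.
        new_centers = [mean(b) if b else centers[j] for j, b in enumerate(blocks)]
        if new_centers == centers:
            break  # no center moved: a local optimum of the score
        centers = new_centers
    return centers, blocks

def score(centers, blocks):
    """Sum over blocks of squared distances to the block center."""
    return sum(sq_dist(x, c) for c, b in zip(centers, blocks) for x in b)
```

For example, `lloyd_kmeans(points, 2)` returns the final centers and the blocks as lists of points. Different seeds can end in different local optima on less well separated data.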

Equivalent to a local optimization of:

  F(P) = sum of Var(B_j) = Score(P)
For a partition P of the x_i with k blocks

The Lloyd algorithm (the iteration above) is only one local optimizer for Score(P); another can be substituted

Score(P) just penalizes for bad blocks


k-Means generalizations

Change the optimization score to another block-wise penalty

  F(P) = sum of cost(B_j) = Score(P)
For a partition P of the x_i with k blocks
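
To make the block-wise penalty concrete, here is a small sketch in which the cost function is a parameter. The three cost functions are illustrative choices, not a list taken from the slides, and all names (`cost_sq_dev`, `cost_medoid`, `cost_diameter`, `score`) are hypothetical. Points are assumed to be tuples of numbers with Euclidean distance.

```python
from itertools import combinations
from math import dist

def cost_sq_dev(block):
    """Sum of squared distances to the block mean (the k-means penalty)."""
    n = len(block)
    center = tuple(sum(c) / n for c in zip(*block))
    return sum(dist(x, center) ** 2 for x in block)

def cost_medoid(block):
    """Sum of distances to the best representative chosen from the block."""
    return min(sum(dist(x, m) for x in block) for m in block)

def cost_diameter(block):
    """Largest pairwise distance inside the block (0 for a single point)."""
    return max((dist(a, b) for a, b in combinations(block, 2)), default=0.0)

def score(partition, cost):
    """F(P) = sum of cost(B_j) over the blocks B_j of the partition P."""
    return sum(cost(block) for block in partition)
```

Swapping `cost` changes which partitions look good while the outer optimization problem keeps the same shape.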

Common variations

All variations are computationally intractable to solve exactly (NP-hard in general), hence the heuristics


Corrupted Clique Problems

Defn: A clique graph is a graph for which there exists a partition P of the vertices such that every block of P is a clique and no edge joins two different blocks
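
A quick way to test the definition, reading it as "every connected component is a clique". This is a sketch under an assumed representation: the graph is a dict mapping each vertex to the set of its neighbours, symmetric and without self-loops. The function name `is_clique_graph` is illustrative.

```python
def is_clique_graph(adj):
    """True if every connected component of the graph is a clique.

    Uses the fact that in a clique graph any two adjacent vertices have the
    same closed neighbourhood (the neighbours plus the vertex itself).
    """
    return all(adj[u] | {u} == adj[v] | {v} for u in adj for v in adj[u])
```

Two disjoint triangles pass the test; a path on three vertices does not, because its end points are in the same component but are not adjacent.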

Can obtain a graph from the data by thresholding
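
One way to do the thresholding, as a sketch: join two data points whenever their distance is at most a cutoff. The choice of Euclidean distance, the parameter name `theta`, and the function name `threshold_graph` are assumptions; a similarity score with a ">=" test would work the same way.

```python
from itertools import combinations
from math import dist

def threshold_graph(points, theta):
    """Adjacency sets over point indices: i and j are joined if dist <= theta."""
    adj = {i: set() for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= theta:
            adj[i].add(j)
            adj[j].add(i)
    return adj
```

With clean, well separated data the result is a clique graph; noise in the data shows up as missing or extra edges.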

Corrupted cliques I

Corrupted cliques II


Parallel Classification with Cores (PCC)

Model: G = clique graph + random noise

PCC is correct with high probability in the limit of large graphs

  Divide nodes up into S1, S2, S3
  P = no partition yet
  for each partition P1 of S1
    create P2 from P1 by extending each block B of P1:
      each x in S2 is added to the block B that maximizes affinity(x,B)
    create P3 from P2 by extending each block B of P2:
      each x in S3 is added to the block B that maximizes affinity(x,B)
    P = better of P and P3

  affinity(x,B) = #(edges from x to B)/#(elements in B)
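
A compact sketch of the PCC loop, under several assumptions that the slides leave open: the number of blocks `k` is given, S1, S2, S3 are passed in rather than chosen at random, "better" means fewer edge additions and removals needed to turn the graph into the clique graph of the partition, and affinities during an extension step are measured against the blocks as they stood before that step. The graph is a dict of neighbour sets, and the names `extend`, `edits` and `pcc` are illustrative.

```python
from itertools import product

def affinity(x, block, adj):
    """#(edges from x to block) / #(elements in block)."""
    return len(adj[x] & block) / len(block)

def extend(partition, new_nodes, adj):
    """Add each new node to the block it has maximum affinity with."""
    frozen = [set(b) for b in partition]  # blocks before this extension step
    grown = [set(b) for b in partition]
    for x in new_nodes:
        j = max(range(len(frozen)), key=lambda j: affinity(x, frozen[j], adj))
        grown[j].add(x)
    return grown

def edits(partition, adj):
    """Edge changes needed to make the graph the clique graph of `partition`."""
    label = {v: j for j, b in enumerate(partition) for v in b}
    nodes = sorted(label)
    count = 0
    for a in range(len(nodes)):
        for b in range(a + 1, len(nodes)):
            u, v = nodes[a], nodes[b]
            same_block = label[u] == label[v]
            has_edge = v in adj[u]
            count += same_block != has_edge  # missing or extra edge
    return count

def pcc(adj, s1, s2, s3, k):
    best, best_score = None, None
    # Enumerate every assignment of the nodes of S1 to k non-empty blocks.
    for labels in product(range(k), repeat=len(s1)):
        if len(set(labels)) < k:
            continue
        p1 = [set() for _ in range(k)]
        for v, j in zip(s1, labels):
            p1[j].add(v)
        p2 = extend(p1, s2, adj)
        p3 = extend(p2, s3, adj)
        s = edits(p3, adj)
        if best is None or s < best_score:
            best, best_score = p3, s
    return best, best_score
```

The enumeration over S1 is exponential in its size, which is why S1 has to be kept very small.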

A more flexible heuristic is the Cluster Affinity Search Technique (CAST)

  P = empty partition (no blocks)
  while S not empty
    C = {s} where s in S has maximum degree
    while 1
      if exists v in S-C s.t. affinity(v,C)>=t
        C = C+v  [add highest such v]
      else if exists v in C s.t. affinity(v,C)<t
        C = C-v  [remove lowest such v]
      else
        stop  [leave the inner loop]
    S = S-C
    add C to P
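
A runnable sketch of the CAST loop above. Assumptions not stated in the slides: the graph is a dict of neighbour sets, each vertex counts as adjacent to itself when its affinity to its own cluster is computed (otherwise a lone seed would have affinity 0 and be removed at once), degree is counted among the vertices still in S, ties are broken by vertex order, t is at most 1, and `max_steps` is a safeguard added here because the inner loop has no proven bound.

```python
def affinity(v, cluster, adj):
    """Edges from v into the cluster, counting v itself, over cluster size."""
    return len((adj[v] | {v}) & cluster) / len(cluster)

def cast(adj, t, max_steps=10_000):
    remaining = set(adj)          # S in the pseudocode
    partition = []                # P in the pseudocode
    while remaining:
        # Seed the cluster with a vertex of maximum degree within S.
        seed = max(sorted(remaining), key=lambda v: len(adj[v] & remaining))
        cluster = {seed}
        for _ in range(max_steps):
            outside = sorted(remaining - cluster)
            close = [v for v in outside if affinity(v, cluster, adj) >= t]
            distant = [v for v in sorted(cluster) if affinity(v, cluster, adj) < t]
            if close:
                # Add the outside vertex with the highest affinity.
                cluster.add(max(close, key=lambda v: affinity(v, cluster, adj)))
            elif distant:
                # Remove the member with the lowest affinity.
                cluster.remove(min(distant, key=lambda v: affinity(v, cluster, adj)))
            else:
                break
        remaining -= cluster
        partition.append(cluster)
    return partition
```

The threshold t controls how tight a cluster must be: values near 1 demand near-cliques, lower values tolerate more missing edges.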

CAST: no proven bound on its runtime, but it seems to work well in practice


Spectral methods

Spectral bisection

Corrupted cliques

Low dimensional projection