Clustering

Improvements of k-means algorithm and application in cartography and imaging.

Authors: Hadrien van Lierde & Raphael Peschi.

= Review of Clustering Techniques =

The problem of finding the natural grouping of objects is very important, yet difficult to define in a general context. Basically, we can define a good clustering as a partition of objects into groups such that the objects within a same group are "similar" to each other, while objects from different groups are "not similar" to each other. The notion of similarity is of course very application-dependent, which makes it difficult to design a general clustering method.

The k-means problem was formalized and studied in the 1950's. The algorithm developed by Lloyd to solve the k-means problem (cf. section Point Assignment Clustering: k-means) is still widely used because of its simplicity and its efficiency on some data sets.

The advances in storage technologies have led to the multiplication of very large databases, raising a demand for data analysis techniques able to manipulate these data, i.e. to capture and summarize the information they contain. With this objective of summarizing big data, dozens of clustering algorithms have been developed. Some of them are variants of k-means, while others are based on completely different techniques, such as hierarchical clustering and, more recently, spectral clustering.

Clustering has other applications than analyzing big databases: it can also be used in image processing to compress images, identify objects on an image, etc. We further develop the use of clustering in these applications in the next sections.

This paper starts with a detailed analysis of the k-means algorithm (advantages and drawbacks) and gives some simple improvements that can reduce the impact of some drawbacks of k-means (such as local optimality). We then analyse the spectral clustering algorithm, a powerful tool for cluster extraction. We also develop a way to combine k-means and spectral clustering methods in order to detect the natural number of clusters in a dataset and to detect non-convex shaped clusters. Finally we present hierarchical clustering techniques and the CURE algorithm, and compare the performances of spectral clustering and hierarchical clustering.

= Point Assignment Clustering: k-means =

In this chapter, we focus on k-means problem and algorithms.

K-means Problem
Given a set of points $$ P = \{p_1, ..., p_n\} \subset R^m $$ and an integer $$ k $$, we want to find a set of points, called centroids, $$ M = \{m_1, ..., m_k\} \subset R^m $$ that minimizes $${\sum_{i=1}^n (\min_{m \in M} d(p_i,m))^2}$$. A cluster $$ L_i $$ is then quite naturally defined as the set of points that are closer to $$ m_i $$ than to any other centroid.

We can show that this problem is NP-hard, so we need approximation algorithms to solve it. We present such algorithms in the next section, K-means Algorithms.

K-means Algorithms
The best known k-means algorithm is Lloyd's algorithm. Given the set of points P, the k-means problem can be expressed as the coordinate-by-coordinate minimization of the function $$ g(M,L) = \sum_{i=1}^k \sum_{p \in L_i} d(p,m_i)^2 $$


 * Initialize the centroids $$ M^{(0)}$$ and set $$i = 0$$
 * Iterate until $$ M^{(i)} = M^{(i-1)} $$:
 * $$i \leftarrow i+1$$
 * Assign each point to its closest centroid: $$ L^{(i)} = \arg \min_{L} g(M^{(i-1)},L)$$
 * Update the position of each centroid to the one that minimizes the intra-cluster variance: $$ M^{(i)} = \arg \min_{M} g(M,L^{(i)}) $$. This corresponds to setting $$ m_i $$ as the geometric center of the points in $$L_i$$.

We can prove that it converges to a local optimum $$ (\bar{M},\bar{L}) $$ in a finite number of steps, but it is not guaranteed to find a global optimum.

A trivial example of sub-optimality is when there is only one non-empty centroid $$ m_1 $$, which is the geometric center of all points, and all other centroids are such that for every point p, $$ d(m_1,p) < d(m_i,p) $$. Even though this may be a local optimum, it is obvious that moving one empty centroid to any point p would decrease the cost. Such a case only arises when the initialization of the centroids is particularly bad, but in practice there are other cases where Lloyd's algorithm finds a local optimum of the k-means problem, even when no cluster is empty.

The proof of the finiteness of the number of steps needed to reach convergence is quite simple: at each iteration, the cost decreases strictly (except at the last iteration, which is precisely what allows us to detect convergence). As there is only a finite number of configurations, the number of iterations must be finite. From a practical point of view, this bound on the number of steps is not very interesting because it equals the complexity of the brute-force algorithm, which is guaranteed to find a global minimum. However, tighter bounds on the number of iterations can be found in the literature, and in practice only a few iterations (depending on n, k and the dimension) are needed to reach convergence.

If we require a Euclidean space, the complexity of each iteration is $$ O(k\cdot n) $$. Indeed, the assignment step (minimization of $$ g(M^{(i-1)},L) $$) requires computing, for each point, the distance from that point to every centroid. The second step is where the assumption of a Euclidean space matters: the optimal position of each centroid is simply the geometric mean of the points assigned to that centroid. This takes $$ O(|L_i|) $$ for centroid $$ m_i $$, and thus $$ O(n) $$ for all centroids.

If we do not require a Euclidean space, then the second step is not easy anymore: what would, for example, be the mean of a set of strings? We could define it as the string that minimizes the sum of the squared "edit distances".

The implementation of Lloyd's algorithm in Python is given below. Note that this code is fully functional and can be copy-pasted in the editor: it works!
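The listing below is a minimal, self-contained sketch of such an implementation rather than a verbatim reproduction of the original code; it assumes numpy, and the names kMeans, assignPoints and updateCentroids are ours.

```python
import numpy as np

def assignPoints(points, centroids):
    # Assignment step: index of the closest centroid for each point.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

def updateCentroids(points, labels, k, centroids):
    # Update step: each centroid moves to the geometric center of its cluster.
    newCentroids = centroids.copy()
    for i in range(k):
        members = points[labels == i]
        if len(members) > 0:
            newCentroids[i] = members.mean(axis=0)
    return newCentroids

def kMeans(points, k):
    points = np.asarray(points, dtype=float)
    # Random initialization of the centroids among the data points.
    centroids = points[np.random.choice(len(points), k, replace=False)].astype(float)
    while True:
        labels = assignPoints(points, centroids)
        newCentroids = updateCentroids(points, labels, k, centroids)
        if np.allclose(newCentroids, centroids):
            break
        centroids = newCentroids
    cost = sum(np.linalg.norm(points[i] - centroids[labels[i]]) ** 2
               for i in range(len(points)))
    return centroids, labels, cost
```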

Here is a script that creates data with the right format:
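For instance, a script of the following kind (assuming numpy; the four blob centers are arbitrary) produces points in the expected $$n \times m$$ array format and feeds them to the kMeans sketch above.

```python
import numpy as np

def createData(nPerCluster=50, seed=0):
    # Generate a few Gaussian blobs in the plane, stacked into one (n, 2) array.
    rng = np.random.RandomState(seed)
    centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 6.0], [6.0, 0.0]])
    blobs = [c + rng.randn(nPerCluster, 2) for c in centers]
    return np.vstack(blobs)

points = createData()
centroids, labels, cost = kMeans(points, 4)
print(cost)
```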

If you had the curiosity to run the code, you might observe (though not often with this point configuration) some unsatisfactory results if a local minimum is found. One way to avoid this in practice is to initialize the centroids cleverly instead of randomly (among the data points). A very efficient heuristic is to choose the first centroid randomly among the data points, and then choose as next centroid the data point that maximizes the minimum distance between that point and the previously chosen centroids. This is exactly what the function 'initializationFct' does (cf. below).
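A sketch of this farthest-first initialization (numpy assumed; the name initializationFct follows the text):

```python
import numpy as np

def initializationFct(points, k):
    # Farthest-first initialization: the first centroid is picked at random,
    # each following centroid is the data point maximizing the minimum
    # distance to the centroids already chosen.
    points = np.asarray(points, dtype=float)
    n = len(points)
    centroids = [points[np.random.randint(n)]]
    for _ in range(k - 1):
        dists = np.min(
            np.linalg.norm(points[:, None, :] - np.array(centroids)[None, :, :], axis=2),
            axis=1)
        centroids.append(points[np.argmax(dists)])
    return np.array(centroids)
```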

Just add it to the previous code and replace the call to the random initialization of the centroids by a call to 'initializationFct'.

The time complexity of this initialization is $$ O(n\cdot k) \leq O(n^2) $$, which is the same as one pass of the algorithm and thus not negligible. However, on the different point configurations we tried, the results were often at least as good as with the random initialization, and sometimes much better (when the random initialization causes the algorithm to find a very bad local minimum). One further heuristic improvement would be to reallocate the first centroid (which was randomly chosen) to the data point furthest from the k-1 other centroids, but it does not seem to give better results in practice.

One typical result of the algorithm is displayed in the picture below. Note that for this example, it is not easy to see with the naked eye whether we found the global optimum. Indeed, running the algorithm several times revealed that some local optima give almost the same cost, namely 343.34 (which seems to be the global optimum, but we cannot be sure of that as we did not compute the global optimum by brute force), 343.78, 343.9, etc. This suggests that there are many local minima.



Of course such an initialization does not guarantee to find the optimal solution, as the problem is NP-hard. As the algorithm is polynomial, this would have meant that we had proven P=NP, which is unfortunately not the case...

Improving k-means and Lloyd's Algorithm
As presented in the previous section, the k-means problem as stated above is NP-hard. In order to find an efficient algorithm (namely with polynomial execution time), one may formulate a polynomial-time algorithm with a provable solution quality. However, the approximation algorithms for the k-means problem that have been developed so far do not give a satisfactory guarantee of sub-optimality, and for this reason they are not often used in practice.

Another option to give an approximate solution of k-means problem is to formulate a heuristic algorithm such as Lloyd's algorithm. The main advantage of Lloyd's algorithm is that it is easy to implement and quite intuitive. However, as mentioned before, the following difficulties arise. These are related both to Lloyd's algorithm and to k-means problem in general.


 * Whereas the k-means problem does not explicitly require a Euclidean space, Lloyd's algorithm requires data taken in a Euclidean space in order to update the positions of the centroids efficiently. Algorithms presented in the next sections (such as spectral and hierarchical clustering) will show how to cluster data in non-Euclidean spaces.
 * Lloyd's algorithm might be sensitive to outliers. In order to avoid the negative influence of outliers, one may remove them or at least carefully initialize the centroids.
 * Lloyd's algorithm only converges to a local minimum of k-means optimization problem, without any guarantee on the quality of this optimum. To improve the quality of the result returned by Lloyd's algorithm, one may repeat the process a certain number of times and only keep the best result.
 * To solve k-means problem, one must know the number of clusters a priori. In some cases, the user has a clear idea of how many clusters must appear but in most situations encountered in data analysis, the number of clusters is not known in advance and one must find criteria that select the right number of clusters.
 * Finally, another drawback limits the field of applications where the k-means problem may be used: it only finds convex-shaped clusters. In order to detect non-convex clusters, one should either use another algorithm or combine the k-means procedure with another clustering technique such as spectral clustering. This technique is further explained in the section Applications.

Determining the right number of clusters
One limitation of k-means is that we have to decide in advance what the number of clusters is. In low dimension, humans often have an intuition of the "right" number of clusters to choose. However it is not easy to formalize an objective mathematical criterion: what do we want to optimize? Obviously k cannot be a variable of the k-means optimization problem, otherwise we would just take k equal to the number of data points, place each centroid on a different data point and get an objective of 0, which is not interesting at all.


 * On the one hand, we want the clusters to be quite dense and reasonably small. This criterion clearly favors a large number of clusters.
 * On the other hand, we do not want to divide into different clusters data points that we judge "similar". This argument thus tends to limit the number of clusters.

As always in engineering, it is thus a question of trade-off. The easiest case is of course when, for some reason, we know the number of clusters in advance. In such a situation, the clustering is useful to summarize the data. Suppose for example that we have a database containing the characteristics (weight, length, ...) of dogs. If we know the number of breeds, then we can assume/hope that each cluster will correspond to a breed of dogs. The information provided by the clustering will in that case be the general features of each breed (given by the centroid of each cluster), and the disparity of dogs of the same breed (obtained by looking at the diameter of the cluster, for example).

However it would be nice to find a criterion to automatically detect the "right" number of clusters. In the case below, our intuition clearly tells us that the right number of clusters is 4. Applying k-means for different k's gives the following evolution of the sum of the squared distances between every point and its centroid. The results obtained for k equal to 1 to 8 are displayed below:

Based on this example, it seems that the "optimal" (in the intuitive sense) $$ k=k^*$$ has the following properties:
 * While $$ k < k^*$$, increasing k decreases significantly the cost
 * When $$ k \geq k^*$$, the cost does not decrease significantly (but of course still decreases)

We are thus tempted to say that the right k is the one that maximizes the second derivative. However we know that numerical differentiation does not work well, especially when we only have a few data points.
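As an illustration, here is a sketch of this second-difference heuristic, built on the kMeans sketch above (choosing the k that maximizes the discrete second difference of the cost is only one possible rule):

```python
import numpy as np

def elbowK(points, kMax=8):
    # Run k-means for k = 1..kMax and keep the cost of each clustering.
    costs = [kMeans(points, k)[2] for k in range(1, kMax + 1)]
    # Discrete second difference of the cost curve; the "elbow" is where it is largest.
    second = [costs[i - 1] - 2 * costs[i] + costs[i + 1]
              for i in range(1, len(costs) - 1)]
    return int(np.argmax(second)) + 2  # +2 because second[0] corresponds to k = 2
```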

In the section "Hybrid Algorithm", we will present qa more advanced and reliable method to detect the right number of clusters, using spectral clustering.

A simple repeat procedure to avoid local optimum
As stated before, Lloyd's algorithm might get stuck in some local optimum without any guarantee on the quality of the result obtained. One way to improve the quality of the result is to run Lloyd's algorithm a fixed number of times, changing the initial centroids at each call, and then keep the clustering with smallest associated cost.

Simple restart procedure
In the following subsections, we denote by $$S$$ the set of points that we wish to cluster, $$C$$ the set of centroids and $$k$$ the number of clusters. We recall that the cost associated with a clustering with centroid set $$C$$ is given by

$$ \sum_{p\in S} \left( \min_{c\in C} d(p,c)\right)^2 $$

Let us first reformulate Lloyd's algorithm in a convenient way.

The following algorithm shows one iteration of Lloyd's algorithm. It first assigns all points to the closest current centroid and then updates the position of each centroid (as the geometric center of its neighborhood).

Based on this simple iteration Lloyd's algorithm can be formulated in the following way.

The repeat algorithm is then stated as follows.
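A compact sketch of these procedures in the notation of this section (numpy assumed; the names LloydIteration and LloydAlgorithm follow the text, RepeatLloyd is ours, and the original listings are not reproduced):

```python
import numpy as np

def cost(C, S):
    # Total squared distance between each point and its closest centroid.
    d = np.linalg.norm(S[:, None, :] - C[None, :, :], axis=2)
    return float(np.sum(np.min(d, axis=1) ** 2))

def LloydIteration(C, S):
    # One iteration: assign each point to its closest centroid, then move
    # each centroid to the geometric center of the points assigned to it.
    d = np.linalg.norm(S[:, None, :] - C[None, :, :], axis=2)
    labels = np.argmin(d, axis=1)
    newC = C.copy()
    for i in range(len(C)):
        members = S[labels == i]
        if len(members) > 0:
            newC[i] = members.mean(axis=0)
    return newC

def LloydAlgorithm(S, k):
    S = np.asarray(S, dtype=float)
    C = S[np.random.choice(len(S), k, replace=False)].astype(float)
    while True:
        newC = LloydIteration(C, S)
        if np.allclose(newC, C):
            return C
        C = newC

def RepeatLloyd(S, k, nRestart=10):
    # Naive repeat procedure: keep the best of several random restarts.
    S = np.asarray(S, dtype=float)
    best, bestCost = None, np.inf
    for _ in range(nRestart):
        C = LloydAlgorithm(S, k)
        c = cost(C, S)
        if c < bestCost:
            best, bestCost = C, c
    return best, bestCost
```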

Obviously, this algorithm cannot return a worse result than the simple Lloyd's algorithm (in terms of the total cost of the set of centroids returned eventually). The time complexity of one iteration of $$LloydIteration$$ is $$O(k\cdot n)$$, which is at most $$O(n^2)$$ (as we cannot seek more clusters than the number of points in $$S$$). The time complexity of $$LloydAlgorithm$$ depends on the convergence rate of the algorithm.

How could we make this algorithm more efficient? It would save time if we could predict when a given call of $$LloydAlgorithm$$ is certainly not going to give a better solution than the ones found previously, and avoid running the whole algorithm for this iteration.

Bound on the solution of Lloyd algorithm
Given a set of points $$S$$ and a set of centroids $$C$$, our purpose is to compute a lower bound on the locally optimal cost returned by $$LloydAlgorithm$$ if we ran it until convergence, starting with points $$S$$ and centroids $$C$$. A proof of correctness of this bound was given by Zhenjie Zhang and Anthony K.H. Tung (On the Lower Bound of Local Optimums in K-Means Algorithm, School of Computing, National University of Singapore). The purpose of this section is to explain the general idea underlying the proof of correctness of this bound (a complete proof is not given here, but the insight of the reasoning is). In the next section, an algorithm based on the computation of this bound is formulated. In the following sections, we use the notation $$cost(C,S)$$ to denote the cost of clustering the set of points $$S$$ with centroids in the set $$C$$.

We may summarize the overall procedure in the following way. Given a data set $$S$$ and a set of centroids $$C$$, we want to find a region $$r_c$$ around each centroid $$c$$ such that centroid $$c$$ will remain in region $$r_c$$ if we run $$LloydAlgorithm$$ with points $$S$$ and initial set of centroids $$C$$. Then, we use this result to compute a lower bound on the cost one may obtain by running Lloyd's algorithm.

The reasoning that we follow to obtain the bound is decomposed into four steps. The notation used is $$C$$ for the current set of centroids, $$S$$ for the set of input points and $$cost(C,S)$$, the cost obtained when choosing the centroids as defined by $$C$$.

Step 1 : Definition of maximal region In this section, we show how to find a region $$r_c$$ around each centroid $$ c\in C $$ such that centroid $$c$$ will remain in region $$r_c$$ if we run $$LloydAlgorithm$$. First we formulate the following intermediate result.

An iteration of Lloyd's algorithm first assigns each point to the closest centroid and then moves the centroids according to the new assignment of points. Let us consider a centroid $$c_i\in C$$, $$\tilde{c}_i$$ the new position of $$c_i$$ after updating its position, and $$\tilde{C}=C\setminus \{c_i\} \cup \{\tilde{c}_i\}$$. Then, moving $$c_i$$ to $$\tilde{c}_i$$, we obviously improve the total cost of $$C$$:

$$ cost(\tilde{C},S)<cost(C,S)$$

The following (less trivial) result can be proved:

Lemma If we consider any point $$t$$ on the line segment joining $$ c_i $$ and $$\tilde{c}_i$$ (see figure below), we have:

$$ cost(C\setminus \{c_i\} \bigcup \{t\},S)<cost(C,S)$$

Based on the preceding result, one may easily prove the following theorem:

Theorem Whenever a centroid $$c_i$$ lies in a bounded region $$R_i$$ such that the cost strictly increases if we move $$c_i$$ to any point of the boundary of $$R_i$$, then $$c_i$$ must necessarily stay in $$R_i$$ if we run Lloyd's algorithm with $$c_i$$ as one of the initial centroids.

The region $$R_i$$ defined in the theorem above is called a maximal sub-region for centroid $$c_i$$. Based on this notion, we provide a formal definition of a maximal region:

Definition $$R=\bigcup_iR_i$$ is a maximal region for the set of data $$S$$ and the set of centroids $$C$$ if, for all $$i$$, $$R_i$$ is a maximal sub-region for centroid $$c_i \in C$$.

Step 2 : defining a ball shaped maximal region The preceding result is quite powerful but does not give any information about the shape of the maximal region. For the sake of simplicity, we look for a specific type of maximal region. Given a set of centroids $$C$$ and a positive real number $$\Delta$$, this region is defined in the following way.

$$ R(C,\Delta)=\{C'=\{c_1',...,c_k'\}|\forall i,\Vert c_i-c_i'\Vert\leq\Delta\} $$

Now, we must verify that $$R(C,\Delta)$$ is indeed maximal. This is done by computing two terms, corresponding to the costs associated respectively to two different choices of centroid sets $$C$$ and $$C'$$. We remind that the cost associated to a set of centroid $$C$$ and a set of points $$S$$ is given by

$$ \sum_{p\in S} \left( \min_{c\in C} d(p,c)\right)^2 $$

The two terms that are needed are:


 * $$T_1=cost(C,S)$$: the cost of the current set of centroids, expressed as a function of the positions of centroids and the distance between each point and its centroid.
 * $$T_2^j\leq cost(C',S)$$: a lower bound on the cost of any set of centroids taken in

$$ B(c_j,C,\Delta):=\{C'=\{c_1',...,c_k'\}|\forall i,\Vert c_i-c_i'\Vert\leq\Delta\text{ and }\Vert c_j-c_j'\Vert=\Delta\} $$

Intuitively, we define a ball of radius $$\Delta$$ around each centroid $$c_i$$ and allow $$c_i'$$ to be taken anywhere in the ball except for one specific centroid $$c_j'$$ that must be taken on the boundary of this ball around $$c_j$$.

One can show that the boundary of $$R(C,\Delta)$$ is defined by

$$ \bigcup_j B(c_j,C,\Delta):=\{C'=\{c_1',...,c_k'\}|\forall i,\Vert c_i-c_i'\Vert\leq\Delta\text{ and }\Vert c_j-c_j'\Vert=\Delta\} $$

In order to show that $$R(C,\Delta)$$ is maximal, we only have to show that for all $$j$$ and for all $$C'\in B(c_j,C,\Delta)$$, $$cost(C',S)>cost(C,S)$$. A sufficient condition ensuring that this inequality is satisfied is

$$ T_2^j-T_1>0 $$

The term $$T_2^j-T_1$$ is obviously a function of $$\Delta$$, which we denote by $$f$$. $$R(C,\Delta)$$ is thus maximal if

$$ f(\Delta)>0 $$

Step 3: computation of the cost Given a set of centroids $$C$$ and a set of points $$S$$, it is easy to show the following result:

If k-means algorithm converges to a set of centroids $$C'\in R(C,\Delta)$$,

$$ cost(C',S)\geq cost(C,S)-|S|\Delta^2 $$

as the squared distance between any point and its centroid in $$C'$$ cannot decrease by more than $$\Delta^2$$ compared to its value with the centroids in $$C$$.

Summarizing steps 1 through 3, we have: Given a current centroid set $$C$$, a set of points $$S$$, and a constant $$\Delta$$ verifying $$f(\Delta)>0$$, if Lloyd's algorithm converges to a set of centroids $$C'$$,

$$ cost(C',S)\geq cost(C,S)-|S|\Delta^2 $$

Obviously, the smaller $$\Delta$$ is, the tighter the bound we obtain.

An adapted version of the repeat algorithm The results described above provide a way to accelerate the naive repeat algorithm described before.

We keep a record of $$costOpt$$, the minimum cost achieved so far. Then, at each iteration of Lloyd's algorithm (where $$C$$ denotes the set of centroids obtained at this stage), we add three extra steps:


 * find the smallest $$\Delta>0$$ such that $$f(\Delta)>0$$
 * compute the bound: $$\beta(C,S)=cost(C,S)-|S|\Delta^2$$
 * verify whether $$\beta<costOpt$$; if not, restart the whole Lloyd's algorithm (with other initial centroids)

Following this idea, we formulate the whole repeat algorithm using bound estimation.
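A rough sketch of this bounded repeat procedure, built on the LloydIteration and cost sketches above (numpy assumed). The helper smallestDelta is only a placeholder for the critical-value search described in the reference and must be supplied by the user; it is not implemented here.

```python
import numpy as np

def boundedRepeatLloyd(S, k, smallestDelta, nRestart=10, maxIter=100):
    # Repeat procedure with early abort based on the lower bound
    # beta(C, S) = cost(C, S) - |S| * Delta^2.
    # 'smallestDelta(C, S)' is assumed to return the smallest Delta > 0
    # with f(Delta) > 0 (its computation follows the cited reference).
    S = np.asarray(S, dtype=float)
    best, costOpt = None, np.inf
    for _ in range(nRestart):
        C = S[np.random.choice(len(S), k, replace=False)].astype(float)
        for _ in range(maxIter):
            newC = LloydIteration(C, S)
            if np.allclose(newC, C):
                break
            C = newC
            delta = smallestDelta(C, S)
            beta = cost(C, S) - len(S) * delta ** 2
            if beta >= costOpt:
                break  # this run cannot beat the best solution found so far
        c = cost(C, S)
        if c < costOpt:
            best, costOpt = C, c
    return best, costOpt
```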

This algorithm is essentially equivalent to the simple repeat procedure, but it makes use of the bound calculation to avoid searching a part of the solution space completely when we know for sure that it will not beat the best solution found so far. In practice, finding the bound $$\beta(C,S)$$ and especially the smallest value of $$\Delta\geq 0$$ ensuring $$f(\Delta)>0$$ is not trivial. Further mathematical developments show that this can be done in $$2|S|$$ iterations by only focusing on some critical values of $$\Delta$$ (see the reference above for further details).

Implementation
We implemented the restart-with-bound algorithm in Matlab. The bound-restart algorithm tends to be faster than the normal Lloyd's algorithm when the number of data points and the number of clusters are large. The graph below shows the evolution of the bound for a set of 1000 points and 8 clusters in dimension 6.

The Matlab code of the implementation is the following. It consists of three functions: the first one for the restart scheme, the second one for the accelerated k-means algorithm, and the third one for a simple iteration with bound computation. Details about the formulas are given in the article cited above.

= Basic Spectral Clustering =
Up to now, we have focused on the k-means problem and on algorithms providing an approximate solution to it. We faced some of the inherent problems of the k-means formulation and of the associated algorithms (mainly Lloyd's algorithm), and we provided methods to avoid some of the difficulties encountered. However, two major issues were not treated in the preceding chapters.


 * The formulation of k-means problem makes the assumption that the clusters we seek are convex-shaped. However, as we will see in the following sections, some applications may involve non convex clusters for which k-means algorithms cannot be used.
 * Theoretically, the k-means problem can be formulated for data belonging to any metric space (with a consistent definition of distance). However, some approximation algorithms for the k-means problem (such as Lloyd's algorithm) only work efficiently on Euclidean spaces, for which the definition and computation of the centroids of clusters is straightforward. This is a very strong restriction, as the data we want to treat (such as strings, for instance) may belong to non-Euclidean spaces.

Now the question is: how can we design an algorithm that offers the same advantages of simplicity as k-means algorithms, but offers more freedom in the definition of clusters and their shapes and allows the processing of data belonging to a wider class of spaces?

Spectral clustering is a recent and popular clustering method. In this section, we present the general principle of this technique (see the literature for a detailed overview of the method). In the next chapter, we show how to combine spectral clustering with k-means for specific applications.

Apart from its simplicity, one of the major advantages of spectral clustering techniques is the large class of data that these methods can process. The data treated are objects that are symmetrically related to each other. Following this idea, one may build an undirected graph (called the similarity graph) where objects are associated with nodes and relations are represented as edges. The algorithm extracting the clusters from this similarity graph makes use of some spectral properties of the adjacency matrix of the similarity graph. These steps of the method are explained in detail in the following sections.

Definition of the similarity graph
The key element of spectral clustering techniques is the definition of the similarity graph, represented by its adjacency matrix, simply called the similarity matrix. If we assume the data consist of objects $$o_1,...,o_n$$, the entry $$S_{ij}$$ of the similarity matrix is a measure of the similarity between objects $$o_i$$ and $$o_j$$. The similarity between objects must be a positive real number. The clustering process is expected to partition the objects into groups such that a pair of objects belonging to the same group has a high similarity whereas a pair of objects belonging to different groups is highly dissimilar. A wide variety of similarity matrices can be used. Some of them are given below.


 * Classical definitions based on the distance between objects: these definitions are based on the existence of a distance measure $$d_{ij}$$ between each pair $$(i,j)$$ of objects.


 * The $$\epsilon$$-neighborhood graph: two objects are connected if they are sufficiently close to each other:

$$ S_{ij}=\left\{\begin{array}{ll} 1 & \text{ if } d_{ij}<\epsilon \\ 0 & \text{ otherwise}\end{array}\right. $$


 * The k-nearest neighbor graph: connect two objects $$o_i$$ and $$o_j$$ if either $$o_i$$ is among the k-nearest neighbors of $$o_j$$ or $$o_j$$ is among the k-nearest neighbors of $$ o_i$$ . The weight of the corresponding edge is simply the distance $$d_{ij}$$.


 * More advanced definitions of the similarity can be used. For instance, it is possible to use a result provided by k-means algorithm in order to compute the similarity (data points in the same cluster are considered similar).
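As an illustration, here is a sketch of the two distance-based constructions for points in a Euclidean space (numpy assumed; epsilon and k are arbitrary parameters):

```python
import numpy as np

def epsilonGraph(points, eps):
    # epsilon-neighborhood graph: S_ij = 1 if d_ij < eps, 0 otherwise.
    points = np.asarray(points, dtype=float)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    S = (d < eps).astype(float)
    np.fill_diagonal(S, 0.0)
    return S

def kNearestGraph(points, k):
    # k-nearest-neighbor graph: o_i and o_j are connected if one of them is
    # among the k nearest neighbors of the other; the edge weight is d_ij.
    points = np.asarray(points, dtype=float)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    n = len(points)
    S = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(d[i])[1:k + 1]  # skip the point itself
        S[i, neighbors] = d[i, neighbors]
    return np.maximum(S, S.T)  # symmetrize ("either ... or ..." rule)
```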

Graph Laplacian of the Similarity Graph
The spectral clustering algorithm makes use of the graph Laplacian of the similarity graph, which is defined as:

$$L=D-S$$

where $$S$$ is the similarity matrix and $$D$$ is the diagonal degree matrix, namely, $$d_{ii}=\sum_jS_{ij}.$$

Basic properties of the graph Laplacian include the fact that $$L$$ is symmetric positive semi-definite and that $$0$$ is always an eigenvalue of $$L$$, with the all-ones vector as associated eigenvector.

A remarkable property of the graph Laplacian is the following (see the literature for more details):

Theorem:  For any undirected graph $$G$$ with adjacency matrix $$A$$ and graph Laplacian $$L=D-A$$, the multiplicity of the eigenvalue $$0$$ is the number of connected components of the graph.

This remarkable fact gives a reliable tool to find the number of connected components of a graph; we will come back to this property and use it in order to detect the number of clusters in a graph.

Spectral clustering algorithm
To use the graph Laplacian for the spectral clustering method, we must first define a normalized version of it.

$$ L_n=D^{-1/2}LD^{-1/2} $$

Other versions of the normalized graph Laplacian exist but this one has the advantage that it is symmetric.

The spectral clustering algorithm is formulated as follows (see the literature for more details):

Algorithm spectral_clustering(S,k)
 * Define the normalized Laplacian $$L_n$$
 * Compute the first k eigenvectors of $$L_n$$ (those associated with the k smallest eigenvalues)
 * Store these eigenvectors as the columns of an $$n\times k$$ matrix U
 * Cluster the n rows of U with the k-means algorithm

The clustering of the rows of matrix U gives the final clustering of objects: if row i of U happens to be in the j-th cluster then the i-th object $$o_i$$ should also be placed in the j-th cluster.
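A compact sketch of this procedure (numpy assumed, re-using the kMeans sketch from the k-means chapter; any k-means routine would do):

```python
import numpy as np

def spectral_clustering(S, k):
    # S is the (symmetric, non-negative) similarity matrix.
    S = np.asarray(S, dtype=float)
    d = S.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))  # D^{-1/2}, guarded against isolated nodes
    L = np.diag(d) - S                                     # unnormalized Laplacian
    Ln = Dinv @ L @ Dinv                                   # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(Ln)                  # eigenvalues in ascending order
    U = eigvecs[:, :k]                                     # first k eigenvectors as columns
    _, labels, _ = kMeans(U, k)                            # cluster the rows of U
    return labels
```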

Proving the correctness of this algorithm is not obvious and is out of the scope of this paper (ideas of proofs can be found in the literature).

Selecting the number of clusters for spectral clustering
The number k of clusters for spectral clustering is a meta-parameter of the algorithm that should be chosen in advance. However, when the clusters are well separated, there exists one straightforward technique to predict what number of clusters should be chosen.

When considering the graph Laplacian of an undirected graph, we recall that the multiplicity of 0 as an eigenvalue equals the number of connected components of the graph. Actually, when applying the spectral clustering technique to detect k clusters in the similarity graph associated with a network, the magnitude of the first k eigenvalues is a measure of how disconnected these clusters are. This idea and its applications will be further developed in the following sections.

Examples of application of the spectral clustering algorithm can be found in the section "a hybrid algorithm" and in the section "applications".

= A Hybrid Algorithm: Combination of Spectral Clustering and k-means =
In this section, we present a hybrid algorithm that can detect non-convex shaped clusters and determine the correct number of clusters.

Detecting non convex clusters
We know that k-means fails at detecting interlacing non-convex shapes. Hierarchical algorithms can be used; however, they are quite slow. On the other hand, spectral clustering works well, but to use it we need a similarity matrix. The idea is to use k-means in order to build a similarity matrix: we run k-means a large number of times, each time with a different k. The similarity matrix $$ S $$ is defined by

$$ S_{ij} = \text{ the number of times } i \text{ and } j \text{ are in the same cluster } $$

Then we perform a spectral clustering based on that similarity matrix. The idea is that the notion of similarity between points propagates step by step between close neighbors. Let us show on an example what happens: suppose we want to cluster these points:

Clustering the points with k-means with different k's gives this kind of result: we display here a fictive result of how k-means could behave with a very small k (k=2, yellow), a bigger k (red) and a very big k (green; these clusters are not all displayed, but you get the idea...). We see here that taking large clusters (thus a small k) is not really relevant because it groups together points that we do not want to be similar. We will thus use large k's. The key point is that the clusters obtained with one given k=k1 overlap with other clusters obtained with k=k2, k3, ... This allows the notion of similarity to propagate between the branches of each spiral. Finally, using spectral clustering on the similarity matrix obtained gives the following (non-fictive!) result, so we can see that it works really well:



To summarize, the Hybrid algorithm consists in applying step by step the following points:


 * For $$k \in \{k_1..k_N\} \doteq nKmeans$$, run k-means and store the point assignment (idx)
 * Initialize a similarity matrix $$S = zeros(n,n)$$, where $$n$$ is the number of points, and fill it as shown in the sketch after this list:


 * Perform a spectral clustering on the similarity matrix S
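A minimal sketch of these steps, built on the kMeans and spectral_clustering sketches above (the list of k values is arbitrary and should be chosen large, as explained above):

```python
import numpy as np

def hybridClustering(points, kValues, nClusters):
    n = len(points)
    S = np.zeros((n, n))
    # Run k-means once for each k and count how many times each pair of
    # points ends up in the same cluster.
    for k in kValues:
        _, idx, _ = kMeans(points, k)
        for i in range(n):
            S[i, :] += (idx == idx[i])
    np.fill_diagonal(S, 0.0)
    # Spectral clustering on the accumulated similarity matrix.
    return spectral_clustering(S, nClusters)

# Example usage (two interlaced spirals, large k's so that points of
# different spirals rarely share a cluster):
# labels = hybridClustering(points, kValues=range(20, 60, 5), nClusters=2)
```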

Hybrid algorithm for determining the number of clusters
The clustering technique described above gives good results for determining the clusters. However, it requires knowing the number of clusters in advance. As we mentioned before, we can still use the graph Laplacian to determine that number. Let us recall one interesting property of spectral clustering: the multiplicity of 0 as an eigenvalue of the graph Laplacian of an undirected graph equals the number of connected components of the graph, and when applying the spectral clustering technique to detect k clusters in the similarity graph associated with a network, the magnitude of the first k eigenvalues is a measure of how disconnected these clusters are. In particular, if the clusters are completely disconnected as defined by the similarity matrix, then the multiplicity of the eigenvalue 0 is the number of clusters we seek. Finding the number of clusters is often a complicated operation and there is no automatic heuristic that works 100 percent of the time, so when the clusters are disconnected enough, spectral clustering gives a very reliable technique to avoid this difficulty. We proceed as follows:
 * Build the similarity matrix as previously (with large values of k, such that two points belonging to different "objects" are (almost) never together in the same cluster),
 * Filter the very small values of the similarity matrix, in order to compensate for a possible failure of the previous point,
 * Count the number of eigenvalues of the graph Laplacian whose magnitudes are below a chosen threshold: this corresponds to the number of clusters.
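A sketch of this counting procedure (numpy assumed; the tolerance and the filtering threshold are arbitrary values to be tuned to the data):

```python
import numpy as np

def countClusters(S, tol=1e-6, minSimilarity=0.0):
    # Filter very small similarities, then count the near-zero eigenvalues
    # of the graph Laplacian L = D - S.
    S = np.where(S > minSimilarity, np.asarray(S, dtype=float), 0.0)
    L = np.diag(S.sum(axis=1)) - S
    eigvals = np.linalg.eigvalsh(L)
    return int(np.sum(eigvals < tol))
```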

Further examples of applications of this technique are given in the next section.

= Applications =
Besides our theoretical reasoning, it is important for us to give a more practical aspect to our project, so we turn to some nice applications of clustering in image processing. In particular, we will see how we can compress an image without losing too much quality, and how we can detect, isolate and count similar items on a picture.

Image compression
Image compression is a subject that has already been broadly studied because of its interest for a large public (many people want to store a large amount of images and movies on their computers, smartphones, ...).

We distinguish two kinds of image compression: lossy and lossless. Lossy compression may be prohibited in some applications, such as medical imaging, where every detail may count; on the other hand, lossless compression is quite limited in the sense that it generally does not allow a significant reduction of the data. Lossy compression, however, is very useful in many cases where compressing the image by a big factor is worth losing some details.

There are many techniques of lossy image compression, from the simplest (basically sub-sampling the image) to the most complicated. The method we are going to present here is of course based on clustering: the idea is to reduce the number of colors used by clustering the pixels in the RGB space and replacing the RGB value of every pixel (x,y) by the RGB value of the centroid of the cluster to which (x,y) belongs.

The rationale is that points that are neighbors in the (x,y)-plane are likely to have similar colors. Thus, after replacing similar colors by a single one (which we do thanks to clustering), the resulting picture will have big areas with the same color, allowing the information to be stored in a more compact form.

Before going deeper into explanations, let us take a look at what the final result looks like: we achieve a compression by almost a factor 10 (49 kB instead of 420 kB) without dramatically decreasing the quality.



K-means is particularly suited for clustering points in the RGB space, because it is a fast (and quite intuitive) method. Moreover, its main deficiency, namely the difficulty to handle interlacing non-convex shapes, is not a problem here. Indeed, our perception of colors is such that pixels with similar colors form a convex shape in the RGB space. The figure below shows the clustering (with 10 clusters) of the pixels of the initial image in the RGB space:



We will now explain in a little more detail how the method works by showing the different steps on a very simple dummy example. First of all, let us introduce the notation: an image is fully represented by a set of points giving the position (x,y) of each pixel and its (R,G,B) components, i.e.

$$ (x,y,R,G,B) \in \{1..n_x\} \times \{1..n_y\} \times \{1..255\}^3$$

Our dummy example is the following: we have a 9-pixel image Im, containing thus up to 9 different RGB values, and we would like to compress it by keeping only 2 colors, in order to have only 2 different RGB values to store.

$$ Im = \left(\begin{array}{lll} (100,200,200) & (110,210,210) & (120,220,220)\\ (10,20,20) & (11,21,21) & (12,22,22)\\ (110,220,220)& (100,200,200)& (120,210,210) \end{array}\right) $$

Obviously, all the pixels of the first row and all the pixels of the third row are in the same cluster, with centroid (110,210,210), while all the pixels of the second row belong to the second cluster, with centroid (11,21,21). We then assign to each pixel (x,y) the RGB value of the corresponding centroid:

$$ Im = \left(\begin{array}{lll} (110,210,210)& (110,210,210) & (110,210,210)\\ (11,21,21)& (11,21,21) & (11,21,21)\\ (110,210,210)& (110,210,210)&(110,210,210) \end{array}\right) $$

This allows formats like .png, .jpg, etc. to store the information in a more compact form, for example as

$$ RGB(x,y)= \left\{\begin{array}{ll} (110,210,210) & \text{ if } y=1 \text{ or } 3\\ (11,21,21) & \text{ else }\\ \end{array}\right. $$
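As an illustration, here is a sketch of the whole color-quantization step on an arbitrary RGB image array (numpy assumed, re-using the kMeans sketch from the k-means chapter; the choice of 10 colors is arbitrary):

```python
import numpy as np

def compressColors(image, nColors=10):
    # image is an (ny, nx, 3) array of RGB values.
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)
    # Cluster the pixels in RGB space and replace each pixel by its centroid.
    centroids, labels, _ = kMeans(pixels, nColors)
    quantized = centroids[labels].reshape(h, w, 3)
    return np.clip(np.round(quantized), 0, 255).astype(np.uint8)
```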

Object Localization
A very nice application of clustering techniques is the detection and counting of "similar objects" on a picture. Imagine for example that the YellowCab company in Manhattan is interested in having some statistics on the number and spatial disposition of taxis at strategic places. To achieve this objective, the company has installed cameras on top of some skyscrapers that periodically take a picture of the road, but they do not want to analyse each picture manually. This situation may seem unrealistic, but why not...

First of all, let us clarify the notion of similarity of objects: we understand that two objects are "similar" if (and only if) they have "similar" colors and shapes. Typically, this will be the case for the yellow taxis. Let us begin by assuming that the number of clusters is known; then, to localize the different objects, we follow these steps:


 * 1) Cluster the original picture in the RGB space such that all the similar items have exactly the same color. The choice of the number of clusters is important: taking too few clusters would lead to confusing the yellow of the taxis with other similar colors, like the white road markings. On the other hand, taking too many clusters may lead to separating the yellow color into two sorts of yellow (if, for example, the shadow of a skyscraper makes some cabs appear darker than others). However, yellow is a quite distinguishable color, so it is preferable to take too many clusters rather than too few. Typically 3 to 10 clusters work, but remember that the more clusters you want, the slower k-means is; 3, 4 or 5 is thus a good choice.
 * 2) Only keep the interesting points, i.e. the ones that have the targeted color (yellow in the case of yellow cabs). This can be done easily by measuring the (Euclidean) distance between every color in the new image (i.e. every centroid from step 1) and the RGB value of yellow: (255,255,0). Store the positions (x,y) of these interesting points (a sketch of this filtering step is given after this list).
 * 3) Cluster these points in the (x,y) space with different k's, where k is the number of clusters, and minimize some error criterion in order to choose the right k. The clustering can be done with k-means if the objects have a "convex" shape. In fact they do not need to be convex stricto sensu: they can have holes (for example at the location of the windows) as long as each object can be enlarged into a convex shape in which the density of the colored points is "high enough". While not very rigorous, this criterion is typically satisfied in the case of taxis, or even objects with more complex shapes (like planes, as we shall see later). The definition of an error criterion is obviously the most difficult part: clearly we have to penalize clusters that are too large (because they are likely to contain several taxis), but also penalize too-small clusters, to avoid splitting a taxi into several clusters.
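A sketch of the filtering of step 2 (numpy assumed; the target color and the distance threshold are illustrative parameters, and the image is assumed to have been color-quantized as in the compression section):

```python
import numpy as np

def filterColor(image, target=(255, 255, 0), maxDist=100.0):
    # Keep the (x, y) positions of pixels whose (quantized) color is close
    # to the target color, measured by Euclidean distance in RGB space.
    dist = np.linalg.norm(image.astype(float) - np.array(target, dtype=float), axis=2)
    ys, xs = np.nonzero(dist < maxDist)
    return np.column_stack([xs, ys]).astype(float)
```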



Object counting
One of the drawbacks of the approach described above is that it requires knowing the number of clusters (namely the number of objects we wish to detect) in advance. We can of course rely on the minimization of some error criterion, but this approach gives results that are not very accurate. To avoid this difficulty, we give a reliable algorithm (in our opinion, one of the most interesting results of this report). The technique involves the hybrid algorithm (combination of k-means and spectral clustering) described before.

The overall technique can be described as follows.
 * As described before, filter the points (corresponding to pixels) to only keep the ones associated to the color of the target objects.
 * Run k-means algorithm several times on the set of points for large values of k (typically 200).
 * Define the similarity matrix: the similarity $$ S_{ij}$$ between two objects is defined as the number of clusters obtained at step 2 for which i and j lie in the same cluster.
 * Calculate the graph Laplacian associated with the similarity graph and count its eigenvalues of magnitude zero (or nearly zero); this gives the number of clusters.

How can we be sure that this gives the correct answer? It relies on the fact that when the number k of clusters considered in the second step is very large, two points belonging to different objects are very unlikely to appear in the same cluster. This means that the objects we wish to isolate correspond exactly to connected components of the similarity graph. In practice, as we are not sure that the clusters are fully disconnected in the similarity graph, we add a small tolerance on the eigenvalues: an eigenvalue is considered as 0 if it is below a certain threshold.

The following graph shows the evolution of the magnitude of the eigenvalues of the graph Laplacian for the example of 5 planes. We see that the first 5 eigenvalues are indeed equal to zero.



Object Localization and counting
How can we combine the preceding results? We apply the algorithm formulated in the preceding section in order to find the number of objects. Then we may use another technique to estimate the localization of the objects: for instance, we can run the spectral algorithm until completion in order to detect the clusters present on the image. However, whereas it is very powerful for detecting the number of objects, this algorithm does not give very accurate results for separating objects. Another technique we could use is running k-means one more time with the right number of clusters. Finally, the approach that gives the most reliable result is to run the spectral algorithm with another definition of the similarity graph. We simply use the m-nearest-neighbors similarity graph: objects i and j are connected (nonzero similarity) if and only if either i is among j's m nearest neighbors or j is among i's m nearest neighbors (according, for instance, to the Euclidean distance). If objects i and j are connected, their similarity is given by the distance between them. We then run the spectral clustering algorithm on this similarity matrix. If we choose m to be about half the number of points representing an object (which can be approximated by knowing the resolution of the image and the approximate size of each object on the picture) and if the distance between objects is large enough, this technique gives satisfactory results, as shown in the figure below for 5 planes.



The overall technique can thus be summarized in the following steps:
 * Filter the points (corresponding to pixels) to only keep the ones associated with the color of the target objects.
 * Run the k-means algorithm several times on the set of points for large values of k (typically 200).
 * Define the similarity matrix: the similarity $$ S_{ij}$$ between two objects is defined as the number of clusters obtained at step 2 for which i and j lie in the same cluster.
 * Calculate the graph Laplacian associated with the similarity graph and count its eigenvalues of magnitude zero (or nearly zero); this gives the number of clusters.
 * Define the m-nearest-neighbors similarity matrix of the data points.
 * Apply spectral clustering to this similarity matrix, knowing the number of clusters we seek (found at step 4).

= Hierarchical Clustering =

Up to now we have analysed three general algorithms: k-means, spectral clustering and the hybrid algorithm. All these methods are more or less based on the k-means algorithm. We have seen that spectral clustering and the hybrid algorithm can overcome some of the difficulties encountered with k-means. Now, one could ask the following question: are there other algorithms, not based on k-means, that could potentially overcome the difficulties encountered with k-means? Hierarchical clustering techniques, and especially the CURE algorithm, are powerful tools that can handle non-convex shaped clusters. The purpose of this section is to give some insight into the definition of this algorithm and, most of all, to give an efficient implementation of it. Further extensions could include a combination of hierarchical clustering tools and the k-means or spectral clustering algorithms.

Basic Hierarchical Clustering
The principle of hierarchical clustering is quite intuitive. We start with each point in its own cluster, iteratively merge clusters that are "close" to each other, and stop the process when some stopping criterion is fulfilled.

The greatest advantages of hierarchical clustering are:
 * there is no restriction on the shape of clusters; only the merging rule defines what the clusters will look like,
 * the data might belong to a non-Euclidean space, as the only assumption is the existence of a distance measure between each pair of data points (metric space).

The key elements of hierarchical clustering are the merging rule and the stopping rule. These elements were already part of the slides of the presentation; however, we think it is useful to recall the main aspects of the technique.

Merging rule
Depending on the application, several merging rules are available. Most of them are based on a definition of a "pseudo-distance" between clusters $$ C_1 $$ and $$ C_2 $$ (none of these functions truly gives a consistent definition of distance in the mathematical sense):
 * $$ d(C_1,C_2)$$ is the distance between the centroids of the two clusters (this definition works for convex clusters),
 * $$ d(C_1,C_2)$$ is the minimum distance between any two points of $$ C_1$$ and $$ C_2$$ (this definition works well for non-convex clusters),
 * $$ d(C_1,C_2)$$ is the average distance of all pairs of points from $$C_1$$ and $$C_2$$.

Other merging rules are also available (such as merging the pair of clusters resulting in the cluster of smaller diameter or radius).

Stopping rule
Several stopping rules are available, for instance:
 * a stopping rule based on the number of clusters, if we already know it,
 * a stopping rule based on the shape of the clusters: we stop when the density of the cluster formed by the last merging is below a threshold,
 * we continue until there is only one cluster (and build the underlying tree by tracking the merging process).

The last rule is only useful if we also want to have some knowledge of the internal structure of the network. An interesting rule would be to combine hierarchical and spectral clustering techniques in order to predict the number of clusters.

Implementation
The major drawback of the basic hierarchical clustering method is that it cannot be used on very large databases (because we start with as many clusters as points). For this reason, it is essential to make the implementation very efficient.

A naive implementation of the algorithm would have a time complexity of $$ O(n^3)$$ ($$ n $$ being the number of points considered): the maximum number of merging steps is $$ n $$ and finding the pair of clusters that are closest to each other requires time $$ O(n^2)$$.

This implementation can be improved to $$ O(n^2\log(n))$$ if we store the pairs of clusters in a binary heap. The overall scheme of the algorithm is then the following:
 * compute the distance between all pairs of points,
 * store all pairs of points in a priority queue (binary heap),
 * Merging process for clusters $$C_1$$ and $$C_2$$:
 * extract the pair with minimum associated distance ($$O(1)$$),
 * create the new cluster $$C_1 = C_1 \bigcup C_2$$,
 * find all pairs $$(C_i,C_1)$$ where $$C_1$$ appears and update the associated distance ($$O(n\log(n))$$),
 * find all the pairs where $$C_2$$ appears and delete them ($$O(n\log(n))$$).

The Python code for the implementation of this algorithm includes the following definitions of classes and functions (in the same order as described above).
 * BasicCluster and Cluster2 both provide a different representation of clusters. BasicCluster simply stores the points of a cluster in a list, whereas Cluster2 represents them as a tree (where each node represents a point that we virtually connect through the tree to two other points belonging to the same cluster) in order to facilitate the merging process when running the algorithm.
 * GeneralPoint is the representation of points used to define clusters of type Cluster2 (points stored in a tree).
 * ClusterPair represents a pair of clusters with an associated distance; each pair will be stored in the binary heap.
 * Heap is the binary heap of pairs of clusters, with classical methods to extract the minimum pair of clusters and to introduce or delete a pair of clusters.
 * HierarchicalClusterFun is the function that executes the hierarchical clustering algorithm and updates the binary heap of pairs of clusters when needed.
 * createData2 is a function used to create some data points that we wish to cluster.
 * readData2 is a function used to read data from a file.

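We do not reproduce those classes here; the listing below is a simplified, self-contained sketch of the heap-based scheme (single-linkage merging with Python's heapq and lazy deletion of stale pairs), not the original implementation.

```python
import heapq
import numpy as np

def hierarchicalClustering(points, nClusters):
    # Simplified single-linkage agglomerative clustering using a binary heap
    # of cluster pairs, with lazy deletion of pairs that reference merged clusters.
    points = np.asarray(points, dtype=float)
    n = len(points)
    clusters = {i: [i] for i in range(n)}       # cluster id -> list of point indices
    dists = {}                                  # (i, j) with i < j -> single-linkage distance
    heap = []
    for i in range(n):
        for j in range(i + 1, n):
            d = float(np.linalg.norm(points[i] - points[j]))
            dists[(i, j)] = d
            heapq.heappush(heap, (d, i, j))
    nextId = n
    while len(clusters) > nClusters and heap:
        d, i, j = heapq.heappop(heap)
        if i not in clusters or j not in clusters:
            continue                            # stale pair: one of the clusters was already merged
        merged = clusters.pop(i) + clusters.pop(j)
        # Single-linkage update: the distance to the new cluster is the minimum
        # of the distances to the two merged clusters.
        for k in clusters:
            dk = min(dists[(min(i, k), max(i, k))], dists[(min(j, k), max(j, k))])
            dists[(k, nextId)] = dk
            heapq.heappush(heap, (dk, k, nextId))
        clusters[nextId] = merged
        nextId += 1
    return [[points[m] for m in members] for members in clusters.values()]
```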
CURE Algorithm
While basic hierarchical clustering gives a lot of freedom regarding the shape of clusters and the type of data treated, it is absolutely not suitable for very large databases. In these cases, one can use advanced algorithms based on the principle of hierarchical clustering, such as the CURE algorithm. One assumption of the CURE algorithm is that the data belong to a Euclidean space: this is not an explicit requirement, but it makes the computation of the position of centroids straightforward, and the CURE algorithm makes use of centroids.

Cure initialization step
This step makes use of the hierarchical clustering algorithm. It consists of the following procedure:
 * cluster a small sample of data in main memory (use a hierarchical clustering method with chosen distance that depends on the goal),
 * for each cluster, select a set of "representative" points that are as far as possible from each other,
 * move each representative point towards the centroid of its cluster (for instance, decrease by 20 percent the distance between each representative point and its centroid).

Completion of the algorithm
After the initialization step, the completion of Cure algorithm consists of the following steps:
 * iteratively merge clusters whose representative points are close to each other, and update the representative points of the cluster created;
 * point assignment:
 * take a point $$p$$ in secondary memory,
 * find the set of representative points that is closest to $$p$$,
 * assign $$p$$ to the cluster associated with this set of representative points.
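As an illustration, here is a sketch of the representative-point selection, shrinking, and point-assignment steps (numpy assumed; the number of representatives and the 20 percent shrinking factor follow the text, and the function names are ours):

```python
import numpy as np

def representatives(cluster, nRep=4, shrink=0.2):
    # Pick nRep points of the cluster that are far from each other
    # (farthest-first), then move each of them towards the centroid.
    cluster = np.asarray(cluster, dtype=float)
    centroid = cluster.mean(axis=0)
    reps = [cluster[np.argmax(np.linalg.norm(cluster - centroid, axis=1))]]
    while len(reps) < min(nRep, len(cluster)):
        dists = np.min(
            np.linalg.norm(cluster[:, None, :] - np.array(reps)[None, :, :], axis=2),
            axis=1)
        reps.append(cluster[np.argmax(dists)])
    reps = np.array(reps)
    return reps + shrink * (centroid - reps)

def assignPoint(p, clusterReps):
    # Point assignment: assign p to the cluster whose representative points
    # are closest to p. clusterReps is a list of arrays of representatives.
    dists = [np.min(np.linalg.norm(reps - np.asarray(p, dtype=float), axis=1))
             for reps in clusterReps]
    return int(np.argmin(dists))
```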

Implementation
If we use a binary heap for the hierarchical clustering step and for the merging procedure, the overall complexity of the algorithm is $$O(n^2\log(n))$$.

The implementation of the algorithm includes the following functions and classes.
 * ClusterRepresentative provides a representation of clusters in terms of representative points
 * a second class represents a pair of ClusterRepresentative's and is used to define the binary heap,
 * Heap is defined exactly in the same way as for classical hierarchical clustering algorithm (we do not give the code for it).
 * Solver contains the algorithm itself, calling all necessary functions.

Applying CURE Algorithm
The CURE algorithm performs well on non-convex clusters. The image below shows how it performs on concentric rings.



As a further extension of the CURE algorithm, it would be interesting to apply it to a large database. It should perform well even with a large amount of data, as the representative-points technique reduces the space complexity (compared to the simple hierarchical algorithm).

= Conclusion =
As a summary of our project, let us analyse the solutions we proposed to solve the problems encountered with k-means. Either a simple improvement of Lloyd's algorithm was possible, or, because of the intrinsic properties of the k-means problem, we had to use more advanced algorithms: hierarchical clustering or spectral clustering.



The two most advanced methods we analysed are CURE and spectral clustering. The advantages and drawbacks of each method are the following:
 * Spectral clustering:
 * Advantages:
 * More freedom in the data format: the data just need to be related in a symmetric way (positive similarity); they do not have to belong to a Euclidean space and the relation does not have to be a proper distance in the mathematical sense.
 * The analysis of the eigenvalues of the graph Laplacian offers a direct way to find the right number of clusters.
 * Drawbacks:
 * Difficult to handle big data: when considering a new data point, it is hard to predict to which cluster it belongs. Depending on the definition of the similarity, there might not be any direct way to predict this (such as distance to centroid for instance).
 * CURE
 * Advantages:
 * Good at handling big data: this is because it uses representative points to summarize the clusters. It is easy to predict to what cluster a new data point belongs by comparing it to the representative points.
 * Drawbacks:
 * Requires a Euclidean space to perform well: working in a Euclidean space allows the use of the Euclidean distance, which makes the computation of the position of centroids straightforward.

In conclusion, spectral clustering is more powerful in terms of the kinds of data it may cover (and some areas remain mysterious, such as asymmetric similarity relations). At the same time, CURE provides a simpler tool that can easily be used for large databases.