Cluster analysis is the process of grouping a set of items such that items in the same group (also called a cluster) are more similar to each other than to items in other groups. Assigning objects to groups this way is sometimes called unsupervised classification, to distinguish it from classification proper, which relies on predefined class labels. Cluster analysis groups data based only on information found in the data that describes the objects and their relationships. It is useful in the initial, exploratory phase of data analysis for understanding the nature and structure of groups that may be present in the data. Here are some examples of cluster analysis in the real world:
- Biologists have classified all living things into various families. All cat species belong to the family Felidae. Animals within one family share more characteristics with each other than with animals from another family.
- The World Wide Web (WWW) contains billions of web pages, and a single search can return thousands of results. Clustering can be used to group similar search results together, helping users make sense of what the search returned.
- Businesses may collect large amounts of information on customers. Clustering can be used to understand the differences among customers and group them into a smaller number of segments for targeted marketing activities.
- If you are trying to find the optimal locations for a pizza restaurant chain, grouping the population into clusters based on proximity can help determine the best possible locations for the chain's outlets.
Types of Clusters
Clusters can be classified into the following types:
- Hierarchical vs. partitional – In hierarchical clustering, the clusters are nested: one cluster can contain multiple sub-clusters underneath it. In partitional clustering, by contrast, the data items are divided into separate, non-overlapping clusters. The classification of animals into various families is an example of a hierarchical clustering.
- Exclusive vs. overlapping vs. fuzzy – In exclusive clustering, each data item is assigned to exactly one cluster. In overlapping clustering, a data item may belong to multiple clusters. In fuzzy clustering, every data item belongs to every cluster with a membership weight (often interpreted as a probability) between 0 and 1. For example, in web search some topics may be classified under math, some under science, and some may fall under both categories.
- Complete vs. partial – In complete clustering, every object is assigned to a cluster; in partial clustering, some items may not belong to any identified cluster and are excluded from the grouping. For example, when grouping business customers, some may be classified as key customers and others as non-key customers, while for a third group the data may not yet be sufficient to place them in either category, so they are set aside for the time being. This is an example of partial clustering.
There are a number of algorithms that can be used to perform cluster analysis. Two of the most widely used are K-means and agglomerative hierarchical clustering.
- K-means clustering: This is a prototype-based, partitional clustering method. The algorithm requires that the number of clusters (K) be known upfront. K initial centers are chosen at random; the algorithm then assigns each data point to the cluster whose center is nearest. Based on the points assigned to each cluster, it recomputes that cluster's centroid and repeats the process until the centroids change very little. The end result is the location of the K centers and the assignment of each data item to a cluster.
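The K-means loop described above can be sketched as follows. This is a minimal illustration using NumPy; the function name, the random initialization from the data points, and the convergence check are illustrative choices, not a reference implementation.

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K-means sketch: assign points to nearest center, recompute centroids."""
    rng = np.random.default_rng(seed)
    # Pick K of the data points at random as the initial centers.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iters):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of the points assigned to it;
        # keep the old centroid if a cluster ends up empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids have essentially stopped moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs should end up in two different clusters.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
centers, labels = kmeans(pts, k=2)
```

In practice, the random initialization matters: a poor choice of initial centers can leave K-means stuck in a bad local optimum, which is why real implementations typically run the algorithm several times from different starting points.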
- Agglomerative hierarchical clustering: This produces a hierarchy of clusters by starting with each point as a singleton cluster and then repeatedly merging the two clusters that are closest to each other. After each merge, the proximity matrix is updated to reflect the proximity between the new cluster and the remaining clusters, and the process repeats until only one cluster remains. Since each cluster contains multiple points, the distance between clusters can be defined in several ways (for example, single link, complete link, or group average).
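A naive version of this merge loop can be sketched as below, using single-link distance (the distance between the two closest members of a pair of clusters). Here the distances are recomputed on each pass rather than maintained in an explicit proximity matrix, and the `num_clusters` stopping point is an illustrative assumption; real implementations update a stored matrix for efficiency and usually return the full merge hierarchy.

```python
import math

def single_link_distance(a, b):
    """Single linkage: distance between the two closest members of clusters a and b."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points, num_clusters=1):
    """Naive agglomerative clustering sketch: merge the two closest clusters
    until only num_clusters remain."""
    # Start with each point as its own singleton cluster.
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest single-link distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        # Merge the closer pair; recomputing distances on the next pass
        # plays the role of updating the proximity matrix.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Two tight pairs of points should merge into two clusters.
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
result = agglomerative(pts, num_clusters=2)
```

Recomputing every pairwise distance each iteration makes this sketch O(n^3) overall, which is one concrete reason the hierarchical approach is described below as expensive in computation and storage.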
Each of the algorithms described above has some limitations:
- K-means clustering is simple and efficient and can be used for a wide variety of data types. However, it cannot handle non-globular clusters or clusters of very different sizes and densities, and it can have trouble with data that contain outliers.
- Agglomerative hierarchical clustering is expensive in terms of computation and storage, but it can produce better-quality clusters. It can have trouble finding the right clusters when the data are noisy or high-dimensional, as document data often are.