Cluster validity algorithms
Cluster validity – measuring goodness of a clustering relative to others created by other clustering algorithms, or by the same algorithms using different parameter values. Cluster validation is very important issue in clustering analysis because the result of clustering needs to be validated in most applications. In most clustering algorithms, the number of clusters is set as user parameter. There are a lot of approaches to find the best number of clusters.
This technique (Dunn, 1974) is based on the idea of identifying the cluster sets that are compact and well separated. For any partition of clusters, where ci represent the i-cluster of such partition, the Dunn’s validation index, D, could be calculated with the following formula:
,
where d(ci,cj) – distance between clusters ci, and cj (intercluster distance); d'(ck)} – intracluster distance of cluster ck , n – number of clusters. The minimum is calculating for number of clusters defined by the mentioned partition. The main goal of the measure is to maximise the intercluster distances and minimise the intracluster distances. Therefore, the number of cluster that maximise D is taken as the optimal number of the clusters.
This index (Davies and Bouldin, 1979) is a function of the ratio of the sum of within-cluster scatter to between-cluster separation
,
where
-
number of clusters,
-
average distance of all objects from the cluster to their cluster centre,
-
distance between clusters centres. Hence the ratio is small if the clusters
are compact and far from each other. Consequently, Davies-Bouldin index will
have a small value for a good clustering.
The Silhouette validation technique (Rousseeuw, 1987) calculates the silhouette width for each sample, average silhouette width for each cluster and overall average silhouette width for a total data set. Using this approach each cluster could be represented by so-called silhouette, which is based on the comparison of its tightness and separation. The average silhouette width could be applied for evaluation of clustering validity and also could be used to decide how good is the number of selected clusters.
To construct the silhouettes S(i) the following formula is used:
,
where a(i) –average dissimilarity of i-object to all other objects in the same cluster; b(i) – minimum of average dissimilarity of i-object to all objects in other cluster (in the closest cluster).
It is followed from the
formula that
.
If silhouette value is close to 1, it means that sample is
“well-clustered” and it was assigned to a very appropriate cluster. If
silhouette value is about zero, it means that that sample could be assign to
another closest cluster as well, and the sample lies equally far away from
both clusters. If silhouette value is close to –1, it means that sample is
“misclassified” and is merely somewhere in between the clusters. The
overall average silhouette width for the entire plot is simply the average of
the S(i) for all objects in the whole dataset.
The largest overall average silhouette indicates the best clustering (number of cluster). Therefore, the number of cluster with maximum overall average silhouette width is taken as the optimal number of the clusters.
This index (Hubert and Schultz, 1976) is defined as follows:

where S is the sum of distances over all pairs of patterns from the same cluster. Let l be the number of those pairs. Then Smin is the sum of the l smallest distances if all pairs of patterns are considered (i.e. if the patterns can belong to different clusters). Similarly Smax is the sum of the l largest distance out of all pairs. Hence a small value of C indicates a good clustering.
For this index (Goodman and Kruskal, 1954) all possible quadruples (q, r, s, t) of input parameters are considered. Let d(x,y) be the distance between the samples x and y. A quadruple is called concordant if one of the following two conditions is true:
d(q,r) < d(s,t), q and r are in the same cluster, and s and t are in different clusters.
d(q,r) > d(s,t), q and r are in different clusters, and s and t are in the same cluster.
By contrast, a quadruple is called disconcordant if one of following two conditions is true:
d(q,r) < d(s,t), q and r are in different clusters, and s and t are in the same cluster.
d(q,r) > d(s,t), q and r are in the same cluster, and s and t are in different clusters.
Obviously, a good clustering is one with many concordant and few disconcordant quadruples. But some quadruples may be neither concordant nor disconcordant. Let Sc and Sd denote the number of concordant and disconcordant quadruples, respectively. Then the Goodman-Kruskal index is defined as:
![]()
This tecnique (Pauwels and Frederix, 1999) is based on the assertion that neighbouring instances in feature space often occur in the same natural cluster. The isolation of each cluster is measured using the k-nearest neighbour norm, where the norm a of each instance a is defined as the proportion of its k nearest neighbours that have been assigned to the same cluster as a. By averaging over all n instances in the data, the homogeneity of a partition may be computed as follows:
![]()
A high value for this measure indicates well-separated clusters. The authors acknowledge that, while this index rewards partitions that assign regions of well-connected instances to the same cluster, it fails to penalise partitions where well-separated clusters are merged, since only a limited locality is considered for each point.
In this index (Jaccard, 1912), which has been commonly applied to assess the similarity between different partitions of the same dataset, the level of agreement between a set of class labels C and a clustering result K is determined by the number of pairs of points assigned to the same cluster in both partitions:
![]()
where a denotes the number of pairs of points with the same label in C and assigned to the same cluster in K, b denotes the number of pairs with the same label, but in different clusters and c denotes the number of pairs in the same cluster, but with different class labels. The index produces a result in the range [0, 1], where a value of 1.0 indicates that C and K are identical.
This index (Rand, 1971) simply measures the number of pairwise agreements between a clustering K and a set of class labels C, normalised so that the value lies between 0 and 1:
![]()
where a denotes the number of pairs of points with the same label in C and assigned to the same cluster in K, b denotes the number of pairs with the same label, but in different clusters, c denotes the number of pairs in the same cluster, but with different class labels and d denotes the number of pairs with a different label in C that were assigned to a different cluster in K. The index produces a result in the range [0,1], where a value of 1.0 indicates that C and K are identical. A high value for this measure generally indicates a high level of agreement between a clustering and the annotated natural classes.
A simple approximation of accuracy for unsupervised learning that employs external class information was described by Topchy et al. (2003). By finding the optimal correspondence between a dataset's annotated class labels and the clusters in a given partition, a performance measure may be derived that reflects the proportion of instances that were correctly assigned. A high value for this measure generally indicates a high level of agreement between a clustering and the annotated natural classes. Note that this measure is only applicable when the number of clusters k is the same as the number of natural classes.
Literature:
J.C. Dunn. Well separated clusters and optimal fuzzy partitions. 1974. J.Cybern. 4. 95-104.
D.L. Davies, D.W. Bouldin. A cluster separation measure. 1979. IEEE Trans. Pattern Anal. Machine Intell. 1 (4). 224-227.
P.J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. 1987. Journal of Computational and Applied Mathematics. 20. 53-65.
L. Hubert, J. Schultz. Quadratic assignment as a general data-analysis strategy.1976. British Journal of Mathematical and Statistical Psychologie. 29. 190-241.
L. Goodman, W. Kruskal. Measures of associations for cross-validations. 1954. J. Am. Stat. Assoc. 49. 732-764.
E.J. Pauwels .G. Frederix, G. Finding salient regions in images: nonparametric clustering for image segmentation and grouping. 1999. Computer Vision and Image Understanding, 75. 73-85.
P. Jaccard. The distribution of flora in the alpine zone. 1912. New Phytologist. 11, 37–50.
W.M. Rand. Objective criteria for the evaluation of clustering methods. 1971. Journal of the American Statistical Association. 846-850.
A. Topchy, A. Jain, W. Punch. Combining multiple weak clusterings.2003. Proc. Third IEEE International Conference on Data Mining (ICDM'03). 331-338.
Links