#K-Means #PAM

Clustering

1 Introduction

A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are placed in different clusters.

1.1 Distances used in clustering

Distances are normally used to measure the similarity or dissimilarity between two data objects.

Properties of a valid distance metric:

  • d(i,j) ≥ 0 (non-negativity)
  • d(i,i) = 0 (identity)
  • d(i,j) = d(j,i) (symmetry)
  • d(i,j) ≤ d(i,k) + d(k,j) (triangle inequality)

1.1.1 Distance between objects

  • Euclidean distance:

\[ d(x,y) = \sqrt{\sum_{i=1}^n (x_i-y_i)^2} \]

  • Manhattan distance:

\[ d(x,y) = \sum_{i=1}^n |x_i-y_i| \]

  • Minkowski distance:

\[ d(x,y) = \left(\sum_{i=1}^n |x_i-y_i|^q\right)^{1/q} \]

If q = 1, the Minkowski distance reduces to the Manhattan distance; if q = 2, it reduces to the Euclidean distance.
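
As a minimal sketch, all three distances follow directly from the Minkowski formula; the function names are illustrative, not from the notes:

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between equal-length numeric vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

def manhattan(x, y):
    return minkowski(x, y, 1)  # q = 1

def euclidean(x, y):
    return minkowski(x, y, 2)  # q = 2

x, y = [1, 2, 3], [4, 6, 8]
print(manhattan(x, y))  # |1-4| + |2-6| + |3-8| = 12
print(euclidean(x, y))  # sqrt(9 + 16 + 25) ≈ 7.07
```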

1.2 Types of variables

Interval-scaled variables

  • Continuous measurements (weight, temperature, etc.)

Binary variables

  • Variables with 2 states (on/off, yes/no)

Nominal variables

  • A generalization of the binary variable in that it can take more than 2 states (e.g., color: red, yellow, blue, green)

Ordinal

  • Ranking is important (e.g., medals: gold, silver, bronze)

Ratio variables

  • A positive measurement on a nonlinear scale, such as an exponential scale (e.g., growth)

Variables of mixed types

  • A database may contain several of the above variable types at once

1.3 Calculating distance based on types of variables

1.3.1 Binary variables
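
A common textbook formulation (assumed here, since the notes give no formula) counts matches and mismatches between two binary objects i and j: with q attributes positive in both, t negative in both, and r and s positive in only one of the two,

\[ d(i,j) = \frac{r+s}{q+r+s+t} \quad \text{(symmetric binary)}, \qquad d(i,j) = \frac{r+s}{q+r+s} \quad \text{(asymmetric binary, Jaccard)} \]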

1.3.2 Categorical Variables

1.3.3 Nominal Variables and Ordinal Variables
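
A standard convention (an assumption, as this subsection is otherwise empty): for nominal variables, use the simple matching distance; for ordinal variables, replace each value by its rank \( r_{if} \in \{1, \dots, M_f\} \), rescale it to [0,1], and then treat it as interval-scaled:

\[ d(i,j) = \frac{p-m}{p}, \qquad z_{if} = \frac{r_{if}-1}{M_f-1} \]

where p is the total number of variables and m the number of variables on which i and j match.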

1.3.4 Mixed variables
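
When variable types are mixed, a weighted combination in the style of Gower's coefficient is the usual recipe (again an assumption, not spelled out in the notes): each variable f contributes its own dissimilarity \( d_{ij}^{(f)} \), and the indicator \( \delta_{ij}^{(f)} \) is 0 when either value is missing and 1 otherwise:

\[ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]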

1.3.5 Cosine Similarity
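
Cosine similarity compares the orientation of two vectors rather than their magnitude:

\[ \cos(x,y) = \frac{x \cdot y}{\|x\|\,\|y\|} \]

As a minimal sketch in plain Python (the function name is illustrative):

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1, 0, 1], [1, 1, 0]))  # 1 / (sqrt(2)*sqrt(2)) = 0.5
```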

2 Clustering Methods

Clustering methods can be classified into the following categories:

  • Partitioning Method
  • Hierarchical Method
  • Density-based Method

2.1 Partitioning Method

Suppose we are given a database of ‘n’ objects; the partitioning method constructs ‘k’ partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups that satisfy the following requirements:

  • No overlap between clusters.
  • No empty clusters.
  • No points/data-points should be excluded.

Examples: K-Means, PAM (k-medoids)
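
A compact sketch of Lloyd's algorithm for K-Means (illustrative, not a production implementation): assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until the centroids stop moving.

```python
import random

def kmeans(points, k, max_iter=100):
    """Naive K-Means on a list of equal-length numeric tuples."""
    centroids = random.sample(points, k)  # pick k initial centroids
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Update step: centroid = mean of its cluster (kept as-is if empty).
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```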

2.2 Hierarchical Method

This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two approaches here:

  • Agglomerative Approach
  • Divisive Approach

Examples: AGNES (agglomerative) and DIANA (divisive)

2.2.1 Agglomerative Approach

This approach is also known as the bottom-up approach. We start with each object forming its own group and keep merging the objects or groups that are closest to one another. This continues until all groups are merged into one or a termination condition holds.
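
A brief illustration using SciPy (assumed available; the data is invented): `linkage` records the full bottom-up merge history, and `fcluster` cuts it into a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [8.0, 8.0], [8.1, 8.3]])
Z = linkage(X, method="single")                  # merge closest groups first
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
print(labels)  # e.g. [1 1 2 2]
```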

2.2.2 Divisive Approach

This approach is also known as the top-down approach. We start with all objects in a single cluster and, at each iteration, split a cluster into smaller clusters. This continues until each object is in its own cluster or a termination condition holds. Hierarchical methods are rigid: once a merge or split is done, it can never be undone.

2.3 Density-based Method

This method is based on the notion of density. The basic idea is to keep growing a cluster as long as the density in its neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
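
A minimal sketch with scikit-learn's DBSCAN (assumed available; data and parameter values are illustrative): `eps` is the neighborhood radius and `min_samples` the minimum number of points it must contain; points in no dense region are labeled -1 (noise).

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.2], [1.2, 0.9],
              [8.0, 8.0], [8.1, 8.2], [25.0, 25.0]])
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 -1]; -1 marks noise
```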