r-statistics.co by Selva Prabhakaran


Multi-Dimensional Scaling

If you have multiple features for each observation (row) in a dataset and would like to reduce the number of features in the data so as to visualize which observations are similar, Multi Dimensional Scaling (MDS) will help.

The Advantage and Disadvantage of MDS

The advantage with MDS is that you can specify the number of dimensions you want in the output data. The disadvantage however is that it is not possible to deal with un-ordered categorical features.

How to implement MDS?

It can be easily implemented using the cmdscale() in {stats} and the isoMDS() and sammon() from {MASS} package. All these functions take the dissimilarity object of class dist as the main argument and k is the desired number of dimensions in the scaled output.

Below is the code that demonstrates these functions on swiss data that contains fertility and socio-economic data on 47 French speaking provinces in Switzerland.

head(swiss)  # first 6 rows of swiss
#              Fertility Agriculture Examination Education Catholic Infant.Mortality
# Courtelary        80.2        17.0          15        12     9.96             22.2
# Delemont          83.1        45.1           6         9    84.84             22.2
# Franches-Mnt      92.5        39.7           5         5    93.40             20.2
# Moutier           85.8        36.5          12         7    33.77             20.3
# Neuveville        76.9        43.5          17        15     5.16             20.6
# Porrentruy        76.1        35.3           9         7    90.57             26.6

1. cmdscale(): Classical MDS

d <- dist(swiss)  # compute distance matrix
scaled_2 <- cmdscale(d)  # perform MDS. k defaults to 2
head(scaled_2)  # first 6 features
#                    [,1]       [,2]
# Courtelary    37.032433 -17.434879
# Delemont     -42.797334 -14.687668
# Franches-Mnt -51.081639 -19.274036
# Moutier        7.716707  -5.458722
# Neuveville    35.032658   5.126097
# Porrentruy   -44.161953 -25.922412

scaled_3 <- cmdscale(d, k=3)  # setting k=3 to get 3 features.
head(scaled_3)
#>                    [,1]       [,2]       [,3]
#> Courtelary    37.032433 -17.434879 -22.609928
#> Delemont     -42.797334 -14.687668 -12.063389
#> Franches-Mnt -51.081639 -19.274036 -22.541458
#> Moutier        7.716707  -5.458722 -20.799893
#> Neuveville    35.032658   5.126097  -9.218281
#> Porrentruy   -44.161953 -25.922412 -10.045238

2. MASS::isoMDS(): Non-metric Multi-dimensional scaling

library(MASS)
swiss.dist <- dist(swiss)
swiss.mds <- isoMDS(swiss.dist)
head(swiss.mds$points)
#>                    [,1]       [,2]
#> Courtelary    38.850496 -16.154674
#> Delemont     -42.676573 -13.720989
#> Franches-Mnt -53.587659 -21.335763
#> Moutier        6.735536  -4.604116
#> Neuveville    35.622307   4.633972
#> Porrentruy   -44.739479 -25.495702
plot(swiss.mds$points, type = "n")
text(swiss.mds$points, labels = as.character(1:nrow(swiss)))

3. MASS::sammon(): Another form of non-metric MDS

library(MASS)
swiss.dist <- dist(swiss)
swiss.sam <- sammon(swiss.dist)
head(swiss.sam$points)
#>                    [,1]       [,2]
#> Courtelary    37.032433 -17.434879
#> Delemont     -42.797334 -14.687668
#> Franches-Mnt -51.081639 -19.274036
#> Moutier        7.716707  -5.458722
#> Neuveville    35.032658   5.126097
#> Porrentruy   -44.161953 -25.922412

Cluster with k-Means and plot

kmeans_clust <- kmeans(swiss.sam$points, 3)  # k-means wihth 3 clusters.
plot(swiss.sam$points, type = "n", main="MDS with sammon() and clustered", xlab = "X-Dim", ylab="Y-Dim")
text(swiss.sam$points, labels = rownames(swiss), col = kmeans_clust$cluster)  # set color using k-means output