r-statistics.co by Selva Prabhakaran


Multinomial Regression

Multinomial regression is much similar to logistic regression but is applicable when the response variable is a nominal categorical variable with more than 2 levels.

Introduction

Multinomial logistic regression can be implemented with mlogit() from mlogit package and multinom() from nnet package. We will use the latter for this example.

Example: Predict Choice of Contraceptive Method

In this example, we will try to predict the choice of contraceptive preferred by women (1=No-use, 2=Long-term, 3=Short-term). We have the education, work, religion, number of children, media exposure and standard of living as variables available in the cmc data. In this example, we will model the choice of contraceptive method cmc as a function of all these variables.

Import Data

cmcData <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/cmc/cmc.data", stringsAsFactors=FALSE, header=F)
colnames(cmcData) <- c("wife_age", "wife_edu", "hus_edu", "num_child", "wife_rel", "wife_work", "hus_occu", "sil", "media_exp", "cmc")
head(cmcData)
#>   wife_age wife_edu hus_edu num_child wife_rel wife_work hus_occu sil media_exp cmc
#> 1       24        2       3         3        1         1        2   3         0   1
#> 2       45        1       3        10        1         1        3   4         0   1
#> 3       43        2       3         7        1         1        3   4         0   1
#> 4       42        3       2         9        1         1        3   3         0   1
#> 5       36        3       3         8        1         1        3   2         0   1
#> 6       19        4       4         0        1         1        3   3         0   1

Convert Numerics to Factors

cmcData$wife_edu <- factor(cmcData$wife_edu, levels=sort(unique(cmcData$wife_edu)))
cmcData$hus_edu <- factor(cmcData$hus_edu, levels=sort(unique(cmcData$hus_edu)))
cmcData$wife_rel <- factor(cmcData$wife_rel, levels=sort(unique(cmcData$wife_rel)))
cmcData$wife_work <- factor(cmcData$wife_work, levels=sort(unique(cmcData$wife_work)))
cmcData$hus_occu <- factor(cmcData$hus_occu, levels=sort(unique(cmcData$hus_occu)))
cmcData$sil <- factor(cmcData$sil, levels=sort(unique(cmcData$sil)))
cmcData$media_exp <- factor(cmcData$media_exp, levels=sort(unique(cmcData$media_exp)))
cmcData$cmc <- factor(cmcData$cmc, levels=sort(unique(cmcData$cmc)))

Create Training and Test Data

# Prepare Training and Test Data
set.seed(100)
trainingRows <- sample(1:nrow(cmcData), 0.7*nrow(cmcData))
training <- cmcData[trainingRows, ]
test <- cmcData[-trainingRows, ]

Build Multinomial Model

library(nnet)
multinomModel <- multinom(cmc ~ ., data=training) # multinom Model
summary (multinomModel) # model summary

#> Call:
#> multinom(formula = cmc ~ ., data = training)
#> 
#> Coefficients:
#>   (Intercept)    wife_age  wife_edu2 wife_edu3 wife_edu4  hus_edu2  hus_edu3
#> 2  -1.5937363 -0.04360644 1.07871567 2.0445226  2.835641 -1.407238 -1.268765
#> 3   0.4376064 -0.10923832 0.03095292 0.4308403  0.979347  1.073331  1.150374
#>     hus_edu4 num_child  wife_rel1 wife_work1  hus_occu2   hus_occu3 hus_occu4
#> 2 -1.3102661 0.3060657 -0.4455628  0.1165996 -0.4943500 -0.40723995 1.2664442
#> 3  0.8607095 0.3376620 -0.2072181  0.3427517 -0.1950799  0.04609764 0.5596847
#>         sil2      sil3      sil4 media_exp1
#> 2 0.81445361 1.2655842 1.3311827 -0.2440084
#> 3 0.03657688 0.3155116 0.5562075 -0.9285685

#> Std. Errors:
#>   (Intercept)   wife_age wife_edu2 wife_edu3 wife_edu4  hus_edu2  hus_edu3
#> 2   0.9964378 0.01485064 0.5520832 0.5649966 0.5834594 0.6270468 0.5823429
#> 3   0.9225193 0.01400097 0.3181759 0.3368472 0.3629088 0.6885676 0.6837955
#>    hus_edu4  num_child wife_rel1 wife_work1 hus_occu2 hus_occu3 hus_occu4
#> 2 0.5886178 0.05094430 0.2391401  0.2001434 0.2473945 0.2444405 0.6986301
#> 3 0.6915629 0.04595659 0.2373718  0.1814554 0.2302729 0.2226137 0.6189151
#>        sil2      sil3      sil4 media_exp1
#> 2 0.5462033 0.5229496 0.5268553  0.4951397
#> 3 0.3106383 0.2907037 0.2943716  0.3819526
#> 
#> Residual Deviance: 1930.658 
#> AIC: 2002.658

Predict on Test Data

predicted_scores <- predict (multinomModel, test, "probs") # predict on new data
#>              1          2          3
#> 6    0.2699230 0.18691129 0.54316572
#> 9    0.3626476 0.08523814 0.55211422
#> 10   0.7564912 0.19409005 0.04941879
#> 12   0.7680439 0.05851352 0.17344257
#> 14   0.8961808 0.04747638 0.05634281
#> 17   0.6677357 0.23683800 0.09542632
#> .
#> .
#> 1464 0.5523515 0.02851988 0.4191287
#> 1471 0.1816340 0.41055467 0.4078114
#> 1472 0.5369837 0.16864237 0.2943739

predicted_class <- predict (multinomModel, test)
#> [1] 3 3 1 1 1 1
#> Levels: 1 2 3

Confusion Matrix and Misclassification Error

table(predicted_class, test$cmc)
#> predicted_class   1   2   3
#>               1 112  26  58
#>               2  19  37  21
#>               3  55  39  75

mean(as.character(predicted_class) != as.character(test$cmc))
#=> 0.4932127

A misclassification error of 49.3% is probably too high. May be it can be improved by improving the model terms or may be the variables are not as good in explaining the contraceptive method used. Either ways, I would encourage the investigator to try other ML approaches as well for this problem.