Probit Regression
Probit regression can used to solve binary classification problems, just like logistic regression.
While logistic regression used a cumulative logistic function, probit regression uses a normal cumulative density function for the estimation model. Specifying a probit model is similar to logistic regression, i.e. using the glm()
function but with family
argument set to binomial(link="probit")
.
Example Problem
Lets use the same example from logistic regression and try to predict if an individual will earn more than $50K. So before running the code below that builds the probit model, you will have to follow the steps from logistic regression to import and prepare the training and test data.
probitMod <- glm(ABOVE50K ~ RELATIONSHIP + AGE + CAPITALGAIN + OCCUPATION + EDUCATIONNUM, data=trainingData, family=binomial(link="probit")) # build the logit model
predicted <- predict(probitMod, testData, type="response") # predict the probability scores
summary(probitMod) # model summary
#> Call:
#> glm(formula = ABOVE50K ~ RELATIONSHIP + AGE + CAPITALGAIN + OCCUPATION +
#> EDUCATIONNUM, family = binomial(link = "probit"), data = trainingData)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -4.5104 -0.5506 -0.0005 0.6515 3.6745
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.590e+00 1.401e-01 -18.492 < 2e-16 ***
#> RELATIONSHIP Not-in-family -1.330e+00 4.039e-02 -32.929 < 2e-16 ***
#> RELATIONSHIP Other-relative -1.565e+00 1.406e-01 -11.135 < 2e-16 ***
#> RELATIONSHIP Own-child -1.986e+00 8.785e-02 -22.611 < 2e-16 ***
#> RELATIONSHIP Unmarried -1.466e+00 6.308e-02 -23.241 < 2e-16 ***
#> RELATIONSHIP Wife 1.758e-01 6.387e-02 2.753 0.005910 **
#> AGE 1.523e-02 1.358e-03 11.214 < 2e-16 ***
#> CAPITALGAIN 1.744e-04 9.158e-06 19.045 < 2e-16 ***
#> OCCUPATION Adm-clerical 4.955e-01 9.583e-02 5.170 2.34e-07 ***
#> OCCUPATION Armed-Forces 7.364e-01 1.034e+00 0.712 0.476343
#> OCCUPATION Craft-repair 7.247e-01 9.246e-02 7.838 4.59e-15 ***
#> OCCUPATION Exec-managerial 1.141e+00 9.117e-02 12.520 < 2e-16 ***
#> OCCUPATION Farming-fishing 7.620e-02 1.237e-01 0.616 0.537808
#> OCCUPATION Handlers-cleaners 3.048e-01 1.295e-01 2.353 0.018599 *
#> OCCUPATION Machine-op-inspct 4.044e-01 1.060e-01 3.816 0.000136 ***
#> OCCUPATION Other-service 8.727e-02 1.098e-01 0.795 0.426548
#> OCCUPATION Priv-house-serv -1.605e+00 1.245e+00 -1.289 0.197400
#> OCCUPATION Prof-specialty 9.801e-01 9.354e-02 10.477 < 2e-16 ***
#> OCCUPATION Protective-serv 8.934e-01 1.271e-01 7.031 2.06e-12 ***
#> OCCUPATION Sales 8.024e-01 9.325e-02 8.605 < 2e-16 ***
#> OCCUPATION Tech-support 9.705e-01 1.157e-01 8.387 < 2e-16 ***
#> OCCUPATION Transport-moving 6.133e-01 1.054e-01 5.821 5.86e-09 ***
#> EDUCATIONNUM 1.575e-01 7.696e-03 20.464 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 15216.0 on 10975 degrees of freedom
#> Residual deviance: 8753.6 on 10953 degrees of freedom
#> AIC: 8799.6
#>
#> Number of Fisher Scoring iterations: 8
Almost all the Xs have p values less than 0.05. So the significance of predictors should be of any concern here. It is okay to ignore few category levels of OCCUPATION
not being significant, as most of the other levels have acceptable p values.
Model Diagnostics
The model diagnostics for probit models are the same as seen in case of logistic regression.
Determine Optimal Cutoff
library(InformationValue)
optCutOff <- optimalCutoff(testData$ABOVE50K, predicted_scores)[1]
#=> [1] 0.88
ROC Curve
plotROC(testData$ABOVE50K, predicted_scores)
The area under ROC curve is slightly higher than that of the logit model’s. Therefore this could be the preferred model.
Concordance
Concordance(testData$ABOVE50K, predicted)
#> $Concordance
#> [1] 0.891276
Concordance is numerially similar to that of the logit model’s.
Misclassification Error
misClassError(testData$ABOVE50K, predicted_scores, threshold = optCutOff)
#=> 0.0894
The misclassification error is marginally higher than that of logit model.
Specificity and Sensitivity
sensitivity(testData$ABOVE50K, predicted_scores, threshold = optCutOff)
#> 0.359541
specificity(testData$ABOVE50K, predicted_scores, threshold = optCutOff)
#> 0.9780574
The ability to detect the events (i.e. persons earning more than $50K), is 5% higher than that of logit model with same formula. But the specificity that is pretty high is marginally lower compared to logit.
Confusion Matrix
confusionMatrix(testData$ABOVE50K, predicted_scores, threshold = optCutOff)
# The columns are actuals, while rows are predicteds.
#> 0 1
#> 0 18810 1507
#> 1 422 846