# Probit Regression

Probit regression can be used to solve binary classification problems, just like logistic regression.

While logistic regression uses the cumulative distribution function of the logistic distribution, probit regression uses the cumulative distribution function of the standard normal distribution for the estimation model. Specifying a probit model is similar to logistic regression, i.e. using the glm() function, but with the family argument set to binomial(link="probit").
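The difference between the two link functions can be seen directly in base R, since the logistic CDF and the standard normal CDF are available as plogis() and pnorm():

```r
# The only difference between logit and probit is the inverse link:
# logistic CDF plogis() for logit, standard normal CDF pnorm() for probit.
eta <- seq(-3, 3, by = 1)   # values of the linear predictor

p_logit  <- plogis(eta)     # logistic: 1 / (1 + exp(-eta))
p_probit <- pnorm(eta)      # probit: standard normal CDF

# Both map the real line to (0, 1); the normal CDF has slightly thinner
# tails, so the two models usually produce very similar fitted values.
round(cbind(eta, p_logit, p_probit), 3)
```

Both curves pass through 0.5 at eta = 0 and differ mainly in the tails, which is why probit and logit models on the same data tend to agree closely.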

## Example Problem

Let's use the same example from logistic regression and try to predict if an individual will earn more than $50K. Before running the code below that builds the probit model, you will have to follow the steps from logistic regression to import and prepare the training and test data.

```r
probitMod <- glm(ABOVE50K ~ RELATIONSHIP + AGE + CAPITALGAIN + OCCUPATION + EDUCATIONNUM, data=trainingData, family=binomial(link="probit"))  # build the probit model

predicted <- predict(probitMod, testData, type="response")  # predict the probability scores

summary(probitMod)  # model summary
#> Call:
#> glm(formula = ABOVE50K ~ RELATIONSHIP + AGE + CAPITALGAIN + OCCUPATION +
#>     EDUCATIONNUM, family = binomial(link = "probit"), data = trainingData)
#>
#> Deviance Residuals:
#>     Min       1Q   Median       3Q      Max
#> -4.5104  -0.5506  -0.0005   0.6515   3.6745
#>
#> Coefficients:
#>                                Estimate Std. Error z value Pr(>|z|)
#> (Intercept)                  -2.590e+00  1.401e-01 -18.492  < 2e-16 ***
#> RELATIONSHIP Not-in-family   -1.330e+00  4.039e-02 -32.929  < 2e-16 ***
#> RELATIONSHIP Other-relative  -1.565e+00  1.406e-01 -11.135  < 2e-16 ***
#> RELATIONSHIP Own-child       -1.986e+00  8.785e-02 -22.611  < 2e-16 ***
#> RELATIONSHIP Unmarried       -1.466e+00  6.308e-02 -23.241  < 2e-16 ***
#> RELATIONSHIP Wife             1.758e-01  6.387e-02   2.753 0.005910 **
#> AGE                           1.523e-02  1.358e-03  11.214  < 2e-16 ***
#> CAPITALGAIN                   1.744e-04  9.158e-06  19.045  < 2e-16 ***
#> OCCUPATION Adm-clerical       4.955e-01  9.583e-02   5.170 2.34e-07 ***
#> OCCUPATION Armed-Forces       7.364e-01  1.034e+00   0.712 0.476343
#> OCCUPATION Craft-repair       7.247e-01  9.246e-02   7.838 4.59e-15 ***
#> OCCUPATION Exec-managerial    1.141e+00  9.117e-02  12.520  < 2e-16 ***
#> OCCUPATION Farming-fishing    7.620e-02  1.237e-01   0.616 0.537808
#> OCCUPATION Handlers-cleaners  3.048e-01  1.295e-01   2.353 0.018599 *
#> OCCUPATION Machine-op-inspct  4.044e-01  1.060e-01   3.816 0.000136 ***
#> OCCUPATION Other-service      8.727e-02  1.098e-01   0.795 0.426548
#> OCCUPATION Priv-house-serv   -1.605e+00  1.245e+00  -1.289 0.197400
#> OCCUPATION Prof-specialty     9.801e-01  9.354e-02  10.477  < 2e-16 ***
#> OCCUPATION Protective-serv    8.934e-01  1.271e-01   7.031 2.06e-12 ***
#> OCCUPATION Sales              8.024e-01  9.325e-02   8.605  < 2e-16 ***
#> OCCUPATION Tech-support       9.705e-01  1.157e-01   8.387  < 2e-16 ***
#> OCCUPATION Transport-moving   6.133e-01  1.054e-01   5.821 5.86e-09 ***
#> EDUCATIONNUM                  1.575e-01  7.696e-03  20.464  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#>     Null deviance: 15216.0  on 10975  degrees of freedom
#> Residual deviance:  8753.6  on 10953  degrees of freedom
#> AIC: 8799.6
#>
#> Number of Fisher Scoring iterations: 8
```

Almost all the Xs have p values less than 0.05, so the significance of the predictors should not be of any concern here. It is okay that a few category levels of OCCUPATION are not significant, as most of the other levels have acceptable p values.

## Model Diagnostics

The model diagnostics for probit models are the same as those seen in the case of logistic regression.

#### Determine Optimal Cutoff

```r
library(InformationValue)
optCutOff <- optimalCutoff(testData$ABOVE50K, predicted)[1]
#=> [1] 0.88
```
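If you don't have the adult income data at hand, the same workflow can be tried end-to-end on simulated data. This is a minimal sketch: the variable names (age, edu) and coefficients are made up for illustration and are not part of the original example.

```r
# Simulate a binary outcome whose true data-generating process is probit.
set.seed(1)
n   <- 500
age <- rnorm(n, mean = 40, sd = 10)
edu <- sample(5:16, n, replace = TRUE)

eta <- -6 + 0.05 * age + 0.3 * edu       # linear predictor (probit scale)
y   <- rbinom(n, 1, pnorm(eta))          # event probability via normal CDF
d   <- data.frame(y, age, edu)

# Fit the probit model exactly as in the main example.
mod  <- glm(y ~ age + edu, data = d, family = binomial(link = "probit"))
pred <- predict(mod, d, type = "response")  # predicted probabilities in (0, 1)

summary(mod)
```

With 500 observations, the fitted coefficients typically land close to the true values of 0.05 and 0.3, and predict(..., type="response") returns probabilities rather than values of the linear predictor.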

#### Concordance

```r
Concordance(testData$ABOVE50K, predicted)
#> $Concordance
#> [1] 0.891276
```

The concordance is numerically similar to that of the logit model.

#### Misclassification Error

```r
misClassError(testData$ABOVE50K, predicted, threshold = optCutOff)
#=> 0.0894
```

The misclassification error is marginally higher than that of the logit model.
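To make clear what this number measures, the misclassification error can be computed by hand in base R. The misclass_error helper below is hypothetical (not part of InformationValue) and assumes that scores at or above the cutoff count as predicted events:

```r
# Misclassification error = share of observations whose predicted class
# (probability >= cutoff) disagrees with the actual label.
misclass_error <- function(actuals, pred_probs, cutoff = 0.5) {
  pred_class <- as.integer(pred_probs >= cutoff)
  mean(pred_class != actuals)
}

actuals <- c(0, 0, 1, 1, 0)
probs   <- c(0.2, 0.6, 0.8, 0.4, 0.1)
misclass_error(actuals, probs, cutoff = 0.5)  # 2 of 5 misclassified -> 0.4
```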

#### Specificity and Sensitivity

```r
sensitivity(testData$ABOVE50K, predicted, threshold = optCutOff)
#> 0.359541

specificity(testData$ABOVE50K, predicted, threshold = optCutOff)
#> 0.9780574
```

The ability to detect the events (i.e. persons earning more than $50K) is about 5% higher than that of the logit model with the same formula. The specificity, while still quite high, is marginally lower than that of the logit model.

#### Confusion Matrix

```r
confusionMatrix(testData$ABOVE50K, predicted, threshold = optCutOff)
# The columns are actuals, while rows are predicteds.
#>       0    1
#> 0 18810 1507
#> 1   422  846
```
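Sensitivity and specificity follow directly from the four cells of the confusion matrix. As a base-R sketch (the conf_stats helper is hypothetical, again assuming probabilities at or above the cutoff are classified as events):

```r
# sensitivity = TP / (TP + FN): share of actual events detected
# specificity = TN / (TN + FP): share of actual non-events detected
conf_stats <- function(actuals, pred_probs, cutoff = 0.5) {
  pred_class <- as.integer(pred_probs >= cutoff)
  tp <- sum(pred_class == 1 & actuals == 1)
  tn <- sum(pred_class == 0 & actuals == 0)
  fp <- sum(pred_class == 1 & actuals == 0)
  fn <- sum(pred_class == 0 & actuals == 1)
  list(confusion   = table(predicted = pred_class, actual = actuals),
       sensitivity = tp / (tp + fn),
       specificity = tn / (tn + fp))
}

conf_stats(c(1, 1, 0, 0, 0), c(0.9, 0.3, 0.2, 0.7, 0.1), cutoff = 0.5)
# sensitivity 0.5 (1 of 2 events caught); specificity 2/3 (2 of 3 non-events)
```

Applying the same arithmetic to the matrix above reproduces the reported figures: 846 / (846 + 1507) detected events and 18810 / (18810 + 422) detected non-events.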