Tuesday, 30 July 2019

Logistic Regression using R - How to handle class bias?

Logistic Regression using R - How to handle class bias?

Doing this blog as a revision into some primitive topics in Machine Learning.

Since, almost all my current work is done using R, I will implement logistic regression using this language.

We will be using adult data set. After all is set, now we dive straight into the code. Further explanations shall not be provided as most of the code is self-explanatory in nature.

> idata <- read.csv("adult_csv.csv")
> summary(idata)
      age                   workclass         fnlwgt               education     education.num  
 Min.   :0.000   Private         :33906   Min.   :  12285   HS-grad     :15784   Min.   : 1.00  
 1st Qu.:1.000   Self-emp-not-inc: 3862   1st Qu.: 117551   Some-college:10878   1st Qu.: 9.00  
 Median :2.000   Local-gov       : 3136   Median : 178145   Bachelors   : 8025   Median :10.00  
 Mean   :1.771                   : 2799   Mean   : 189664   Masters     : 2657   Mean   :10.08  
 3rd Qu.:3.000   State-gov       : 1981   3rd Qu.: 237642   Assoc-voc   : 2061   3rd Qu.:12.00  
 Max.   :4.000   Self-emp-inc    : 1695   Max.   :1490400   11th        : 1812   Max.   :16.00  
                 (Other)         : 1463                     (Other)     : 7625                  
               marital.status            occupation            relationship                   race      
 Divorced             : 6633   Prof-specialty : 6172   Husband       :19716   Amer-Indian-Eskimo:  470  
 Married-AF-spouse    :   37   Craft-repair   : 6112   Not-in-family :12583   Asian-Pac-Islander: 1519  
 Married-civ-spouse   :22379   Exec-managerial: 6086   Other-relative: 1506   Black             : 4685  
 Married-spouse-absent:  628   Adm-clerical   : 5611   Own-child     : 7581   Other             :  406  
 Never-married        :16117   Sales          : 5504   Unmarried     : 5125   White             :41762  
 Separated            : 1530   Other-service  : 4923   Wife          : 2331                             
 Widowed              : 1518   (Other)        :14434                                                    
     sex         capitalgain      capitalloss      hoursperweek         native.country    class      
 Female:16192   Min.   :0.0000   Min.   :0.0000   Min.   :0.000   United-States:43832   <=50K:37155  
 Male  :32650   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:2.000   Mexico       :  951   >50K :11687  
                Median :0.0000   Median :0.0000   Median :2.000                :  857                
                Mean   :0.2003   Mean   :0.1149   Mean   :1.951   Philippines  :  295                
                3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:2.000   Germany      :  206                
                Max.   :4.0000   Max.   :4.0000   Max.   :4.000   Puerto-Rico  :  184                
                                                                  (Other)      : 2517  
> head(idata)
  age        workclass fnlwgt education education.num     marital.status        occupation  relationship  race
1   2        State-gov  77516 Bachelors            13      Never-married      Adm-clerical Not-in-family White
2   3 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse   Exec-managerial       Husband White
3   2          Private 215646   HS-grad             9           Divorced Handlers-cleaners Not-in-family White
4   3          Private 234721      11th             7 Married-civ-spouse Handlers-cleaners       Husband Black
5   1          Private 338409 Bachelors            13 Married-civ-spouse    Prof-specialty          Wife Black
6   2          Private 284582   Masters            14 Married-civ-spouse   Exec-managerial          Wife White
     sex capitalgain capitalloss hoursperweek native.country class
1   Male           1           0            2  United-States <=50K
2   Male           0           0            0  United-States <=50K
3   Male           0           0            2  United-States <=50K
4   Male           0           0            2  United-States <=50K
5 Female           0           0            2           Cuba <=50K
6 Female           0           0            2  United-States <=50K

Checking class bias now, on column class,

> table(idata$class)
<=50K  >50K 
37155 11687

There is a class bias, a condition observed when the proportion of one event is much smaller than proportion of other event.
Thus, we must sample the observations in approximately equal proportions to get better models.

CREATING TRAINING AND TESTING SAMPLES

#training data
input_ones <- idata[which(idata$class == ">50K"), ]  # all <=50k values
input_zeros <- idata[which(idata$class == "<=50K"), ]  # all <=50k values
set.seed(100)  # for repeatability of samples
input_ones_training_rows <- sample(1:nrow(input_ones), 0.7*nrow(input_ones))  # >50k values for training
input_zeros_training_rows <- sample(1:nrow(input_zeros), 0.7*nrow(input_ones))  # <=50k values training. 
training_ones <- input_ones[input_ones_training_rows, ]  
training_zeros <- input_zeros[input_zeros_training_rows, ]
trainingData <- rbind(training_ones, training_zeros)  # row bind  
# Create Test Data
test_ones <- input_ones[-input_ones_training_rows, ]
test_zeros <- input_zeros[-input_zeros_training_rows, ]
testData <- rbind(test_ones, test_zeros)  # row bind 

One simple optional step is to create Weight of Evidence (WOE) for categorical variables.
This is a very simple, yet useful method - Read more here.

idata_woe <- idata
str(idata_woe)

table(idata_woe$class)

library(car)
library(dplyr)

idata_woe$class <- dplyr::recode(idata_woe$class,"<=50K" = 0,">50K" = 1)
#idata_woe$class <- car::recode(idata_woe$class,"<=50K" = 0,">50K" = 1) - this is not working
idata_woe

i <- 10
q <- quantile(idata_woe$class,
              probs = c(1:(i-1)/i),
              na.rm = TRUE,
              type=3)
cuts <- unique(q)


WOE_Class <- table(findInterval(idata_woe$class,vec = cuts,rightmost.closed = FALSE),idata_woe$class)
WOE_Class <- as.data.frame.matrix(WOE_Class)
WOE_Class$`0`<-rowSums(WOE_Class)
WOE_Class$WOE <- log((WOE_Class$`1`*sum(WOE_Class$`0`))/(WOE_Class$`0`*sum(WOE_Class$`1`)))
> WOE_Class
      0     1      WOE
1 37155     0     -Inf
2 11687 11687 1.430113

Now that we have already calculated WOE, it is only logical that we calculate the Information Value (IV) also,

library(scorecard)

iv(idata_woe, y="class")
> iv(idata_woe, y="class")
          variable info_value
 1:   relationship 1.52771659
 2: marital.status 1.34867700
 3:            age 0.89506214
 4:    capitalgain 0.89314987
 5:     occupation 0.76787908
 6:      education 0.73690518
 7:  education.num 0.73690518
 8:   hoursperweek 0.40824961
 9:         fnlwgt 0.33755308
10:            sex 0.30052143
11:      workclass 0.17056215
12:    capitalloss 0.11592250
13: native.country 0.07509682
14:           race 0.06784214

Now, we have the information value (meaning the strength of relation between “class” and other parameters.
We now start building the logistic regression model,

logitMod <- glm(class ~ relationship + age + capitalgain + occupation + education.num, data=trainingData, family=binomial(link="logit"))
predicted <- plogis(predict(logitMod, testData))  # predicted scores
# or
predicted <- predict(logitMod, testData, type="response")  # predicted scores
summary(logitMod)

Result is as follows,

> summary(logitMod)

Call:
glm(formula = class ~ relationship + age + capitalgain + occupation + 
    education.num, family = binomial(link = "logit"), data = trainingData)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.3583  -0.5408   0.0015   0.6281   3.3139  

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)    
(Intercept)                 -4.43416    0.18263 -24.280  < 2e-16 ***
relationshipNot-in-family   -2.26671    0.05846 -38.776  < 2e-16 ***
relationshipOther-relative  -2.52271    0.19614 -12.862  < 2e-16 ***
relationshipOwn-child       -3.43974    0.14067 -24.452  < 2e-16 ***
relationshipUnmarried       -2.66316    0.09625 -27.670  < 2e-16 ***
relationshipWife             0.31766    0.09130   3.479 0.000503 ***
age                          0.34161    0.02041  16.734  < 2e-16 ***
capitalgain                  0.95230    0.03972  23.974  < 2e-16 ***
occupationAdm-clerical       1.08011    0.13682   7.895 2.91e-15 ***
occupationArmed-Forces       2.89421    1.34843   2.146 0.031844 *  
occupationCraft-repair       1.29746    0.13042   9.949  < 2e-16 ***
occupationExec-managerial    2.08246    0.13025  15.988  < 2e-16 ***
occupationFarming-fishing    0.23595    0.17661   1.336 0.181539    
occupationHandlers-cleaners  0.58579    0.18067   3.242 0.001185 ** 
occupationMachine-op-inspct  0.83323    0.14885   5.598 2.17e-08 ***
occupationOther-service      0.16679    0.16294   1.024 0.306012    
occupationPriv-house-serv   -0.02904    0.74544  -0.039 0.968925    
occupationProf-specialty     1.70451    0.13280  12.836  < 2e-16 ***
occupationProtective-serv    1.72939    0.17889   9.667  < 2e-16 ***
occupationSales              1.56500    0.13284  11.781  < 2e-16 ***
occupationTech-support       1.48003    0.16098   9.194  < 2e-16 ***
occupationTransport-moving   1.21363    0.14650   8.284  < 2e-16 ***
education.num                0.29476    0.01128  26.127  < 2e-16 ***
---
Signif. codes:  0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 22680  on 16359  degrees of freedom
Residual deviance: 13115  on 16337  degrees of freedom
AIC: 13161

Number of Fisher Scoring iterations: 6

Cheers!