I am writing this blog as a revision of some fundamental topics in machine learning. Since almost all of my current work is done in R, I will implement logistic regression in that language, using the Adult data set. With everything set up, we dive straight into the code; further explanation is kept to a minimum, as most of the code is self-explanatory.
> idata <- read.csv("adult_csv.csv")
> summary(idata)
age workclass fnlwgt education education.num
Min. :0.000 Private :33906 Min. : 12285 HS-grad :15784 Min. : 1.00
1st Qu.:1.000 Self-emp-not-inc: 3862 1st Qu.: 117551 Some-college:10878 1st Qu.: 9.00
Median :2.000 Local-gov : 3136 Median : 178145 Bachelors : 8025 Median :10.00
Mean :1.771 : 2799 Mean : 189664 Masters : 2657 Mean :10.08
3rd Qu.:3.000 State-gov : 1981 3rd Qu.: 237642 Assoc-voc : 2061 3rd Qu.:12.00
Max. :4.000 Self-emp-inc : 1695 Max. :1490400 11th : 1812 Max. :16.00
(Other) : 1463 (Other) : 7625
marital.status occupation relationship race
Divorced : 6633 Prof-specialty : 6172 Husband :19716 Amer-Indian-Eskimo: 470
Married-AF-spouse : 37 Craft-repair : 6112 Not-in-family :12583 Asian-Pac-Islander: 1519
Married-civ-spouse :22379 Exec-managerial: 6086 Other-relative: 1506 Black : 4685
Married-spouse-absent: 628 Adm-clerical : 5611 Own-child : 7581 Other : 406
Never-married :16117 Sales : 5504 Unmarried : 5125 White :41762
Separated : 1530 Other-service : 4923 Wife : 2331
Widowed : 1518 (Other) :14434
sex capitalgain capitalloss hoursperweek native.country class
Female:16192 Min. :0.0000 Min. :0.0000 Min. :0.000 United-States:43832 <=50K:37155
Male :32650 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2.000 Mexico : 951 >50K :11687
Median :0.0000 Median :0.0000 Median :2.000 : 857
Mean :0.2003 Mean :0.1149 Mean :1.951 Philippines : 295
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:2.000 Germany : 206
Max. :4.0000 Max. :4.0000 Max. :4.000 Puerto-Rico : 184
(Other) : 2517
> head(idata)
age workclass fnlwgt education education.num marital.status occupation relationship race
1 2 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White
2 3 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White
3 2 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White
4 3 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black
5 1 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black
6 2 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White
sex capitalgain capitalloss hoursperweek native.country class
1 Male 1 0 2 United-States <=50K
2 Male 0 0 0 United-States <=50K
3 Male 0 0 2 United-States <=50K
4 Male 0 0 2 United-States <=50K
5 Female 0 0 2 Cuba <=50K
6 Female 0 0 2 United-States <=50K
Checking for class bias on the column class:
> table(idata$class)
<=50K >50K
37155 11687
There is a class bias: a condition in which the proportion of one outcome is much smaller than the proportion of the other.
Thus, we must sample the observations in approximately equal proportions to get better models.
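As a quick sanity check, the imbalance can be quantified with prop.table. A self-contained sketch, using a toy vector with the same counts as the table above in place of idata$class:

```r
# Toy stand-in for idata$class, with the same counts as table(idata$class)
cls <- factor(rep(c("<=50K", ">50K"), times = c(37155, 11687)))
round(prop.table(table(cls)), 3)
# <=50K is roughly 0.761 of the data, >50K roughly 0.239
```

So roughly three in four observations fall in the majority class, which is why the sampling step below draws equal counts from each class.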
CREATING TRAINING AND TESTING SAMPLES
#training data
input_ones <- idata[which(idata$class == ">50K"), ] # all >50K rows
input_zeros <- idata[which(idata$class == "<=50K"), ] # all <=50K rows
set.seed(100) # for repeatability of samples
input_ones_training_rows <- sample(1:nrow(input_ones), 0.7*nrow(input_ones)) # >50K rows for training
input_zeros_training_rows <- sample(1:nrow(input_zeros), 0.7*nrow(input_ones)) # the same number of <=50K rows, so the training classes are balanced
training_ones <- input_ones[input_ones_training_rows, ]
training_zeros <- input_zeros[input_zeros_training_rows, ]
trainingData <- rbind(training_ones, training_zeros) # row bind
# Create Test Data
test_ones <- input_ones[-input_ones_training_rows, ]
test_zeros <- input_zeros[-input_zeros_training_rows, ]
testData <- rbind(test_ones, test_zeros) # row bind
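Because the <=50K rows were deliberately sampled with 0.7*nrow(input_ones), the training set should end up with equal counts of both classes; table(trainingData$class) will confirm this. Here is a minimal self-contained sketch of the same downsampling idea on invented toy data:

```r
set.seed(100)
# Invented toy data: 900 majority rows, 100 minority rows
toy   <- data.frame(class = rep(c("<=50K", ">50K"), times = c(900, 100)))
ones  <- toy[toy$class == ">50K", , drop = FALSE]
zeros <- toy[toy$class == "<=50K", , drop = FALSE]
# Sample the SAME number of rows (0.7 * nrow(ones)) from each class
ones_rows  <- sample(1:nrow(ones),  0.7 * nrow(ones))
zeros_rows <- sample(1:nrow(zeros), 0.7 * nrow(ones))
train <- rbind(ones[ones_rows, , drop = FALSE], zeros[zeros_rows, , drop = FALSE])
table(train$class)  # both classes: 70 rows each
```

The leftover rows (indexed with the minus sign, as above) form the test set, which keeps the original imbalanced proportions.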
One simple optional step is to compute the Weight of Evidence (WOE) for the categorical variables.
This is a very simple, yet useful method.
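For reference, the WOE of a level of a categorical variable is log((events in level / total events) / (non-events in level / total non-events)), where "event" here means class == 1 (>50K). A toy sketch with invented counts for a hypothetical two-level variable:

```r
# Invented counts for two levels (A, B) of a hypothetical categorical variable
events     <- c(A = 10, B = 20)   # rows with class == 1 in each level
non_events <- c(A = 50, B = 20)   # rows with class == 0 in each level
woe <- log((events / sum(events)) / (non_events / sum(non_events)))
woe  # A is negative (level leans toward non-events), B is positive
```

A negative WOE means the level is under-represented among events relative to non-events, and vice versa.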
idata_woe <- idata
str(idata_woe)
table(idata_woe$class)
library(car)
library(dplyr)
idata_woe$class <- dplyr::recode(as.character(idata_woe$class), "<=50K" = 0, ">50K" = 1) # as.character() because dplyr::recode() on a factor requires factor/character replacements
#idata_woe$class <- car::recode(idata_woe$class, "<=50K" = 0, ">50K" = 1) # this is not working; car::recode uses a different syntax
head(idata_woe) # inspect the recoded data
i <- 10
q <- quantile(idata_woe$class,
              probs = c(1:(i - 1) / i),
              na.rm = TRUE,
              type = 3)
cuts <- unique(q)
WOE_Class <- table(findInterval(idata_woe$class, vec = cuts, rightmost.closed = FALSE), idata_woe$class)
WOE_Class <- as.data.frame.matrix(WOE_Class)
WOE_Class$`0` <- rowSums(WOE_Class) # column `0` now holds the row totals
WOE_Class$WOE <- log((WOE_Class$`1` * sum(WOE_Class$`0`)) / (WOE_Class$`0` * sum(WOE_Class$`1`)))
> WOE_Class
0 1 WOE
1 37155 0 -Inf
2 11687 11687 1.430113
Now that we have calculated the WOE, it is only logical to calculate the Information Value (IV) as well,
library(scorecard)
> iv(idata_woe, y="class")
variable info_value
1: relationship 1.52771659
2: marital.status 1.34867700
3: age 0.89506214
4: capitalgain 0.89314987
5: occupation 0.76787908
6: education 0.73690518
7: education.num 0.73690518
8: hoursperweek 0.40824961
9: fnlwgt 0.33755308
10: sex 0.30052143
11: workclass 0.17056215
12: capitalloss 0.11592250
13: native.country 0.07509682
14: race 0.06784214
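A common rule of thumb treats variables with IV below roughly 0.1 as weak predictors; the iv() output above can be filtered accordingly. The threshold is a judgment call and not part of the original analysis; the sketch below simply hard-codes the values printed above:

```r
# IV values copied from the scorecard::iv() output above
iv_tbl <- data.frame(
  variable   = c("relationship", "marital.status", "age", "capitalgain",
                 "occupation", "education", "education.num", "hoursperweek",
                 "fnlwgt", "sex", "workclass", "capitalloss",
                 "native.country", "race"),
  info_value = c(1.5277, 1.3487, 0.8951, 0.8931, 0.7679, 0.7369, 0.7369,
                 0.4082, 0.3376, 0.3005, 0.1706, 0.1159, 0.0751, 0.0678)
)
keep <- as.character(iv_tbl$variable[iv_tbl$info_value >= 0.1])
keep  # everything except native.country and race survives the cut
```

The model below uses a hand-picked subset of the strongest variables rather than this mechanical cut.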
Now we have the information value (that is, the strength of the relationship between class and the other variables).
We can now start building the logistic regression model,
logitMod <- glm(class ~ relationship + age + capitalgain + occupation + education.num, data=trainingData, family=binomial(link="logit"))
predicted <- plogis(predict(logitMod, testData)) # predicted probabilities (plogis maps log-odds to probabilities)
# or, equivalently
predicted <- predict(logitMod, testData, type="response") # predicted probabilities
summary(logitMod)
The result is as follows,
> summary(logitMod)
Call:
glm(formula = class ~ relationship + age + capitalgain + occupation +
education.num, family = binomial(link = "logit"), data = trainingData)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.3583 -0.5408 0.0015 0.6281 3.3139
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.43416 0.18263 -24.280 < 2e-16 ***
relationshipNot-in-family -2.26671 0.05846 -38.776 < 2e-16 ***
relationshipOther-relative -2.52271 0.19614 -12.862 < 2e-16 ***
relationshipOwn-child -3.43974 0.14067 -24.452 < 2e-16 ***
relationshipUnmarried -2.66316 0.09625 -27.670 < 2e-16 ***
relationshipWife 0.31766 0.09130 3.479 0.000503 ***
age 0.34161 0.02041 16.734 < 2e-16 ***
capitalgain 0.95230 0.03972 23.974 < 2e-16 ***
occupationAdm-clerical 1.08011 0.13682 7.895 2.91e-15 ***
occupationArmed-Forces 2.89421 1.34843 2.146 0.031844 *
occupationCraft-repair 1.29746 0.13042 9.949 < 2e-16 ***
occupationExec-managerial 2.08246 0.13025 15.988 < 2e-16 ***
occupationFarming-fishing 0.23595 0.17661 1.336 0.181539
occupationHandlers-cleaners 0.58579 0.18067 3.242 0.001185 **
occupationMachine-op-inspct 0.83323 0.14885 5.598 2.17e-08 ***
occupationOther-service 0.16679 0.16294 1.024 0.306012
occupationPriv-house-serv -0.02904 0.74544 -0.039 0.968925
occupationProf-specialty 1.70451 0.13280 12.836 < 2e-16 ***
occupationProtective-serv 1.72939 0.17889 9.667 < 2e-16 ***
occupationSales 1.56500 0.13284 11.781 < 2e-16 ***
occupationTech-support 1.48003 0.16098 9.194 < 2e-16 ***
occupationTransport-moving 1.21363 0.14650 8.284 < 2e-16 ***
education.num 0.29476 0.01128 26.127 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 22680 on 16359 degrees of freedom
Residual deviance: 13115 on 16337 degrees of freedom
AIC: 13161
Number of Fisher Scoring iterations: 6
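With predicted in hand, a natural next step (not covered in the original post) is to threshold the probabilities at 0.5 and compare against testData$class via a confusion matrix. A self-contained sketch on simulated stand-ins for the real labels and scores:

```r
set.seed(100)
# Simulated stand-ins: `actual` plays the role of testData$class,
# `predicted` plays the role of predict(logitMod, testData, type = "response")
actual    <- factor(sample(c("<=50K", ">50K"), 200, replace = TRUE))
predicted <- runif(200)
pred_class <- ifelse(predicted > 0.5, ">50K", "<=50K")
conf <- table(actual = actual, predicted = pred_class)
accuracy <- sum(diag(conf)) / sum(conf)
conf
accuracy
```

On the real test set, remember that the classes are imbalanced, so accuracy alone can be misleading; metrics such as sensitivity, specificity, or AUC are worth checking as well.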
Cheers!