options(scipen=10000000)
library("gridExtra")
library(tidyverse)
library(kableExtra)
library(caret)
library(knitr) 
library(pscl)
library(plotROC)
library(pROC)
library(lubridate)
library(ggcorrplot)
library(sf)
library(dplyr)
library(stargazer)
plotTheme <- function(base_size = 12, title_size = 16) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size, colour = "black"), 
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.minor = element_line("grey90", size = 0.01),
    panel.grid.major = element_line("grey90", size = 0.01),
    panel.border = element_rect(colour = "white" , fill=NA, size=0),
    strip.background = element_rect(color = "white",fill='white'),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 12)
  )
}

mapTheme <- function(base_size = 12, title_size = 16) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    strip.text.x = element_text(size = 12),
    strip.background = element_rect(fill = "white"))
}

palette5 <- c('#f1a340','#fee0b6','#d8daeb','#998ec3','#542788')
palette4 <- c("#981FAC","#FF006A","#FE4C35","#FE9900")
palette2 <- c('#4575b4', '#fecc5c')
house <- read.csv('~/Github/MUSA_508_Lab/Week_7/homework/data/housingSubsidy.csv')
house <- house %>%
  dplyr::select(-X)

I. Introduction and Motivation

Repairing homes enhances neighborhoods and helps ensure housing safety. For years, Emil City has used its limited resources to support improvements to the local living environment through its home repair tax credit program, yet the program has not worked as well as the government expected: only 11% of the eligible homeowners it reaches out to take the credit each year. The purpose of this analysis is therefore to build a model for the Department of Housing and Community Development (HCD), using data on previous years' credit uptake, so that HCD can better target the policy's intended population and avoid ineffective spending and wasted resources.

All of the variables considered in the model are listed below:

  • age: Age of homeowner (Numeric)

  • job: Occupation indicator (Category)

  • marital: Marital status (Category: divorced, married, single, unknown; note: divorced means divorced or widowed)

  • education: Educational attainment (Category: basic.4y, basic.6y, basic.9y, high.school, illiterate, professional.course, university.degree, unknown)

  • taxLien: Is there a lien against the owner's property? (Category)

  • mortgage: Is the owner carrying a mortgage? (Category)

  • taxbill_in_phl: Is the owner's full-time residence not in Philadelphia? (Category)

  • contact: How we previously contacted the individual (Category)

  • month: Month we last contacted the individual (Category)

  • day_of_week: Day of the week we last contacted the individual (Category)

  • campaign: Number of contacts for this individual during this campaign (Category)

  • pdays: Number of days since the individual was last contacted by a previous program (Category); 999 = client not previously contacted

  • previous: Number of contacts before this campaign for this individual (Numeric)

  • poutcome: Outcome of the previous marketing campaign (Category)

  • unemploy_rate: Unemployment rate at time of campaign (Numeric)

  • cons.price.idx: Consumer Price Index at time of campaign (Numeric)

  • cons.conf.idx: Consumer confidence index at time of campaign (Numeric)

  • inflation_rate: US inflation rate, daily indicator (Numeric)

  • spent_on_repairs: Amount annually spent on home repairs (Numeric)

Output variable (desired target):

  • y: Has the resident reached out to HCD and taken the credit? (Binary: 'yes', 'no')

II. Exploratory Analysis: Feature Importance & Correlation

Before building any model, it is imperative to explore and engineer features that are strongly correlated with uptake of the tax credit. This section plots the continuous and the categorical variables against credit uptake in turn. I analyze the characteristics of the variables through the distribution of the categorical variables, the distribution of the continuous variables, and the correlations among the continuous variables, to gain insight into the relationship between the independent variables and the dependent variable.

2.1 Distribution of dependent variable:

Of the 4,119 observations, 3,668 homeowners did not take the credit, while only 451 homeowners, about 11% of those reached, contacted HCD and took the credit.
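These counts can be verified directly from the data; a minimal dplyr sketch using the house data loaded above:

house %>%
  count(y) %>%
  mutate(share = round(n / sum(n), 3))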

ggplot(data = house) +
  geom_bar(aes(y,fill = y),width = 0.4) +
          scale_fill_manual(values = palette2,
                          name = 'Took Credit') +
  labs(x="Took Credit", y="Value", 
           title = "Taking credit vs. No-taking credit\n",
       caption = 'Figure 2.1') +
      theme(legend.position = "none") +
  plotTheme()

2.2 Correlation and Distribution of numeric variables

The dependent variable appears to have little linear relationship with the numeric variables, though this may partly be because it is categorical. More notably, the correlations among inflation_rate, unemploy_rate, and spent_on_repairs are relatively high, which warns us to selectively keep only some of them in the later model construction to avoid multicollinearity.

numericVars <- 
  house %>%
  dplyr::select(where(is.numeric)) %>% 
  na.omit()
nonnumeric <-
  house %>%
  dplyr::select(-all_of(names(numericVars))) %>% 
  na.omit()

ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c('#4575b4', "#ffffff",'#fecc5c'),
  lab = TRUE,
  hc.order = TRUE,
  method = 'circle', 
  insig='blank', 
  sig.level = 0.1,
  type="upper") +  
  labs(title = "Correlation across numeric variables", caption = 'Figure 2.2.1') +
  #theme(label.position='bottom')+
  plotTheme()
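The collinear pairs flagged above can also be identified programmatically with caret's findCorrelation(), which returns the columns involved in pairwise correlations above a cutoff. A sketch (the 0.8 cutoff is my assumption, not part of the original analysis):

## Candidate columns to drop to avoid multicollinearity
caret::findCorrelation(cor(numericVars), cutoff = 0.8, names = TRUE)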

The distribution of the data suggests that the key predictors of taking the credit are mainly the number of campaign contacts, the inflation rate at the time, previous participation in a campaign program, and the unemployment rate.

house %>%
  dplyr::select(names(numericVars),y) %>%
  gather(Variable, value, -y) %>%
    ggplot(aes(y, value, fill=y)) + 
      geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
      facet_wrap(~Variable, scales = "free",ncol=5) +
      scale_fill_manual(values = palette2) +
      labs(x="taking credits", y="Value", 
           title = "Feature associations with the likelihood of taking credict",
           subtitle = "(continous outcomes)",
           caption = 'Figure 2.2.2') +
      theme()+plotTheme()

Although for some variables the difference in means between takers and non-takers is not very large, the distributions do differ between the two groups, which suggests that most of the variables can provide useful information to the prediction model.

house %>%
  dplyr::select(names(numericVars),y) %>%
    gather(Variable, value, -y) %>%
    ggplot() + 
    geom_density(aes(value, color=y), fill = "transparent") + 
    facet_wrap(~Variable, scales = "free",ncol=5) +
    scale_color_manual(values = palette2) +
    labs(title = "Feature distributions taking credits vs. not taking credits",
         subtitle = "(continous outcomes)",
         caption = 'Figure 2.2.3')+
   theme(legend.position = "bottom")+
  plotTheme()

2.3 Feature Engineering

From the data description, the variable pdays takes the value 999 when a client was not previously contacted. This sentinel is numerically very large but does not carry a proportional meaning, so I examined the distribution of the variable and recoded it as a categorical one.

house <- house %>%
  dplyr::mutate(contactInterval = ifelse(pdays > 7,"Long","Short"))
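A quick check of the sentinel's prevalence and the resulting split (a minimal sketch on the same data):

## How common is the "not previously contacted" sentinel, and how did the recode split?
table(house$pdays == 999)
table(house$contactInterval)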

2.4 Distribution of nonnumeric variables

As with the numeric variables, the distributions of the categorical variables differ between takers and non-takers, which suggests that most of these variables can also provide useful information to the prediction model.

nonnumeric %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free", ncol = 5) +
        scale_fill_manual(values = palette2) +
        labs(x="taking credict", y="count",
             title = "Feature associations with the likelihood of taking credict",
             subtitle = "Categorical features", caption = 'Figure 2.4') +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))+
        plotTheme()

2.5 Feature engineering for nonnumeric variables

Because the categorical variables contain many levels, I replaced each one with the mean uptake rate of y within each category. This reduces the computational load when the model runs while preserving the information carried by the different categories.

## Create a numeric 0/1 version of the outcome, needed for the rate
## calculations below and for the regression models.
house <- house %>%
  dplyr::mutate(y_numeric = ifelse(y == "yes", 1, 0))

nonnumericVar <- names(nonnumeric)

house <- 
  house %>% 
  group_by(marital) %>% 
  summarize(toty = sum(y_numeric), 
            n = n(), 
            maritalAvg = 100*(toty/n)) %>%
  dplyr::select(-n, -toty) %>%
  right_join(house, .) %>%
  ungroup()

house <- 
  house %>% 
  group_by(taxLien) %>% 
  summarize(toty = sum(y_numeric), 
            n = n(), 
            taxLienAvg = 100*(toty/n)) %>%
  dplyr::select(-n, -toty) %>%
  right_join(house, .) %>%
  ungroup()

house <- 
  house %>% 
  group_by(mortgage) %>% 
  summarize(toty = sum(y_numeric), 
            n = n(), 
            mortgageAvg = 100*(toty/n)) %>%
  dplyr::select(-n, -toty) %>%
  right_join(house, .) %>%
  ungroup()


house <- 
  house %>% 
  group_by(job) %>% 
  summarize(toty = sum(y_numeric), 
            n = n(), 
            jobyAvg = 100*(toty/n)) %>%
  dplyr::select(-n, -toty) %>%
  right_join(house, .) %>%
  ungroup()

house <- 
  house %>% 
  group_by(education) %>% 
  summarize(toty = sum(y_numeric), 
            n = n(), 
            eduyAvg = 100*(toty/n)) %>%
  dplyr::select(-n, -toty) %>%
  right_join(house, .) %>%
  ungroup()

house <- 
  house %>% 
  group_by(taxbill_in_phl) %>% 
  summarize(toty = sum(y_numeric), 
            n = n(), 
            taxAvg = 100*(toty/n)) %>%
  dplyr::select(-n, -toty) %>%
  right_join(house, .) %>%
  ungroup()


house <- 
  house %>% 
  group_by(contact) %>% 
  summarize(toty = sum(y_numeric), 
            n = n(), 
            conAvg = 100*(toty/n)) %>%
  dplyr::select(-n, -toty) %>%
  right_join(house, .) %>%
  ungroup()


house <- 
  house %>% 
  group_by(month) %>% 
  summarize(toty = sum(y_numeric), 
            n = n(), 
            monthAvg = 100*(toty/n)) %>%
  dplyr::select(-n, -toty) %>%
  right_join(house, .) %>%
  ungroup()


house <- 
  house %>% 
  group_by(day_of_week) %>% 
  summarize(toty = sum(y_numeric), 
            n = n(), 
            day_of_weekAvg = 100*(toty/n)) %>%
  dplyr::select(-n, -toty) %>%
  right_join(house, .) %>%
  ungroup()


house <- 
  house %>% 
  group_by(poutcome) %>% 
  summarize(toty = sum(y_numeric), 
            n = n(), 
            poutcomAvg = 100*(toty/n)) %>%
  dplyr::select(-n, -toty) %>%
  right_join(house, .) %>%
  ungroup()
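The ten blocks above are identical except for the grouping column; a more compact equivalent (a sketch, assuming the same column names) wraps the pattern in a hypothetical helper:

## Hypothetical helper: target-encode one categorical column as the
## percentage of homeowners in each category who took the credit.
encode_rate <- function(df, col, newname) {
  df %>%
    group_by(.data[[col]]) %>%
    summarize(!!newname := 100 * sum(y_numeric) / n(), .groups = "drop") %>%
    right_join(df, by = col) %>%
    ungroup()
}
## e.g. house <- encode_rate(house, "marital", "maritalAvg")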

III. Model Building

Since the dependent variable in this analysis is binary, this section builds a series of logistic regression models. Three specifications are compared:

  • a model with all provided features;
  • a stepwise model;
  • a model with features selected manually, based on the all-feature model and the correlation analysis.

3.1 Model Building

First, all of the features are put into the model and tested. When the model is built directly with every feature in the dataset, only a few features are significantly related to whether an individual took the credit, indicating that the model needs to be rebuilt with fewer features and a handful of engineered ones.

Note: Model summary will be shown together with the second model below.

## Split the data into a 65/35 training/test set
set.seed(3456)
trainIndex <- createDataPartition(house$y, p = .65,
                                  list = FALSE,
                                  times = 1)
houseTrain <- house[ trainIndex,]
houseTest  <- house[-trainIndex,]

## All feature model
all.model <- glm(y_numeric ~ .,
                  data = houseTrain %>% 
                    dplyr::select(-all_of(names(nonnumeric))),
                  family = binomial(link = "logit"))

3.2 Predictions and Plots

We create a dataframe of predictions for the 1,440 observations in our test set, called testProbs.all. These predictions are the estimated probabilities of taking the credit for these out-of-sample subjects, which we can compare to the observed outcomes. Figure 3.2 shows that the model assigns higher predicted probabilities to not taking the credit, which is a good thing here: it means the model can help reduce waste by not allocating resources to households that are unlikely to take the credit.

testProbs.all <- data.frame(Outcome = as.factor(houseTest$y_numeric), Probs = predict(all.model, houseTest,type= "response"))
ggplot(testProbs.all, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Taking credits", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome", caption = 'Figure 3.2') +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")+ 
  plotTheme()

3.3 Goodness of Fit: Confusion Matrix

A “confusion matrix” at the 50% threshold shows how often we got True Positives (whose rate is the sensitivity), False Positives, True Negatives (whose rate is the specificity), and False Negatives at that threshold.

In this case,

  • True Positives: Predicted correctly homeowner would take the credit.
  • False Positives: Predicted incorrectly homeowner would take the credit.
  • True Negatives: Predicted correctly homeowner would not take the credit.
  • False Negatives: We predicted that a homeowner would not take the credit but they did.

This analysis cares more about sensitivity, since the goal is to maximize the efficiency of the outreach campaign. This initial model has low sensitivity (0.248) and high specificity (0.986). The high specificity is still acceptable, because predicting true negatives well reduces wasted outreach, but we will rebuild and optimize the model below.
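These rates follow directly from the matrix below, since sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); a quick check:

## From the confusion matrix below: TP = 39, FN = 118, TN = 1265, FP = 18
39 / (39 + 118)      # sensitivity, about 0.248
1265 / (1265 + 18)   # specificity, about 0.986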

testProbs <- 
  testProbs.all %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs.all$Probs > 0.5 , 1, 0)))
caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1265  118
##          1   18   39
##                                               
##                Accuracy : 0.9056              
##                  95% CI : (0.8893, 0.9202)    
##     No Information Rate : 0.891               
##     P-Value [Acc > NIR] : 0.03945             
##                                               
##                   Kappa : 0.3253              
##                                               
##  Mcnemar's Test P-Value : < 0.0000000000000002
##                                               
##             Sensitivity : 0.24841             
##             Specificity : 0.98597             
##          Pos Pred Value : 0.68421             
##          Neg Pred Value : 0.91468             
##              Prevalence : 0.10903             
##          Detection Rate : 0.02708             
##    Detection Prevalence : 0.03958             
##       Balanced Accuracy : 0.61719             
##                                               
##        'Positive' Class : 1                   
## 

IV. Model Optimization

4.1 Model Building

The optimized models include a stepwise model and a model with features selected manually according to their p-values in the all-feature model. In addition, because taking and not taking the credit are very unbalanced in the dataset, I also down-sampled the data and built a stepwise model on the balanced sample.

## Stepwise model

library(MASS)
step.model <- all.model %>% stepAIC(trace = FALSE)

summary(step.model)
## 
## Call:
## glm(formula = y_numeric ~ campaign + pdays + unemploy_rate + 
##     cons.price.idx + cons.conf.idx + spent_on_repairs + eduyAvg + 
##     conAvg + monthAvg, family = binomial(link = "logit"), data = houseTrain %>% 
##     dplyr::select(-names(nonnumeric)))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8202  -0.4227  -0.3473  -0.2859   2.8913  
## 
## Coefficients:
##                      Estimate   Std. Error z value      Pr(>|z|)    
## (Intercept)      -193.2235629   43.9131971  -4.400 0.00001081885 ***
## campaign           -0.0936896    0.0419568  -2.233      0.025549 *  
## pdays              -0.0014693    0.0002488  -5.905 0.00000000353 ***
## unemploy_rate      -1.0288251    0.2012092  -5.113 0.00000031673 ***
## cons.price.idx      1.6133204    0.3208009   5.029 0.00000049294 ***
## cons.conf.idx       0.0551983    0.0149249   3.698      0.000217 ***
## spent_on_repairs    0.0080936    0.0030170   2.683      0.007304 ** 
## eduyAvg             0.0550033    0.0309555   1.777      0.075593 .  
## conAvg              0.0976204    0.0240810   4.054 0.00005038585 ***
## monthAvg            0.0270702    0.0065539   4.130 0.00003620925 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1853.7  on 2678  degrees of freedom
## Residual deviance: 1531.7  on 2669  degrees of freedom
## AIC: 1551.7
## 
## Number of Fisher Scoring iterations: 6
## Engineering model

engineer.model <- glm(y_numeric ~ .,
                  data = houseTrain %>% 
                    dplyr::select(pdays, cons.price.idx, monthAvg, y_numeric),
                  family = binomial(link = "logit"))
summary(engineer.model)
## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = houseTrain %>% dplyr::select(pdays, cons.price.idx, 
##         monthAvg, y_numeric))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1831  -0.4334  -0.3804  -0.3598   2.4063  
## 
## Coefficients:
##                  Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)    32.4582495 10.2100437   3.179              0.00148 ** 
## pdays          -0.0024118  0.0002249 -10.725 < 0.0000000000000002 ***
## cons.price.idx -0.3515682  0.1090847  -3.223              0.00127 ** 
## monthAvg        0.0445980  0.0057471   7.760  0.00000000000000849 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1853.7  on 2678  degrees of freedom
## Residual deviance: 1629.3  on 2675  degrees of freedom
## AIC: 1637.3
## 
## Number of Fisher Scoring iterations: 5
## Down model
house$y = as.factor(house$y)
set.seed(9)
down_train <- downSample(x = subset(house, select = -c(y)), y = house$y, yname = "y")
set.seed(3456)
trainIndex <- createDataPartition(down_train$y, p = .65,
                                  list = FALSE,
                                  times = 1)
reTrain <- down_train[ trainIndex,]
reTest  <- down_train[-trainIndex,]

down.model  <- glm(y_numeric ~ .,
                  data = reTrain %>% 
                   dplyr::select(-all_of(names(nonnumeric))),
                  family = binomial(link = "logit"))
down.model <- down.model %>% stepAIC(trace = FALSE)
summary(down.model)
## 
## Call:
## glm(formula = y_numeric ~ age + pdays + unemploy_rate + cons.price.idx + 
##     maritalAvg + mortgageAvg + jobyAvg + conAvg + monthAvg + 
##     poutcomAvg, family = binomial(link = "logit"), data = reTrain %>% 
##     dplyr::select(-names(nonnumeric)))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6390  -0.8641  -0.1564   0.8210   2.1460  
## 
## Coefficients:
##                   Estimate  Std. Error z value       Pr(>|z|)    
## (Intercept)    -134.378871   28.442866  -4.725 0.000002306602 ***
## age               0.026404    0.010712   2.465       0.013703 *  
## pdays            -0.005231    0.002857  -1.831       0.067076 .  
## unemploy_rate    -0.682226    0.106935  -6.380 0.000000000177 ***
## cons.price.idx    1.379114    0.298965   4.613 0.000003969641 ***
## maritalAvg        0.150482    0.072211   2.084       0.037168 *  
## mortgageAvg       0.532141    0.295594   1.800       0.071822 .  
## jobyAvg           0.060636    0.024720   2.453       0.014170 *  
## conAvg            0.116702    0.030112   3.876       0.000106 ***
## monthAvg          0.031341    0.011636   2.694       0.007070 ** 
## poutcomAvg       -0.075778    0.051471  -1.472       0.140953    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 815.14  on 587  degrees of freedom
## Residual deviance: 622.46  on 577  degrees of freedom
## AIC: 644.46
## 
## Number of Fisher Scoring iterations: 6

From the model results, the stepwise model performs better than the manually selected feature model, and the residual deviance of the models trained on the full dataset is relatively better than that of the model trained on the down-sampled dataset.
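The AIC comparison can be made explicit in one call (a sketch; note that the down-sample model is fit on far fewer observations, so its AIC is not directly comparable to the others):

AIC(all.model, step.model, engineer.model, down.model)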

4.2 Predictions and Plots

From the predicted density plots, the distributions of predicted probabilities for the two models trained on the full dataset are basically the same as that of the all-feature model. However, the engineered model's predicted probabilities are unevenly distributed near 0, suggesting that the reduced feature set is missing information that some predictors carried; the stepwise model performs better. This is also borne out by the accuracies of the two models.

testProbs1 <- data.frame(Outcome = as.factor(houseTest$y_numeric), Probs = predict(step.model, houseTest,type= "response"))
testProbs2 <- data.frame(Outcome = as.factor(houseTest$y_numeric), Probs = predict(engineer.model, houseTest,type= "response"))
testProbs3 <- data.frame(Outcome = as.factor(houseTest$y_numeric), Probs = predict(down.model, houseTest,type= "response"))

ggplot(testProbs1, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "taking credits", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome - step.model",
       caption = 'Figure 4.2.1') +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")+ 
  plotTheme()

ggplot(testProbs2, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "taking credits", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome - engineer.model"
       , caption = 'Figure 4.2.2') +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")+ 
  plotTheme()

For the down-sample model, the predictions are not concentrated solely on not taking the credit; the model also performs better on taking the credit, which is in line with our requirements for the model.

ggplot(testProbs3, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "taking credits", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome - down.model"
       , caption = 'Figure 4.2.3') +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")+ 
  plotTheme()

4.3 Goodness of Fit: Confusion Matrix

Shown below, compared to the all-feature model, sensitivity declined for both models trained on the full dataset, while the specificity of the stepwise model increased, suggesting that reducing predictors may lead to information loss.

Confusion Matrix of engineered model
testProbs <- 
  testProbs2 %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs2$Probs > 0.5 , 1, 0)))
caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1264  126
##          1   19   31
##                                              
##                Accuracy : 0.8993             
##                  95% CI : (0.8826, 0.9144)   
##     No Information Rate : 0.891              
##     P-Value [Acc > NIR] : 0.1656             
##                                              
##                   Kappa : 0.2606             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.19745            
##             Specificity : 0.98519            
##          Pos Pred Value : 0.62000            
##          Neg Pred Value : 0.90935            
##              Prevalence : 0.10903            
##          Detection Rate : 0.02153            
##    Detection Prevalence : 0.03472            
##       Balanced Accuracy : 0.59132            
##                                              
##        'Positive' Class : 1                  
## 
Confusion Matrix of stepwise model
testProbs <- 
  testProbs1 %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs1$Probs > 0.5 , 1, 0)))
caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1268  121
##          1   15   36
##                                               
##                Accuracy : 0.9056              
##                  95% CI : (0.8893, 0.9202)    
##     No Information Rate : 0.891               
##     P-Value [Acc > NIR] : 0.03945             
##                                               
##                   Kappa : 0.3092              
##                                               
##  Mcnemar's Test P-Value : < 0.0000000000000002
##                                               
##             Sensitivity : 0.22930             
##             Specificity : 0.98831             
##          Pos Pred Value : 0.70588             
##          Neg Pred Value : 0.91289             
##              Prevalence : 0.10903             
##          Detection Rate : 0.02500             
##    Detection Prevalence : 0.03542             
##       Balanced Accuracy : 0.60880             
##                                               
##        'Positive' Class : 1                   
## 
Confusion Matrix of downsample stepwise model

Compared to the models trained on the full dataset, the model trained on the down-sampled dataset shows a significant improvement in sensitivity, although its specificity has decreased.

testProbs3 <- 
  testProbs3 %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs3$Probs > 0.5 , 1, 0)))
caret::confusionMatrix(testProbs3$predOutcome, testProbs3$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1050   43
##          1  233  114
##                                              
##                Accuracy : 0.8083             
##                  95% CI : (0.787, 0.8284)    
##     No Information Rate : 0.891              
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.3556             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.72611            
##             Specificity : 0.81839            
##          Pos Pred Value : 0.32853            
##          Neg Pred Value : 0.96066            
##              Prevalence : 0.10903            
##          Detection Rate : 0.07917            
##    Detection Prevalence : 0.24097            
##       Balanced Accuracy : 0.77225            
##                                              
##        'Positive' Class : 1                  
## 

V. Model Comparison

5.1 Feature Comparison

stargazer(all.model, step.model, engineer.model,down.model, type = "html", 
          title = "Table 5.1 Summary Statistics of Models on Training Set",
          header = FALSE,
          single.row = TRUE,
          column.labels=c("all.model","step.model","engineer.model",'down.sample'))
Table 5.1 Summary Statistics of Models on Training Set

Dependent variable: y_numeric

                       all.model             step.model            engineer.model       down.sample
                       (1)                   (2)                   (3)                  (4)
age                    0.006 (0.007)                                                    0.026** (0.011)
campaign               -0.094** (0.042)      -0.094** (0.042)
pdays                  -0.001 (0.001)        -0.001*** (0.0002)    -0.002*** (0.0002)   -0.005* (0.003)
previous               -0.154 (0.128)
unemploy_rate          -0.967*** (0.224)     -1.029*** (0.201)                          -0.682*** (0.107)
cons.price.idx         1.682*** (0.350)      1.613*** (0.321)      -0.352*** (0.109)    1.379*** (0.299)
cons.conf.idx          0.059*** (0.021)      0.055*** (0.015)
inflation_rate         -0.138 (0.281)
spent_on_repairs       0.010** (0.005)       0.008*** (0.003)
contactIntervalShort   0.341 (0.601)
maritalAvg             0.003 (0.049)                                                    0.150** (0.072)
taxLienAvg             0.002 (0.034)
mortgageAvg            -0.012 (0.214)                                                   0.532* (0.296)
jobyAvg                0.009 (0.016)                                                    0.061** (0.025)
eduyAvg                0.057* (0.032)        0.055* (0.031)
taxAvg                 0.115 (0.166)
conAvg                 0.101*** (0.025)      0.098*** (0.024)                           0.117*** (0.030)
monthAvg               0.028*** (0.007)      0.027*** (0.007)      0.045*** (0.006)     0.031*** (0.012)
day_of_weekAvg         -0.118 (0.202)
poutcomAvg             0.006 (0.013)                                                    -0.076 (0.051)
Constant               -207.447*** (53.867)  -193.224*** (43.913)  32.458*** (10.210)   -134.379*** (28.443)
Observations           2,679                 2,679                 2,679                588
Log Likelihood         -763.302              -765.856              -814.658             -311.232
Akaike Inf. Crit.      1,568.605             1,551.712             1,637.317            644.463
Note: *p<0.1; **p<0.05; ***p<0.01

5.2 Cross Validation on Models

This section conducts cross-validation on the models. The outputs include the mean AUC (ROC), sensitivity, and specificity across 100 folds.

The first two models generalize well with regard to specificity, but neither generalizes well on sensitivity. The third model, however, does well on sensitivity and on mean AUC.

The cross-validation results are lower than the test-set accuracy, which may be due to the imbalanced distribution of the outcome across the training and test sets. For this logistic regression exercise, we can conclude that the third model performs better.

stepwise model
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)
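## Note on the relabeling below: ifelse(y == "no", "c1.yes", "c1.no") maps credit
## takers ("yes") to "c1.no", the alphabetically first factor level. caret treats
## the first level as the event class, so the Sens column in the output measures
## how well each model detects credit takers, and Spec measures non-takers.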


cvFit.reg1 <- train(y ~ campaign + pdays + unemploy_rate + 
    cons.price.idx + cons.conf.idx + spent_on_repairs + eduyAvg + 
    conAvg + monthAvg, data = houseTrain %>% dplyr::select(-y_numeric) %>%
                  dplyr::mutate(y = ifelse(y=="no","c1.yes","c1.no")), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

cvFit.reg2 <- train(y~ .,
                  data=houseTrain %>% 
                    dplyr::select(pdays,cons.price.idx ,monthAvg,y) %>%
                  dplyr::mutate(y = ifelse(y=="no","c1.yes","c1.no")),
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)


cvFit.reg3 <- train( y ~ age + pdays + unemploy_rate + cons.price.idx + 
    maritalAvg + mortgageAvg + jobyAvg + conAvg + monthAvg + 
    poutcomAvg,
                  data=reTrain %>% 
                    dplyr::select(-y_numeric) %>%
                  dplyr::mutate(y = ifelse(y=="no","c1.yes","c1.no")),
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

cvFit.reg1
## Generalized Linear Model 
## 
## 2679 samples
##    9 predictor
##    2 classes: 'c1.no', 'c1.yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 2652, 2652, 2652, 2653, 2652, 2652, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.7562606  0.1966667  0.9866123
dplyr::select(cvFit.reg1$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit.reg1$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#fecc5c") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#4575b4", linetype = 2, size = 1) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean reprented as dotted lines")+plotTheme()

engineered model
cvFit.reg2 
## Generalized Linear Model 
## 
## 2679 samples
##    3 predictor
##    2 classes: 'c1.no', 'c1.yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 2652, 2652, 2652, 2652, 2653, 2652, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.6974155  0.1366667  0.9849275
dplyr::select(cvFit.reg2$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit.reg2$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#fecc5c") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#4575b4", linetype = 2, size = 1) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean reprented as dotted lines")+plotTheme()

downsample model
cvFit.reg3 
## Generalized Linear Model 
## 
## 588 samples
##  10 predictor
##   2 classes: 'c1.no', 'c1.yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 582, 582, 582, 582, 582, 582, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.7944444  0.6366667  0.8316667
dplyr::select(cvFit.reg3$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit.reg3$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#fecc5c") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#4575b4", linetype = 2, size = 1) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean reprented as dotted lines")+plotTheme()

From the histograms of ROC, Sens, and Spec across the cross-validation folds, the ROC of the down-sample model is less accurate in some folds, but there is a significant improvement in sensitivity, which indicates that we can use the third model to make predictions.

5.3 ROC Curve Comparison

Theoretically, the ROC curve plots the true positive rate against the false positive rate at each decision threshold, while the AUC provides an aggregate measure of performance across all possible classification thresholds. As shown in the graphs below, each additional improvement in the true positive rate costs a greater proportion of false positive errors; moving from a 75% to a 95% true positive rate dramatically increases the false positive rate, a diminishing-returns trade-off. An AUC of about 0.83 means the model ranks a randomly chosen positive example above a randomly chosen negative example about 83% of the time. The ROC curves also confirm that the down-sample model performs best.
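The ranking interpretation of AUC can be checked directly: it equals the share of (positive, negative) pairs in which the positive case receives the higher predicted probability. A minimal sketch using testProbs1 from above (ties counted as half):

pos <- testProbs1$Probs[testProbs1$Outcome == 1]
neg <- testProbs1$Probs[testProbs1$Outcome == 0]
mean(outer(pos, neg, ">")) + 0.5 * mean(outer(pos, neg, "==")) # ~0.83, matching auc() below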

stepwise Model

auc(testProbs1$Outcome, testProbs1$Probs)
## Area under the curve: 0.8296

engineer Model

auc(testProbs2$Outcome, testProbs2$Probs)
## Area under the curve: 0.7591

downsample Model

auc(testProbs3$Outcome, testProbs3$Probs)
## Area under the curve: 0.839
grid.arrange(ggplot(testProbs1, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 30, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - stepwiseModel", caption = 'Figure 5.3.1') +
  plotTheme(),
  ggplot(testProbs2, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 30, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - engineerModel", caption = 'Figure 5.3.2') +
  plotTheme(), 
  ggplot(testProbs3, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 30, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - downsampleModel", caption = 'Figure 5.3.3') +
  plotTheme(),ncol= 3)

VI. Cost-Benefit Analysis

As indicated at the beginning, a cost-benefit analysis for a public department differs from one for a private company: rather than caring only about revenue, a public department cares more about efficiency, equity, and adequacy. In this case, HCD's costs come from the campaign and from the credit allocations, while its revenue comes from the property-value increases of both the credited home and the surrounding homes (this analysis assumes a 10% tax rate on those increases), since the tax credit program promotes property values.

Because down-sampling the dataset may reduce overall prediction accuracy and bias the cost-benefit analysis, the results of both the full-dataset model and the down-sampled model are analyzed below.

6.1 Cost/Benefit Equation

  • True Positive - Predicted correctly homeowner would take the credit; allocated the marketing resources, and 25% took the credit.

  • True Negative - Predicted correctly homeowner would not take the credit, no marketing resources were allocated, and no credit was allocated.

  • False Positive - Predicted incorrectly homeowner would take the credit; allocated marketing resources; no credit allocated.

  • False Negative - We predicted that a homeowner would not take the credit but they did. These are likely homeowners who signed up for reasons unrelated to the marketing campaign. Thus, we ‘0 out’ this category, assuming the cost/benefit of this is $0.
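To make these categories concrete, here is the per-case arithmetic behind the Revenue columns below (the dollar amounts come from the code: a $2,850 marketing cost per contacted household, a $5,000 credit, and $10,000 + $56,000 in home-value premiums taxed at 10%):

## Per true positive: 25% take the credit, 75% do not
##   taker:     -5000 - 2850 + (10000 + 56000) * 0.10 = -1250
##   non-taker: -2850
## Expected revenue per TP: 0.25 * (-1250) + 0.75 * (-2850) = -2450
## Each false positive costs the marketing outlay only: -2850
0.25 * (-5000 - 2850 + (10000 + 56000) * 0.10) + 0.75 * (-2850)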

6.2 Create the “Cost/Benefit Table”

full sample model

At the default threshold of 0.5, only 157 individuals (36 TP and 121 FN) are expected to take the tax credit, of which the 121 false negatives would take the credit for reasons other than being attracted by the campaign.

cost_benefit_table <-
   testProbs %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               ifelse(Variable == "True_Negative",0 ,
               ifelse(Variable == "True_Positive",((-5000 - 2850 + (10000 + 56000) * 0.1) * Count * 0.25 +
                                                    (- 2850) * Count * 0.75),
               ifelse(Variable == "False_Negative",0,
               ifelse(Variable == "False_Positive", (-2850) * Count, 0))))) %>%
    bind_cols(data.frame(Description = c(
              "We correctly predicted homeowner would not take the credit",
              "We correctly predicted homeowner would take the credit",
              "We predicted that a homeowner would not take the credit but they did",
              "We predicted incorrectly homeowner would take the credit")))

kable(cost_benefit_table,
       caption = "Cost/Benefit Table for Full Sample Model") %>% kable_styling() %>%
  footnote(general = "Table 6.2.1 Cost/Benefit Table for Full Sample Model\n",
           general_title= '\n')
Cost/Benefit Table for Full Sample Model

Variable        Count  Revenue  Description
True_Negative   1268   0        We correctly predicted homeowner would not take the credit
True_Positive   36     -88200   We correctly predicted homeowner would take the credit
False_Negative  121    0        We predicted that a homeowner would not take the credit but they did
False_Positive  15     -42750   We predicted incorrectly homeowner would take the credit

Table 6.2.1 Cost/Benefit Table for Full Sample Model
downsample model

At the default threshold of 0.5, 157 individuals (114 TP and 43 FN) are expected to take the tax credit, of which the 43 false negatives would take the credit for reasons other than being attracted by the campaign. Under such circumstances, HCD gains no net revenue and instead runs a large deficit.

cost_benefit_table <-
   testProbs3 %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               ifelse(Variable == "True_Negative",0 ,
               ifelse(Variable == "True_Positive",((-5000 - 2850 + (10000 + 56000) * 0.1) * Count * 0.25 +
                                                    (- 2850) * Count * 0.75),
               ifelse(Variable == "False_Negative",0,
               ifelse(Variable == "False_Positive", (-2850) * Count, 0))))) %>%
    bind_cols(data.frame(Description = c(
              "We correctly predicted homeowner would not take the credit",
              "We correctly predicted homeowner would take the credit",
              "We predicted that a homeowner would not take the credit but they did",
              "We predicted incorrectly homeowner would take the credit")))

kable(cost_benefit_table,
       caption = "Cost/Benefit Table for Downsample Model") %>% kable_styling() %>%
  footnote(general = "Table 6.2.2 Cost/Benefit Table for Downsample Model\n",
           general_title= '\n')
Cost/Benefit Table for Downsample Model

Variable        Count  Revenue  Description
True_Negative   1050   0        We correctly predicted homeowner would not take the credit
True_Positive   114    -279300  We correctly predicted homeowner would take the credit
False_Negative  43     0        We predicted that a homeowner would not take the credit but they did
False_Positive  233    -664050  We predicted incorrectly homeowner would take the credit

Table 6.2.2 Cost/Benefit Table for Downsample Model

6.3 Optimize Thresholds

This section selects optimal thresholds by plotting the confusion matrix outcomes at each threshold.

The revenue plots show that as the threshold increases, the number of false positives falls and overall revenue rises significantly (the deficit shrinks). However, as indicated before, maximizing revenue is not HCD's ultimate goal; HCD also wants to allocate more credits for the sake of better development. So in pursuing more true positives alongside the overall rise in revenue, there is a balance to strike in the threshold design.

full sample model

The graphs below indicate that up to a threshold of about 0.1, the number of TN cases rises sharply while the number of FP cases falls sharply. The predicted number of TP cases decreases continuously throughout.

iterateThresholds <- function(data) {
  x = .01
  all_prediction <- data.frame()
  while (x <= 1) {
  
  this_prediction <-
      data %>%
      mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
     gather(Variable, Count) %>%
     mutate(Revenue =
               ifelse(Variable == "True_Negative", Count * 0,
               ifelse(Variable == "True_Positive",((-5000 - 2850 + (10000 + 56000) * 0.10) * Count * 0.25 +
                                                    (- 2850) * Count * 0.75),
               ifelse(Variable == "False_Negative",0,
               ifelse(Variable == "False_Positive", (-2850) * Count, 0)))),
            Threshold = x)
  
  all_prediction <- rbind(all_prediction, this_prediction)
  x <- x + .01
  }
return(all_prediction)
}

## The function now uses its data argument; pass in the prediction data frame.
whichThreshold <- iterateThresholds(testProbs)

whichThreshold_revenue <- 
whichThreshold %>% 
    group_by(Threshold) %>% 
    summarize(Revenue = sum(Revenue))

grid.arrange(
 whichThreshold%>%
  filter(Variable != 'False_Negative') %>%
  ggplot(.,aes(Threshold, Count, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette5[c(5, 1:3)]) +    
  labs(title = "Count by confusion matrix type and threshold\n",
       y = "Count",
       caption = 'Figure 6.3.1') +
  plotTheme() +
  guides(colour=guide_legend(title = "Confusion Matrix")) ,

whichThreshold %>%
  filter(Variable != 'False_Negative') %>%
  ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette5[c(5, 1:3)]) +    
  labs(title = "Revenue by confusion matrix type and threshold\n",
       y = "Revenue",
       caption = 'Figure 6.3.2') +
  plotTheme() +
  guides(colour=guide_legend(title = "Confusion Matrix")) , ncol =2)

downsample model

The graphs below indicate that up to a threshold of about 0.75, the number of TN cases rises while the number of FP cases falls. The predicted number of TP cases decreases continuously throughout.

## The iterateThresholds() helper defined above takes the prediction data frame
## as its argument, so it can be reused directly on the down-sample model's
## test predictions.
whichThreshold2 <- iterateThresholds(testProbs3)

whichThreshold_revenue2 <- 
whichThreshold2 %>% 
    group_by(Threshold) %>% 
    summarize(Revenue = sum(Revenue))

grid.arrange(
whichThreshold2%>%
  filter(Variable != 'False_Negative') %>%
  ggplot(.,aes(Threshold, Count, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette5[c(5, 1:3)]) +    
  labs(title = "Count by confusion matrix type and threshold\n",
       y = "Count",
       caption = 'Figure 6.3.3') +
  plotTheme() +
  guides(colour=guide_legend(title = "Confusion Matrix")) ,

whichThreshold2 %>%
  filter(Variable != 'False_Negative') %>%
  ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette5[c(5, 1:3)]) +    
  labs(title = "Revenue by confusion matrix type and threshold\n",
       y = "Revenue",
       caption = 'Figure 6.3.4') +
  plotTheme() +
  guides(colour=guide_legend(title = "Confusion Matrix")) , ncol= 2)

6.4 Optimum Threshold and Prediction Outcome

The two plots below show the total revenue and the total credit count at each threshold, relative to the optimum threshold. As the threshold increases, both revenue and the count of credits increase.

The optimum threshold in this case is 0.85. The final tables show total revenue and credit counts at two thresholds: 0.50 and the optimum 0.85. Although both revenues are negative, the deficit at the 0.85 threshold is smaller than at the 0.50 threshold, and more individuals are expected to take the credit, which increases the potential return for HCD and for the city's development.
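The vertical lines in the figures below mark the revenue-maximizing thresholds; a compact way to extract them, using the same idiom as the plotting code:

optimal_full <- pull(arrange(whichThreshold_revenue, -Revenue)[1, 1])
optimal_down <- pull(arrange(whichThreshold_revenue2, -Revenue)[1, 1])
c(optimal_full, optimal_down)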

grid.arrange(
  ggplot(whichThreshold_revenue)+
  geom_point(aes(x = Threshold, y = Revenue),color = palette5[c(5)])+
  geom_vline(xintercept =  pull(arrange(whichThreshold_revenue, -Revenue)[1,1]),color = palette5[c(1)])+
  labs(title = "Model Revenues By Threshold\n(Full Sample Model)\n",
        subtitle = "Vertical Line Denotes Optimal Threshold", caption = 'Figure 6.4.1') +
  plotTheme(),
    ggplot(whichThreshold_revenue2)+
  geom_point(aes(x = Threshold, y = Revenue),color = palette5[c(5)])+
  geom_vline(xintercept =  pull(arrange(whichThreshold_revenue2, -Revenue)[1,1]),color = palette5[c(1)])+
  labs(title = "Model Revenues By Threshold\n(Down Sample Model)\n",
        subtitle = "Vertical Line Denotes Optimal Threshold", caption = 'Figure 6.4.2') +
  plotTheme(), ncol =2)

  whichThreshold_subsidy_revenue <- 
  whichThreshold %>% 
    mutate(total_count =
               ifelse(Variable == "True_Positive", Count * 0.25, 
               ifelse(Variable == 'False_Negative', Count ,0 ))) %>% 
    group_by(Threshold) %>% 
    summarize(Total_Revenue = sum(Revenue),
              Total_Count_of_Credits = sum(total_count)) 
  
  whichThreshold_subsidy_revenue2 <- 
  whichThreshold2 %>% 
    mutate(total_count =
               ifelse(Variable == "True_Positive", Count * 0.25, 
               ifelse(Variable == 'False_Negative', Count ,0 ))) %>% 
    group_by(Threshold) %>% 
    summarize(Total_Revenue = sum(Revenue),
              Total_Count_of_Credits = sum(total_count)) 
  
grid.arrange(  
  whichThreshold_subsidy_revenue %>%
  dplyr::select(Threshold, Total_Count_of_Credits, Total_Revenue) %>%
  ggplot(aes(Threshold, Total_Count_of_Credits)) +
    geom_point(color =  "#658feb") +
    geom_vline(xintercept = pull(arrange(whichThreshold_subsidy_revenue, -Total_Revenue)[1,1]),color = palette5[c(1)]) +
  ylab('Total Counts')  +
  labs(title = "Total Count of Credits by Threshold\n(Full Sample Model)\n",subtitle = "Vertical Line Denotes Optimal Threshold",
       caption = 'Figure 6.4.3')  + 
 plotTheme() ,
  
   whichThreshold_subsidy_revenue2 %>%
  dplyr::select(Threshold, Total_Count_of_Credits, Total_Revenue) %>%
  ggplot(aes(Threshold, Total_Count_of_Credits)) +
    geom_point(color =  "#658feb") +
    geom_vline(xintercept = pull(arrange(whichThreshold_subsidy_revenue2, -Total_Revenue)[1,1]),color = palette5[c(1)]) +
  ylab('Total Counts')  +
  labs(title = "Total Count of Credits by Threshold\n(Down Sample Model)\n",subtitle = "Vertical Line Denotes Optimal Threshold",
       caption = 'Figure 6.4.4')  + 
 plotTheme() , ncol = 2)

output_table <- rbind(whichThreshold_subsidy_revenue[50,],whichThreshold_subsidy_revenue[85,])


kable(output_table) %>%
  kable_styling(font_size = 12, full_width = F,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general = "Table 6.4.1 Total Revenue and Credit by Threshold\n(Full Sample Model)\n",
           general_title= '\n')
Threshold  Total_Revenue  Total_Count_of_Credits
0.50       -130950        130.00
0.85       -2450          156.25

Table 6.4.1 Total Revenue and Credit by Threshold (Full Sample Model)
output_table2 <- rbind(whichThreshold_subsidy_revenue2[50,],whichThreshold_subsidy_revenue2[85,])


kable(output_table2) %>%
  kable_styling(font_size = 12, full_width = F,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general = "Table 6.4.2 Total Revenue and Credit by Threshold\n(Down Sample Model)\n",
           general_title= '\n')
Threshold  Total_Revenue  Total_Count_of_Credits
0.50       -943350        71.5
0.85       -265550        104.5

Table 6.4.2 Total Revenue and Credit by Threshold (Down Sample Model)

VII. Conclusion

Although the accuracy of the stepwise model is relatively high, its sensitivity is very difficult to improve, which is likely caused by the low share of credit takers in the original data. To improve the model, I down-sampled the dataset so that outcomes of 0 and 1 are equally represented. The analysis shows that the optimal thresholds obtained from the two models are similar, while the model trained on the down-sampled set identifies more credit-taking households, helping HCD find more of its target residents. This may lead to more spending, so HCD can work backward from the number of homeowners it intends to reach when deciding how to balance the training dataset. It also suggests collecting new data and retraining the model as the policy is implemented.

Although profiting from the program is not the ultimate goal of the policy, HCD still needs to reconsider its overall effectiveness. If overall investment is limited, could the policy be made more effective by contacting fewer homeowners and offering a higher credit instead? Are there problems with the policy's communication channels, and could the existing communication media be updated?