I.Introduction and Motivation

Mecklenburg County is a county located in the southwestern region of the state of North Carolina, in the United States. As of the 2020 census, the population was 1,115,482, making it the second-most populous county in North Carolina. Its county seat is Charlotte, the state’s largest city. Between 2004 and 2014, Charlotte was ranked as the country’s fastest-growing metro area, with 888,000 new residents.Based on U.S. Census data from 2005 to 2015, Charlotte tops the U.S. in millennial population growth, which indicates a growing demand in the housing market for the next decade. From this point of view, it is important to make a model that is adapted to local indicators and can accurately predict house prices.

In August 2022, Mecklenburg County home prices were up 15.1% compared to last year, selling for a median price of $420K. There were 1,723 homes sold in August this year, down from 2,131 last year. Across the nation, 0.96% of homebuyers searched to move into Charlotte from outside metros. 76% of Charlotte homebuyers searched to stay within the Charlotte metropolitan area.

We can see that even though the up-going price holds people from buying houses a little bit, the demand for homes is still robust. Homebuyers in Mecklenburg County mainly aim to relocate and stay in the metropolitan area. This group of buyers, who are already familiar with the different local resources in Charlotte, will be very conscious of the availability of various public amenities in the area and the quality of the home. we bring in the amenities feature and the building features from their perspectives. There are also immigrants from big cities like New York willing to move to Charlotte. This group of people might consider the basic context of the region. Therefore, census tract data are also important to analysis.

We also refer to Zillow and Redfin’s analysis of the local market, fully consider the price stratification of the house price market as well as the the property itself. Different from zillow and redfin’s predicted valuation, our model focuses on the prediction specific on the local market. This forecast will not consider the impact of time series changes on prices, due to the already clear challenge set as well as time nodes. We will also discard the top houses as training data based on the location of the forecast points inferring the price stratum to which they belong.

II.Data Manipulation and Visualization

2.0 Set Up

In this section, we loaded necessary libraries, created plot theme options and map theme options, and identified functions of quantile breaks, and average nearest neighbor distance for further analysis.

# You can set some global options for knitting chunks

knitr::opts_chunk$set(echo = TRUE)

# Load some libraries

library(tidyverse)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot) # plot correlation plot
library(corrr)      # another way to plot correlation plot
library(kableExtra)
library(jtools)     # for regression model plots
library(ggstance) # to support jtools plots
library(ggpubr)    # plotting R^2 value on ggplot point scatter
library(broom.mixed) # needed for effects plots
library(stargazer)
library(tidycensus)

options(scipen = 999)
options(tigris_class = "sf")

# functions and data directory

source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

palette3 <- c('#ffffd4','#fed98e','#fe9929','#d95f0e','#993404')
palette4 <- c('#4575b4', '#fecc5c')
palette5<-c('#fc8d59','#fee090','#ffffbf','#e0f3f8','#91bfdb')
palette2<-c ('#eff3ff','#bdd7e7','#6baed6','#3182bd','#08519c')

plotTheme <- function(base_size = 12, title_size = 16) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size, colour = "black"), 
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.minor = element_line("grey90", size = 0.01),
    panel.grid.major = element_line("grey90", size = 0.01),
    panel.border = element_rect(colour = "white" , fill=NA, size=0),
    strip.background = element_rect(color = "white",fill='white'),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}

mapTheme <- function(base_size = 12, title_size = 16) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    strip.text.x = element_text(size = 14),
    strip.background = element_rect(fill = "white"))
}
setwd("~/Github/508_houseprice_prediction")
census_api_key("c1c749a5ad57fa58fe19e1403d894e9acef41a97", install=TRUE,overwrite = TRUE)

2.1 Data Wrangling

In this section, we loaded Mecklenburg data and other complementary datasets, conducted feature selection and engineering. Besides, for analysis, we separate the whole dataset into challenge_set and model_set in advance.

2.1.1 Data loading

Seven tract-related datasets were used in this project:

  • Long_commute: Percentage of auto commuters traveling 20 minutes or more to work

  • Crime: crime violent incidences in tracts.

  • Pctgrocery: Percentage of housing units within ½-mile of a full-service, chain grocery store.

  • Pcttest1\2\3: Percentage of Charlotte-Mecklenburg Schools (CMS) students in grades 3-5  6-8  9-12 proficient in End of Grade testing

  • Bachelor: Percent of Bachelor’s Degree

  • Age: Median Age of Residents

  • Census Data: demographic variables from the ACS 2020 for census tracts in Mecklenburg County. We specifically selected 8 variables in the analysis:

    • TotalPop: Total population in each census tract
    • Whites: White population in each census tract
    • MedHHInc: Median household income in each census tract
    • MedRent: Median Rent for properties in each census tract
    • TotalPoverty: Population living under the level of poverty in each census tract
    • TotalVacant: Total vacant unit in each census tract
    • TotalUnit: Total housing units in each census tract
    • RenterOccpied: Total householder lived in renter-occupied housing units

    With variables above, we created the four features to be used in the model building.

    • pctWhite: white population proportion in each census tract
    • pctPoverty: Poverty population proportion in each census tract
    • pctTotalVacant: Vacant unites proportion in each census tract
    • pctRenterOccupied: Proportion of householder living in renter-occupied housing units in each census tract
  • Amenity Data: compilatory data chosen for describing amenities and public service of Mecklenburg County. The datasets we chose encompasses Churches, Colleges and Universities, Libraries, Medical Facilities, Business Investment Opportunity Zones (Polygon),parks, daycare. See each dataset below:

    1. Churches: Churches dataset defines locations classified as religious organizations in Mecklenburg County, data is queried from Master Address Table..

    2. Colleges and Universities: Dataset contains public and private colleges and universities in Mecklenburg County.

    3. Capital Improvement Projects: Dataset identifies capital improvement projects for; Central Piedmont Community College, Charlotte-Mecklenburg Schools, Libraries, Park and Recreation, and other government facilities in Mecklenburg County.

    4. Daycare: Dataset provides NC DHHS licensed daycare locations in Mecklenburg County.

    5. Historic Cemeteries: This dataset represents historic and abandoned burial locations in Mecklenburg County, NC.

    6. Historic Property National: This dataset represents nationally-designated historic properties in Mecklenburg County, NC.

    7. Libraries: A point coverage of the Public Libraries of Charlotte-Mecklenburg, North Carolina system.

    8. Medical Facilities: This dataset was generated to publish a geospatial inventory of Medical Facilities in Mecklenburg County, NC. This data serves to support and assist the Charlotte Mecklenburg Planning Department, The City of Charlotte, and others in Urban Planning decisions.

    9. Business Investment Opportunity Zones(Polygon): This dataset displays areas that are within a 1/4 mile of a commercial property that has a value 70% the County average and a median household income of <= 55K in Mecklenburg County.

    10. Park Locations: Mecklenburg County Park and Recreation park location entrance points.

Notice: For the tract-related data, we get the relevant data by spatial joining the house price points with the track, and for the data based on FACILITIES distribution points, we assign values to the house price information points by the knn algorithm.

#Raw Data:
data<-st_read("~/Github/508_houseprice_prediction/data/studentData.geojson")

#Add age,pricepersq,bathrooms
data<-data %>% mutate(AGE = 2022-yearbuilt,
                                pricepersq = price/heatedarea,
                                bathrooms = fullbaths+halfbaths)
#Census Tract:
tract<-st_read('~/Github/508_houseprice_prediction/data/CensusTract/Census_Tracts_2020.shp')
tract<-tract %>% st_transform(crs=4326)

#Zipcode:
zipcode<-st_read('~/Github/508_houseprice_prediction/data/zipcode/Zipcode.shp')
zipcode<-zipcode %>% st_transform(crs=4326) %>% rename(shapeLeng = shape_Leng,shapeArea = shape_Area) %>% dplyr::select(zip,po_name,shapeLeng,shapeArea)

# Amenities
#1.college:
college<-st_read('./data/college/Colleges.shp')
college.sf<-college %>% st_transform(crs = 4326) %>% dplyr::select(geometry) %>% na.omit() 
#2.daycare:
daycare<-st_read('./data/daycare/Daycare.shp')
daycare.sf<-daycare %>% st_transform(crs=4326)%>% dplyr::select(geometry) %>% na.omit() 

#3.Medical Facilities:
MedicalFacilities<-st_read('./data/medical/MedicalFacilities.shp')
MedicalFacilities.sf<-MedicalFacilities %>% st_transform(crs=4326)%>% dplyr::select(geometry) %>% na.omit()

#4.library:
library<-st_read('./data/library/Library.shp')
library.sf<-library %>% st_transform(crs=4326)%>% dplyr::select(geometry) %>% na.omit() 

#5.Historical cemeteries:
hist_cemeteries<-st_read('./data/historic_cemeteries/Historic_Cemeteries.shp')
hist_cemeteries.sf<-hist_cemeteries %>% st_transform(crs=4326)%>% st_centroid() %>% dplyr::select(geometry) %>% na.omit() 

#6.Historical Properties:
hist_properties<-st_read('./data/historic_properties/HistoricProperty_National.shp')
hist_properties.sf<-hist_properties%>% st_transform(crs=4326)%>%st_centroid() %>% dplyr::select(geometry) %>% na.omit() 

#7.church
church<-st_read('./data/church/Churches.shp')
church.sf<-church %>% st_transform(crs=4326)%>% dplyr::select(geometry) %>% na.omit() 


#8.park location:
park_location <- st_read('./data/Park_Locations/Park_Locations.shp')
park_location.sf <- park_location %>% st_transform(crs=4326)%>%st_centroid() %>% dplyr::select(geometry) %>% na.omit()


#9.Capital Improvement Projects:
capital_improve<-st_read('./data/Capital_Improvement_Projects/Capital_Improvement_Projects.shp')
capital_improve.sf<-capital_improve %>% st_transform(crs=4326)%>% dplyr::select(geometry) %>% na.omit() 

#10.Business Investment Opportunity Zones:
business_invest<-st_read('./data/Business/Business_Investment_Opportunity_Zones.shp')
business_invest<-business_invest %>% st_transform(crs=4326)
business_invest_union<-business_invest %>% st_union() %>% st_sf()
isin<-st_within(data,business_invest_union) %>% lengths>0
isin<-data.frame(isin)
data<-cbind(data,isin)

#spatial features collection

#1.Crime:
crime<-st_read('./data/Crime - Violent.geojson')
crime<-crime %>% st_transform(crs=4326)
crime1<-crime %>% mutate(crime_sum = X2019.raw+X2020.raw+X2021.raw) %>%  dplyr::select(crime_sum) 

#2.Long Commute:
long_commute<-st_read('./data/LongCommute.geojson')
long_commute<-long_commute %>% st_transform(crs=4326) %>% mutate(PctLongCommute = X2020) %>%  dplyr::select(PctLongCommute)

#3.Percent of Grocery:
grocery<-st_read('./data/Proximity to a Grocery Store.geojson')
Pctgrocery<-grocery %>% st_transform(crs=4326) %>% mutate(Pctgrocery = X2021) %>%  dplyr::select(Pctgrocery)

#4.Percent of Test Proficiency in Elementary Schools:
test_elementary<-st_read('./data/Test Proficiency - Elementary School.geojson')
Pcttest1<-test_elementary%>% st_transform(crs=4326) %>%mutate( test.elementary = X2019) %>% dplyr::select(test.elementary)

#5.Percent of Test Proficiency in Middle Schools:
test_middle<-st_read('./data/Test Proficiency - Middle School.geojson')
Pcttest2<-test_middle%>% st_transform(crs=4326) %>%mutate( test.middle = X2019) %>% dplyr::select(test.middle)

#6.Percent of Test Proficiency in High Schools:
test_high<-st_read('./data/Test Proficiency - High School.geojson')
Pcttest3<-test_high%>% st_transform(crs=4326) %>% mutate(test.high = X2019) %>% dplyr::select(test.high)

#7.Percent of Bachelor's Degree:
bachelor<-st_read('./data/Education Level - Bachelor Degree.geojson')
PctBachelor<-bachelor %>% st_transform(crs=4326) %>% mutate(PctBachelor = X2020) %>% dplyr::select(PctBachelor)

#8.Median Age of Residents:
age<-st_read('./data/Age of Residents.geojson')
MedAge<-age %>% st_transform(crs=4326) %>% mutate(MedAge = X2020) %>% dplyr::select(MedAge)
options(tigris_use_cache = TRUE)

acs_variable_list.2020 <- load_variables(2020, 
                                         "acs5", 
                                         cache = TRUE)

tracts20 <- 
  get_acs(geography = "tract", variables = c("B25026_001E","B02001_002E", "B19013_001E","B25058_001E","B06012_002E",                            
"B25002_003E","B25004_002E","B25004_003E","B25004_004E","B25004_005E",
"B25001_001E","B08137_003E"), 
          year=2020, state= 37, county= 119, geometry=T, output="wide")%>% st_transform(crs=4326) %>%
  rename(TotalPop = B25026_001E, 
         Whites = B02001_002E,
         MedHHInc = B19013_001E, 
         MedRent = B25058_001E,
         TotalPoverty = B06012_002E,
         TotalVacant = B25002_003E,
         ForRent = B25004_002E,
         ForRentVac = B25004_003E,
         ForSale = B25004_004E,
         ForSaleVac = B25004_005E,
         TotalUnit = B25001_001E,
         RenterOccupied = B08137_003E
         ) %>%
  dplyr::select(-NAME, -starts_with("B")) %>% #-starts_with("B") awesome!
  mutate(pctWhite = ifelse(TotalPop > 0, Whites / TotalPop,0),
         pctPoverty = ifelse(TotalPop > 0, TotalPoverty / TotalPop, 0),
         pctTotalVacant = ifelse(TotalUnit > 0, TotalVacant / TotalUnit * 100, 0),
         TotalOccupied = TotalUnit - TotalVacant,
         pctRenterOccupied = ifelse(TotalOccupied >0, RenterOccupied/TotalOccupied, 0)) %>%
  dplyr::select(-Whites, -TotalPoverty) 

2.1.2 Feature Engineering

In this section, we created some new features to describe the amenities and public services in Mecklenburg County, as well as to each home sale observation’s internal characteristics.

First of all, we were aimed at creating features to measure the extent to which buyers can assess multiple public services, including Churches, Colleges and Universities, Libraries, Medical Facilities, parks, daycares. To utmost avoid scale bias triggered by setting arbitrary areal unit or fixed buffer distance of each home sale observation, we chose to calculate the average nearest neighbor distance from each home sale to its k nearest neighbor public services. Accordingly, the smaller the value is, the more possibility that the observation can access specific service. To better compare each feature’s impact, we set the k from 3 to 5.

data<-st_join(data,crime1)
data<-st_join(data,long_commute)
data<-st_join(data,Pctgrocery)
data<-st_join(data,Pcttest1)
data<-st_join(data,Pcttest2)
data<-st_join(data,Pcttest3)
data<-st_join(data,PctBachelor)
data<-st_join(data,MedAge)
data<-st_join(data,zipcode)
data<-st_join(data,tracts20)


st_c<-st_coordinates

#1.college:
data <-
  data %>% 
  mutate(
    college_nn1 = nn_function(st_c(data),st_c(college.sf), 1),
    college_nn2 = nn_function(st_c(data),st_c(college.sf), 2),
    college_nn3 = nn_function(st_c(data),st_c(college.sf), 3),
    college_nn4 = nn_function(st_c(data),st_c(college.sf), 4),
    college_nn5 = nn_function(st_c(data),st_c(college.sf), 5),
    ) 

#2.daycare:
data <-
  data %>% 
  mutate(
    daycare_nn1 = nn_function(st_c(data),st_c(daycare.sf), 1),
    daycare_nn2 = nn_function(st_c(data),st_c(daycare.sf), 2),
    daycare_nn3 = nn_function(st_c(data),st_c(daycare.sf), 3),
    daycare_nn4 = nn_function(st_c(data),st_c(daycare.sf), 4),
    daycare_nn5 = nn_function(st_c(data),st_c(daycare.sf), 5),
    )

#3.Medical Facilities:
data <-
  data %>% 
  mutate(
    MedicalFacilities_nn1 = nn_function(st_c(data),st_c(MedicalFacilities.sf), 1),
    MedicalFacilities_nn2 = nn_function(st_c(data),st_c(MedicalFacilities.sf), 2),
    MedicalFacilities_nn3 = nn_function(st_c(data),st_c(MedicalFacilities.sf), 3),
    MedicalFacilities_nn4 = nn_function(st_c(data),st_c(MedicalFacilities.sf), 4),
    MedicalFacilities_nn5 = nn_function(st_c(data),st_c(MedicalFacilities.sf), 5),
    )

#4.library:
data <-
  data %>% 
  mutate(
    library_nn1 = nn_function(st_c(data),st_c(library.sf), 1),
    library_nn2 = nn_function(st_c(data),st_c(library.sf), 2),
    library_nn3 = nn_function(st_c(data),st_c(library.sf), 3),
    library_nn4 = nn_function(st_c(data),st_c(library.sf), 4),
    library_nn5 = nn_function(st_c(data),st_c(library.sf), 5),
    )

#5.Historical cemeteries:
data <-
  data %>% 
  mutate(
    hist_cemeteries_nn1 = nn_function(st_c(data),st_c(hist_cemeteries.sf), 1),
    hist_cemeteries_nn2 = nn_function(st_c(data),st_c(hist_cemeteries.sf), 2),
    hist_cemeteries_nn3 = nn_function(st_c(data),st_c(hist_cemeteries.sf), 3),
    hist_cemeteries_nn4 = nn_function(st_c(data),st_c(hist_cemeteries.sf), 4),
    hist_cemeteries_nn5 = nn_function(st_c(data),st_c(hist_cemeteries.sf), 5),
    )

#6.Historical Properties:
data <-
  data %>% 
  mutate(
    hist_properties_nn1 = nn_function(st_c(data),st_c(hist_properties.sf), 1),
    hist_properties_nn2 = nn_function(st_c(data),st_c(hist_properties.sf), 2),
    hist_properties_nn3 = nn_function(st_c(data),st_c(hist_properties.sf), 3),
    hist_properties_nn4 = nn_function(st_c(data),st_c(hist_properties.sf), 4),
    hist_properties_nn5 = nn_function(st_c(data),st_c(hist_properties.sf), 5),
    )

#7.church
data <-
  data %>% 
  mutate(
    church_nn1 = nn_function(st_c(data),st_c(church.sf), 1),
    church_nn2 = nn_function(st_c(data),st_c(church.sf), 2),
    church_nn3 = nn_function(st_c(data),st_c(church.sf), 3),
    church_nn4 = nn_function(st_c(data),st_c(church.sf), 4),
    church_nn5 = nn_function(st_c(data),st_c(church.sf), 5),
    )

#9.park easement:
data <-
  data %>% 
  mutate(
    park_location_nn1 = nn_function(st_c(data),st_c(park_location.sf), 1),
    park_location_nn2 = nn_function(st_c(data),st_c(park_location.sf), 2),
    park_location_nn3 = nn_function(st_c(data),st_c(park_location.sf), 3),
    park_location_nn4 = nn_function(st_c(data),st_c(park_location.sf), 4),
    park_location_nn5 = nn_function(st_c(data),st_c(park_location.sf), 5),
    )

#10.Capital Improvement Projects:
data <-
  data %>% 
  mutate(
    capital_improve_nn1 = nn_function(st_c(data),st_c(capital_improve.sf), 1),
    capital_improve_nn2 = nn_function(st_c(data),st_c(capital_improve.sf), 2),
    capital_improve_nn3 = nn_function(st_c(data),st_c(capital_improve.sf), 3),
    capital_improve_nn4 = nn_function(st_c(data),st_c(capital_improve.sf), 4),
    capital_improve_nn5 = nn_function(st_c(data),st_c(capital_improve.sf), 5),
    )

2.1.3 Split dataset & price tier classification

Through the histogram comparison we can see that the distribution of housing prices is very uneven and needs to be removed outliers according to certain criteria.

#challenge_set:
challenge_set<-data %>% filter(price==0)
#model_set:
model_set<-data %>% filter(price!=0)

plot1<- ggplot(model_set)+
  geom_histogram(aes(price), position = "dodge",stat = "bin",color = "white",fill= "orange") +  
  labs(title = "Distribution of prices\n(modelsetwith outlier)\n",caption = 'Figure DATA 2.1.1')+ 
  scale_color_grey() + theme_classic()

# define outliers
Q1 = quantile(model_set$price)[2]
Q3 = quantile(model_set$price)[4]
lowerouter <- Q1 - 3*(Q3-Q1)
upperouter <- Q3 + 3*(Q3-Q1)

model_set<-data %>% filter(price!=0 & price < upperouter)
plot2 <- ggplot(model_set)+
  geom_histogram(aes(price), position = "dodge",stat = "bin",color = "white",fill= "orange") +  
  labs(title = "Distribution of prices\n(modelsetwithout outlier)\n",caption = 'Figure DATA 2.1.2')+ theme_classic()

grid.arrange(plot1, plot2, ncol=2)

2.2 Summary Statistics of Features

As required, we provided three tables of summary statistics with variable descriptions for sorting features based on their categories below. For better visualization and analysis, features were divided according to their sources and catergories.

Based on the hedonic home price model, features are put into three categories:

  • Internal Features, features related to the quality of house, like the number of bathrooms;

  • Amenities Features, like hospital accessibility;

  • Tract Features, features related to the spatial characteristics, like census tracts data

In total, we took 99 features into consideration in the initial stage. Detailed summary statistics are provided below.

2.2.1 Internal Features

The table and plots show that the difference between houses is very large, especially in terms of price, area, and age. Also out of instinct, we might consider fullbaths&halfbaths units&bedrooms, shape_area&heated area might correlate with each other.

#Summary and selection of internal features:
internal.variable <- c("price",
                "shape_Area",
                "pricepersq",
                "AGE",
                "heatedarea",
                "numfirepla",
                "fullbaths",
                "halfbaths",
                "bedrooms",
                "units")

internal.features<-model_set[internal.variable] %>%
  st_drop_geometry()
#stargazer:
stargazer(internal.features, type = "html", 
          title = "Table DATA 2.2 Summary Statistics of Internal Characteristics\n(numeric features)\n ",
          header = FALSE,
          single.row = TRUE)
Table DATA 2.2 Summary Statistics of Internal Characteristic
Statistic N Mean St. Dev. Min Max
price 44,288 406,077.200 189,009.900 12,000 1,182,000
shape_Area 44,288 14,878.390 21,314.750 1,139.637 1,032,970.000
pricepersq 44,288 Inf.000 4.464 Inf.000
AGE 44,288 28.497 30.510 0 2,022
heatedarea 44,288 2,260.081 897.891 0 9,183
numfirepla 44,288 0.771 0.450 0 5
fullbaths 44,288 2.217 0.739 0 8
halfbaths 44,288 0.578 0.513 0 4
bedrooms 44,288 3.470 0.919 0 65
units 44,288 0.976 0.195 0 8

We can see that AGE do not have correlation with the dependent variable, which we will consider to remove in the future. Also we can notice that heatedarea has a strong positive influence for price. In addition the relationship of this variable with other variables is very similar to their relationship with price, which suggests that it may be an important indicator in the future that can represent these variables to predict prices. Also we check the relationship between fullbaths&halfbaths units&bedrooms, shape_area&heated area. Surprisingly, fullbaths and halfbaths do not have relation with each other. Bedrooms and units have negative correlation. This suggests that perhaps the distribution of data in this dataset is not what we would expect by common sense.

#Correlation matrix of internal features:
numericVars <- 
  select_if(internal.features, is.numeric) %>% na.omit()
#['#d73027','#fc8d59','#fee090','#ffffbf','#e0f3f8','#91bfdb','#4575b4']
ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c('#4575b4', "#ffffff",'#fecc5c'),
  type="upper",
  insig = "pch") +  
  labs(title = "Correlation across numeric variables\n(internal features)\n", caption = 'Figure DATA 2.2.1') +
  plotTheme()

sel.internal.list <-  c("price",
                "shape_Area",
                "heatedarea",
                "numfirepla",
                "fullbaths",
                "halfbaths",
                "bedrooms",
                "units")

2.2.2 Amenities Features

From the results of the correlation, different amenities features have different values of nn related to price. Park does not have much impact on the overall results and will be removed in the next step of the analysis. capital improvement related variables are meaningful only when n=2. Similarly, college features are related when nn=2. Medical facilities are more meaningful when n>2.

amenities_feature.list<-model_set %>% dplyr::select(
  starts_with('college_'),
  starts_with('daycare_'),
  starts_with('MedicalFacilities_'),
  starts_with('library_'),
  starts_with('hist_'),
  starts_with('church'),
  starts_with('park_'),
  starts_with('capital_'),
  price) %>% 
  st_drop_geometry()
#stargazer:
stargazer(amenities_feature.list, type = "html", 
          title = "Table DATA 2.1 Summary Statistics of Internal Characteristics ",
          out.header = TRUE,
          single.row = TRUE,out="amenities_feature.htm")
Table DATA 2.1 Summary Statistics of Internal Characteristics
Statistic N Mean St. Dev. Min Max
college_nn1 44,288 0.052 0.028 0.001 0.172
college_nn2 44,288 0.066 0.027 0.004 0.175
college_nn3 44,288 0.076 0.030 0.006 0.184
college_nn4 44,288 0.085 0.033 0.006 0.191
college_nn5 44,288 0.093 0.036 0.007 0.206
daycare_nn1 44,288 0.010 0.008 0.00001 0.067
daycare_nn2 44,288 0.012 0.009 0.0004 0.082
daycare_nn3 44,288 0.014 0.010 0.001 0.089
daycare_nn4 44,288 0.016 0.010 0.001 0.093
daycare_nn5 44,288 0.017 0.011 0.001 0.096
MedicalFacilities_nn1 44,288 0.018 0.014 0.0001 0.077
MedicalFacilities_nn2 44,288 0.019 0.014 0.0004 0.081
MedicalFacilities_nn3 44,288 0.021 0.014 0.0004 0.085
MedicalFacilities_nn4 44,288 0.022 0.014 0.0005 0.086
MedicalFacilities_nn5 44,288 0.023 0.014 0.0005 0.088
library_nn1 44,288 0.040 0.022 0.001 0.111
library_nn2 44,288 0.055 0.025 0.003 0.138
library_nn3 44,288 0.064 0.028 0.006 0.164
library_nn4 44,288 0.071 0.030 0.008 0.177
library_nn5 44,288 0.078 0.032 0.010 0.191
hist_cemeteries_nn1 44,288 0.016 0.008 0.0004 0.044
hist_cemeteries_nn2 44,288 0.019 0.007 0.001 0.046
hist_cemeteries_nn3 44,288 0.022 0.007 0.001 0.048
hist_cemeteries_nn4 44,288 0.024 0.008 0.004 0.049
hist_cemeteries_nn5 44,288 0.027 0.008 0.004 0.053
hist_properties_nn1 44,288 0.092 0.056 0.0004 0.254
hist_properties_nn2 44,288 0.095 0.055 0.001 0.260
hist_properties_nn3 44,288 0.101 0.056 0.002 0.263
hist_properties_nn4 44,288 0.105 0.056 0.008 0.266
hist_properties_nn5 44,288 0.109 0.056 0.014 0.270
church_nn1 44,288 0.009 0.006 0.0001 0.046
church_nn2 44,288 0.011 0.007 0.0001 0.046
church_nn3 44,288 0.012 0.007 0.0003 0.047
church_nn4 44,288 0.013 0.007 0.001 0.050
church_nn5 44,288 0.014 0.008 0.001 0.054
park_location_nn1 44,288 0.012 0.008 0.0002 0.052
park_location_nn2 44,288 0.015 0.008 0.001 0.052
park_location_nn3 44,288 0.017 0.008 0.001 0.059
park_location_nn4 44,288 0.020 0.009 0.001 0.065
park_location_nn5 44,288 0.021 0.009 0.002 0.072
capital_improve_nn1 44,288 0.020 0.015 0.00004 0.114
capital_improve_nn2 44,288 0.024 0.015 0.0004 0.121
capital_improve_nn3 44,288 0.028 0.015 0.001 0.127
capital_improve_nn4 44,288 0.031 0.016 0.004 0.132
capital_improve_nn5 44,288 0.034 0.017 0.005 0.136
price 44,288 406,077.200 189,009.900 12,000 1,182,000
ggcorrplot(
  round(cor(amenities_feature.list), 1), 
  p.mat = cor_pmat(amenities_feature.list),
  colors = c('#4575b4', "#ffffff",'#fecc5c'),
  type="upper",
  insig = "pch") +  
  labs(title = "Correlation across numeric variables\n(amenity features)\n",
       caption = 'Figure DATA 2.2.2')+plotTheme()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size=7),
        axis.text.y = element_text(size=7))

#Amenities
sel.amenitie<-model_set%>% dplyr::select(
  starts_with('college_'),
  starts_with('daycare_'),
  starts_with('MedicalFacilities_'),
  starts_with('library_'),
  starts_with('hist_'),
  starts_with('church'),
  starts_with('capital_'),
  price) %>% 
  st_drop_geometry()

2.2.3 Tract Features

From the result of correlation, Most tract features are price-dependent. pctTotalVacant has nothing to do with price and can be deleted. pctBachelor,test.high, test.middle, test.elementary are correlated can keep one, MedAge, MedHHInc, MedRent highly relevant, can keep one.

#Tract Feature
tract.list <- c("pctTotalVacant", "pctRenterOccupied", "pctWhite","pctPoverty","MedRent" ,"MedHHInc","MedAge","crime_sum", "PctLongCommute", "Pctgrocery", "test.elementary","test.middle", "test.high","PctBachelor",'price')

tract_feature.list<-model_set[tract.list]%>% 
  st_drop_geometry()

numeric.tract<- 
  select_if(tract_feature.list, is.numeric) %>% na.omit()

ggcorrplot(
  round(cor(numeric.tract), 1), 
  p.mat = cor_pmat(numeric.tract),
  colors = c('#4575b4', "#ffffff",'#fecc5c'),
  type="upper",
  insig = "pch") +  
  labs(title = "Correlation across numeric variables\n public service features\n",
       caption = 'Figure DATA 2.2.3')+plotTheme()+
  theme(axis.title.x = element_text(size=3),axis.title.y = element_text(size=3))

sel.tract.list<- c( "pctRenterOccupied", "pctWhite","pctPoverty","MedRent" ,"MedHHInc","MedAge","crime_sum", "PctLongCommute", "Pctgrocery",  "test.elementary","PctBachelor",'price')

2.2.4 Nonnumeric Features

From the histograms we can see that most of the indicators are correlated with prices, and that different indicators affect different ranges of prices: - bldggrade(building grade) and extwall related to the high price as well as the price of the low price range. - aheatingty and storyheigh relate to the central and high price range - heatedfuel relates to the central price range

Based on their distribution in different price ranges, we reclassify the nonnumeric feature to facilitate better training learning.

numeric<- select_if(data, is.numeric)%>% na.omit()%>% 
  st_drop_geometry()
nonnumeric <- select_if(data, !(colnames(data)%in% colnames(numeric)))
# 1.Modify extwall:
extwall.stone<-c("STONE")
data <- 
  data %>%
  mutate(extwall.m = ifelse(extwall %in% extwall.stone,"STONE","Non-STONE"))

# 2.Modify Aheatingty:
hotwater<-c("HOT WATER")
data<-data %>% mutate(aheatingty.m = ifelse(aheatingty %in% hotwater,"HOT WATER","ELSE" ))

# 3.Modify heatedfuel:
gas<-c("GAS")
data<-data %>% mutate(heatedfuel.m = ifelse(heatedfuel %in% gas,"GAS","ELSE" ))

# 4.Modify bldggrade
fairplus<-c("FAIR","MINIMUM","NA")
data<-data %>% mutate(bldggrade.m = ifelse(bldggrade %in% fairplus,"ELSE",bldggrade))

# 5.Modify storyheigh
storyelse<-c("NA","SPLIT LEVEL","BI-LEVEL","CAPE COD","RANCH W/BSMT")
data<-data %>% mutate(storyheigh.m = ifelse(storyheigh %in% storyelse,"ELSE",storyheigh ))

#非数字的列
nonnumeric.feature1 <- c("descbuildi", "storyheigh.m", "aheatingty.m", "heatedfuel.m", "extwall.m", "foundation", "bldggrade.m","isin","price")
model_set %>% 
  st_drop_geometry() %>% 
  dplyr::select(price,extwall,storyheigh,heatedfuel,aheatingty,bldggrade) %>%
  filter(price <= upperouter) %>%
  gather(Variable, Value, -price) %>% 
  ggplot(aes(Value, price)) +
  geom_bar(position = "dodge", stat = "summary", fun = mean, fill= '#abd9e9') +
  facet_wrap(~Variable, ncol = 3, scales = "free") +
  labs(title = "Price as a function of\nnonnumeric variables", y = "Mean_Price",
       caption = 'Figure DATA 2.3') +plotTheme()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size=6),  strip.background = element_rect(fill = "white"),panel.grid.minor =element_blank() ,panel.grid.major =element_blank())

2.3 Home Price Correlation Scatterplots

In this section, we specifically explored the correlation between sale price and four interesting and meaningful features: Heated Area, Crime Summary, Distance to capital improvement Projects, Distance to libraries.

Through graphs below, it is obvious that the first two of four features have strong correlation with sale price, while for the distance to amenities, there is no particularly strong correlation with price. But in general distance and price are inversely correlated, which is consistent with our intuition. Meanwhile, by comparing the scatter plot and correlation of capital improvement and library, we can see that there is a specific distance boundary for capital improvement, while the distance to library has no specific boundary although it is related to price.

# DATA - 4 Scatter plot  
## 1. Two House Internal characteristics
features <- c('Heated Area', 'Crime Summary')
names(features) <- c("heatedarea", "crime_sum")
price.house.plot <- 
  st_drop_geometry(model_set) %>% 
  dplyr::select(price, heatedarea, crime_sum) %>%
  filter(price <= upperouter) %>%
  gather(Variable, Value, -price) %>% 
  ggplot(aes(Value, price)) +
  geom_point(size = .5, colour='#abd9e9',alpha=0.5) + 
  geom_smooth(method = "lm", se=F, colour = 'orange') +
  facet_wrap(~Variable, ncol = 2, scales = "free",
             labeller = labeller(Variable = features)) +
  plotTheme() + 
  labs(title = "Price as a Function of House Internal Characteristics Variables:\nCrime Summary, Heated Area\n",
       caption = 'Figure DATA 2.4.1') 

price.house.plot

## 2. Average Distance of Nearest 2 capital improvement projects 
price.capital_improve_nn2.plot <-
  st_drop_geometry(model_set) %>% 
  dplyr::select(price, capital_improve_nn2) %>%
  filter(price <= upperouter) %>%
  gather(Variable, Value, -price) %>% 
  ggplot(aes(Value, price)) +
  geom_point(size = .5,colour='#abd9e9',alpha=0.5) + 
  geom_smooth(method = "lm", se=F, colour = 'orange') +
  labs(title = "Price as a Function of Average Distance of\nNearest Two Capital Improvement Projects\n",subtitle="Mecklenburg, NC",
       caption = 'Figure DATA 2.4.2') +
  plotTheme()

price.capital_improve_nn2.plot

## 3. Average Distance of Nearest 4 libraries 
price.library_nn4.plot <-
  st_drop_geometry(model_set) %>% 
  dplyr::select(price, library_nn4) %>%
  filter(price <= upperouter) %>%
  gather(Variable, Value, -price) %>% 
  ggplot(aes(Value, price)) +
  geom_point(size = .5, colour='#abd9e9',alpha=0.5) + 
  geom_smooth(method = "lm", se=F, colour = 'orange') +
  labs(title = "Price as a Function of Average Distance of\nNearest Four Libraries\n",subtitle="Mecklenburg, NC",
       caption = 'Figure DATA 2.4.3') +
  plotTheme()

price.library_nn4.plot

2.5 Map Home Price

The sale price per SqFt in Mecklenberg shows a clear spatial pattern. The sale price in Belk, Lofton, Harris is significantly higher than other areas.

#Plot of home price per square feet:
ggplot()+
  geom_sf(data=st_union(tract),fill="white", color = "grey")+
  geom_sf(data=model_set,aes(color=q5(pricepersq)),show.legend="point",size=.75)+
  scale_color_manual(values = palette2,labels = qBr(model_set, "pricepersq"),name = "House Price\nPer Square Feet\n(Quintile Breaks)")+
  labs(title="House Price",subtitle = "Mecklenberg County,NC",caption="Figure2.5")+
  mapTheme()

2.6 Maps of Independent Variables

After having a preliminary idea of the correlation between variables and the distribution of house prices, we would like to map the correlation with the age of the house, which is not particularly correlated with house prices, to see if the age of the house is completely irrelevant to the overall price distribution.The map results are unexpected, although in some parts of age and the distribution of housing prices are not related, but for some areas, the areas with high housing age but higher prices, but this may be due to these areas are located in the old city, housing construction time but other resources in the district is more excellent.

ggplot()+
  geom_sf(data=model_set,aes(color=q5(AGE)),alpha= 0.5,show.legend = "point",size=.75)+
  geom_sf(data=st_union(tract),fill= "transparent", color = "grey")+
  scale_color_manual(values = palette2,labels = qBr(model_set, "AGE"),name = "Quintile Breaks\n(Year)")+
  labs(title="Distribution of the House Age",subtitle = "Based on Each Home Sale Observation\nMecklenburg, NC",caption="Figure2.6.1")+mapTheme()

According to the results of correlation we know that the accessibility of daycare facilities, testscore and price are positively correlated, but it is not clear how much overlap they have in geographic distribution, here we would like to compare and observe.

pcttest1.plot <- ggplot()+
  geom_sf(data=st_union(tract),fill="white", color = "grey")+
  geom_sf(data=model_set,aes(color=q5(test.elementary)),show.legend = "point",size=.75)+
  scale_color_manual(values = c('#eff3ff','#bdd7e7','#6baed6','#3182bd','#08519c'),labels = qBr(model_set, "test.elementary"),name = "Quintile Breaks\n(Percent)")+mapTheme()+
  labs(title="Test Proficiency - Elementary School",subtitle = "Percentage of Charlotte-Mecklenburg Schools (CMS) students \nin grades 3-5 proficient in End of Grade testing",caption="Figure2.6.2")

daycare_nn4 <- model_set %>%
  mutate(daycare_nn4.expanded = daycare_nn4 * 1000)

daycare.plot <- ggplot() + 
  geom_sf(data = st_union(tract),fill ="white", color = "grey") +
  geom_sf(data = daycare_nn4,aes(color = q5(daycare_nn4.expanded)),size = .75) +
  scale_color_manual(values = palette2,
                     labels = qBr(daycare_nn4,'daycare_nn4.expanded'),name = "Quintile Breaks\n(Percent)") +  mapTheme() +
  labs(title = 'Average Distrance to Nearest 4 Daycare Center',
       subtitle =  "Average Distrance to Nearest Daycare\n(timed by 1000 for better visualization)",
       caption = 'Figure2.6.3') 

grid.arrange(pcttest1.plot, daycare.plot, ncol=2)

2.7 Detection of NA values

To further understand the data set to reduce the error, we counted the NA values of each variable and selected as few as possible with NA values.

#challenge_set
challenge_set<-data %>% filter(price==0)
#model_set
model_set<-data %>% filter(price!=0)
# na
na <- colSums(is.na(model_set))
na.100 <-as.data.frame(subset(na, na>100))
na.100<-tibble::rownames_to_column(na.100, "col_names")

ggplot(na.100) +
  geom_col(aes(x=col_names, y=subset(na, na>100)),fill = '#fecc5c',color="white") +
  labs(title="NA counts in DataFrame",
       subtitle = "Na > 100\n",
       caption = "Figure DATA 2.7") +
  xlab('Variable Column Names') +
  ylab('Count') + plotTheme()+theme(axis.text.x = element_text(angle = 45, hjust = 1, size=8))

III.Methods

The primary goal of this analysis is to train a robust model to predict house price as accurately as possible. Accordingly, this section started to provide a detailed interpretation on the process of the model building.

  • Data and Sample:In order to accurately build and test models, we randomly split the whole model_set into two parts: model_set.train with 77% of observations for model training and model_set.test with 23% of observations for model testing.

  • Dependent Variable: The dependent variable is home sale price: price.

  • Independent Variables: Based on the hedonic home price model, features are put into three catogories. After creating features and engineering and checking correlation coefficients, 88 features are eligible for the model and obviously or slightly correlated with price that were the initial independent variables for model building. We gradually screened out variables and eventually selected only 26 robust features included in the final model

  • Outliers: For the determination of outlier we refer to zillow’s price tier classification: The ZHVI generally applies to three market tiers - bottom, middle and top - roughly equivalent to one-third of each market. The mid-tier ZHVI includes all homes between 35% and 65% of the value of the home. This mid-market range is closest to the standard definition of a “typical” home in a region, and because of the geographic distribution of this challenge we primarily considered a model with a price range in the middle of the market for training purposes.

  • Procedure: The primitive approach used in this analysis for model building is Ordinary Least Squares Regression (OLS) which is aimed at predicting the price of houses as a function of several statistical parameters such as intercept, slope and selected features. In order to determine which features were able to significantly increase the accuracy and generalizability of the model, we systematically compared, plotted, and mapping the mean absolute errors (MAE), mean absolute percent errors (MAPE), as well as root mean square errors (RMSE, the standard deviation of the residuals) between original SalePrice and predicted price of both training set and test set, conducted the cross-validation on 100 folds, explored the underlying spatial pattern of price and errors caused by problematic prediction. To address the problem of neglecting the underlying spatial correlation caused by neighborhood, we then compared, plotted, and mapped the mean absolute errors (MAE), and mean absolute percent errors (MAPE) on the test set between baseline model without neighborhood features and updated model with neighborhood features, and then determined our final model based on series model examinations.

IV.Result

4.1 Dataset Splitting and Model Building

Initially, as indicated in the III. Methods, the dataset was divided into training set and test set, with the ratio of 6:4. Here we use paste()command to make sure to include all levels of our specified variables in the training set, therefore, in the end, the raio of training set and test set goes up to 77:23.

#1.Split Model Set:
set.seed(7278)

# get index for training sample:
inTrain<-caret::createDataPartition(
  y = paste(model_set$descbuildi,model_set$storyheigh,
            model_set$extwall,model_set$bldggrade,
            model_set$price),
  p = .60, 
  list=FALSE)

#Find the number of outliers:
Q1<-quantile(model_set$price)[2]
Q3<-quantile(model_set$price)[4]

number_out = Q3+3*(Q3-Q1)

# split model_set into train and test
model_set.train0<-model_set[inTrain,]
model_set.train<-model_set[inTrain,] %>% 
                  filter(price<=number_out)
model_set.test<-model_set[-inTrain,]

outlier<-model_set[inTrain,] %>% 
                  filter(price>Q3+3*(Q3-Q1))

# Feature Collection
##Non-numeric Features
nonnumeric <- model_set.train %>% 
  dplyr::select(all_of(nonnumeric.feature1))%>% 
  st_drop_geometry()

##Internal Features
sel.internal<-model_set.train %>% 
  dplyr::select(all_of(sel.internal.list)) %>% 
  st_drop_geometry()

##Tract Features
sel.tract<-model_set.train %>% 
  dplyr::select(all_of(sel.tract.list)) %>% 
  st_drop_geometry()

##AmenitY Features
sel.amenitie1<-model_set.train%>% 
  dplyr::select(
  starts_with('college_'),
  starts_with('daycare_'),
  starts_with('MedicalFacilities_'),
  starts_with('library_'),
  starts_with('hist_'),
  starts_with('church'),
  starts_with('capital_'),
  starts_with('park_l'),
  price) %>% 
  st_drop_geometry()

Secondly, we built three models with different features to examine which features should be included.

  • Model 0 included all the features we were interested in.
  • Model 1 excluded some of features which were not significantly correlated with price in Model 1.
  • Model 2 continued to exclude some of features which were not significantly correlated with SalePrice in Model 2 or have strong correlation with other features (considering multicollinearity), but still remained the ones with comparatively large correlation coefficients. Model 2 was the baseline model in this analysis.

The baseline model contained the following features.

  • 1 non-numeric feature: bldggrade.m;
  • 5 internal characteristic features: shape_Area,heatedarea,fullbaths,bedrooms, units;
  • 10 tract features: pctRenterOccupied,pctWhite,MedHHInc,test.elementary,crime_sum,PctLongCommute,Pctgrocery,isin;
  • 4 amenity features: college_nn5,`library_nn2,hist_properties_nn5, capital_improve_nn2.

Model summaries were all shown below.

train_set.all<-model_set.train %>% dplyr::select(price,
  all_of(nonnumeric.feature1),all_of(sel.internal.list),all_of(sel.tract.list),
  starts_with('college_'),
  starts_with('daycare_'),
  starts_with('MedicalFacilities_'),
  starts_with('library_'),
  starts_with('hist_'),
  starts_with('church'),
  starts_with('capital_'),
  starts_with('park_l')
) %>% st_drop_geometry()

M0 <- lm(price ~ ., data = train_set.all)

stargazer(M0, type = "html", 
          title = "Table 4.1.1 Summary Statistics of Model.1 ",
          header = FALSE,
          single.row = TRUE)
Table 4.1.1 Summary Statistics of Model.1
Dependent variable:
price
descbuildiGROUP HOME -41,288.880 (106,072.300)
descbuildiPATIO HOME -61,280.650** (25,644.670)
descbuildiRES -9,986.222 (7,808.843)
descbuildiSFR MODULAR -29,127.580 (29,352.890)
descbuildiSFR TINY -43,556.940 (75,228.800)
storyheigh.m1 STORY -27,773.900 (74,947.140)
storyheigh.m1.5 STORY -20,701.790 (74,958.720)
storyheigh.m2.0 STORY -42,492.290 (74,930.780)
storyheigh.m2.5 STORY -64,654.710 (75,016.770)
storyheigh.m3.0 STORY -85,106.020 (76,397.190)
storyheigh.mELSE -49,622.480 (74,995.120)
aheatingty.mHOT WATER 32,527.840 (29,398.270)
heatedfuel.mGAS 7,524.349*** (1,856.781)
extwall.mSTONE 8,385.035 (26,489.620)
foundationCRAWL SPACE 23,712.870*** (4,803.560)
foundationPIER-NO FOUND WALL 4,674.889 (75,058.990)
foundationSLAB-ABV GRD -16,376.690 (21,325.410)
foundationSLAB-COM 9,560.336 (53,156.830)
foundationSLAB-RES 15,284.410*** (4,905.522)
bldggrade.mCUSTOM 133,211.000*** (32,134.720)
bldggrade.mELSE -40,230.530*** (4,468.043)
bldggrade.mEXCELLENT 254,977.500*** (7,768.602)
bldggrade.mGOOD 64,956.860*** (1,812.582)
bldggrade.mVERY GOOD 176,920.800*** (2,932.553)
isin -8,576.152*** (2,232.975)
shape_Area 0.499*** (0.029)
heatedarea 82.809*** (1.352)
numfirepla 8,501.768*** (1,524.920)
fullbaths 24,863.760*** (1,386.495)
halfbaths 9,889.747*** (1,564.081)
bedrooms -3,807.025*** (801.045)
units -42,573.770*** (3,357.588)
pctRenterOccupied 38,936.350*** (3,818.520)
pctWhite 90,014.570*** (5,055.748)
pctPoverty 71,316.520*** (11,335.610)
MedRent 2.583 (2.971)
MedHHInc 0.402*** (0.040)
MedAge 729.766*** (122.373)
crime_sum -158.630*** (18.376)
PctLongCommute -542.353*** (66.427)
Pctgrocery 293.203*** (29.541)
test.elementary 770.958*** (67.549)
PctBachelor 803.744*** (66.697)
college_nn1 232,770.200*** (87,164.980)
college_nn2 -1,676,351.000*** (294,463.700)
college_nn3 1,721,574.000*** (388,118.200)
college_nn4 4,087,610.000*** (656,997.100)
college_nn5 -5,557,639.000*** (482,525.800)
daycare_nn1 -866,608.500** (346,049.700)
daycare_nn2 2,660,691.000*** (910,155.500)
daycare_nn3 -2,323,108.000 (1,590,074.000)
daycare_nn4 -2,786,717.000 (2,036,884.000)
daycare_nn5 4,087,477.000*** (1,168,428.000)
MedicalFacilities_nn1 -1,636,967.000*** (321,208.000)
MedicalFacilities_nn2 2,554,779.000** (999,597.600)
MedicalFacilities_nn3 1,203,957.000 (1,958,851.000)
MedicalFacilities_nn4 -4,606,285.000* (2,636,509.000)
MedicalFacilities_nn5 2,079,076.000 (1,448,033.000)
library_nn1 -312,612.700*** (99,276.840)
library_nn2 -1,966,036.000*** (311,835.800)
library_nn3 5,982,407.000*** (553,155.900)
library_nn4 -4,744,998.000*** (500,061.800)
library_nn5 245,433.500 (276,283.900)
hist_cemeteries_nn1 -32,462.860 (224,511.400)
hist_cemeteries_nn2 -503,041.600 (574,705.000)
hist_cemeteries_nn3 2,288,197.000** (926,490.600)
hist_cemeteries_nn4 -413,975.500 (1,139,783.000)
hist_cemeteries_nn5 -1,026,067.000 (688,260.800)
hist_properties_nn1 344,364.800* (206,602.700)
hist_properties_nn2 -301,629.400 (381,469.600)
hist_properties_nn3 260,879.700 (717,215.700)
hist_properties_nn4 1,527,876.000* (876,978.600)
hist_properties_nn5 -2,154,458.000*** (518,231.600)
church_nn1 288,215.100 (457,448.300)
church_nn2 -353,314.400 (1,029,434.000)
church_nn3 -1,838,581.000 (1,703,750.000)
church_nn4 8,361,436.000*** (2,664,523.000)
church_nn5 -6,927,791.000*** (1,630,967.000)
capital_improve_nn1 412,495.100** (186,382.100)
capital_improve_nn2 -2,059,174.000*** (462,101.700)
capital_improve_nn3 5,866,141.000*** (800,195.700)
capital_improve_nn4 -7,140,197.000*** (973,193.400)
capital_improve_nn5 5,039,698.000*** (595,806.800)
park_location_nn1 510,160.900* (278,551.900)
park_location_nn2 3,010,701.000*** (672,624.800)
park_location_nn3 -3,473,263.000*** (1,057,989.000)
park_location_nn4 -2,605,383.000 (1,619,162.000)
park_location_nn5 2,376,481.000** (1,020,605.000)
Constant 180,448.100** (76,328.620)
Observations 31,287
R2 0.712
Adjusted R2 0.711
Residual Std. Error 105,738.300 (df = 31198)
F Statistic 876.185*** (df = 88; 31198)
Note: p<0.1; p<0.05; p<0.01
train_set.1<-model_set.train %>% dplyr::select(
  price,
  #Non-numeric Feature
  bldggrade.m,
  #Internal Feature
  shape_Area ,heatedarea,fullbaths,bedrooms, units,
  #Tract Feature
  pctRenterOccupied , pctWhite ,pctPoverty,MedHHInc,
  test.elementary,crime_sum,PctLongCommute,Pctgrocery,MedAge,isin,
  #Amenities Feature
  college_nn4,college_nn5,library_nn3,library_nn4,
  library_nn2,hist_properties_nn5,capital_improve_nn4,
  capital_improve_nn5,park_location_nn2,
  park_location_nn3) %>% st_drop_geometry()

M1 <- lm(price ~ ., data = train_set.1)

stargazer(M1, type = "html", 
          title = "Table 4.1.2 Summary Statistics of Model.2 ",
          header = FALSE,
          single.row = TRUE)
Table 4.1.2 Summary Statistics of Model.2
Dependent variable:
price
bldggrade.mCUSTOM 158,633.500*** (32,722.530)
bldggrade.mELSE -39,671.670*** (4,475.939)
bldggrade.mEXCELLENT 268,849.500*** (7,754.282)
bldggrade.mGOOD 73,074.650*** (1,690.832)
bldggrade.mVERY GOOD 190,765.300*** (2,771.332)
shape_Area 0.533*** (0.027)
heatedarea 82.655*** (1.169)
fullbaths 19,889.830*** (1,205.370)
bedrooms -5,792.850*** (779.885)
units -41,056.630*** (2,945.156)
pctRenterOccupied 49,516.610*** (3,591.019)
pctWhite 116,152.900*** (4,193.958)
pctPoverty 48,076.890*** (10,545.090)
MedHHInc 0.623*** (0.032)
test.elementary 1,166.981*** (61.624)
crime_sum -156.560*** (17.586)
PctLongCommute -645.459*** (60.158)
Pctgrocery 378.763*** (28.141)
MedAge 386.205*** (112.264)
isin -6,736.273*** (2,197.607)
college_nn4 2,993,805.000*** (175,618.500)
college_nn5 -3,925,922.000*** (162,503.400)
library_nn3 8,355,500.000*** (461,825.900)
library_nn4 -6,547,574.000*** (330,246.800)
library_nn2 -2,814,707.000*** (185,514.300)
hist_properties_nn5 -258,243.100*** (21,605.490)
capital_improve_nn4 -2,784,548.000*** (384,867.600)
capital_improve_nn5 4,769,145.000*** (400,984.800)
park_location_nn2 2,742,174.000*** (371,010.500)
park_location_nn3 -3,218,861.000*** (384,560.700)
Constant 177,698.700*** (8,424.407)
Observations 32,671
R2 0.705
Adjusted R2 0.705
Residual Std. Error 107,947.200 (df = 32640)
F Statistic 2,605.773*** (df = 30; 32640)
Note: p<0.1; p<0.05; p<0.01
train_set.2<-model_set.train %>%
  dplyr::select(
  price,
  #Non-numeric Feature
  bldggrade.m,
  #Internal Feature
  shape_Area ,heatedarea,fullbaths,bedrooms , units,
  #Tract Feature
  pctRenterOccupied , pctWhite,MedHHInc,test.elementary,crime_sum,PctLongCommute,Pctgrocery,isin,
  #Amenities Feature
  college_nn5,library_nn2,hist_properties_nn5,capital_improve_nn2
 ) %>% st_drop_geometry()

M2 <- lm(price ~ ., data = train_set.2)

stargazer(M2, type = "html", 
          title = "Table 4.1.3 Summary Statistics of Model.2 ",
          header = FALSE,
          single.row = TRUE)
Table 4.1.3 Summary Statistics of Model.2
Dependent variable:
price
bldggrade.mCUSTOM 187,603.300*** (33,322.080)
bldggrade.mELSE -44,502.180*** (4,551.078)
bldggrade.mEXCELLENT 280,653.900*** (7,873.788)
bldggrade.mGOOD 80,274.340*** (1,695.223)
bldggrade.mVERY GOOD 199,805.200*** (2,779.488)
shape_Area 0.517*** (0.027)
heatedarea 80.575*** (1.179)
fullbaths 20,678.420*** (1,224.301)
bedrooms -6,848.791*** (791.729)
units -40,946.250*** (2,990.728)
pctRenterOccupied 57,237.990*** (3,487.492)
pctWhite 123,198.500*** (4,081.952)
MedHHInc 0.682*** (0.029)
test.elementary 1,110.838*** (61.527)
crime_sum -38.499** (17.298)
PctLongCommute -458.964*** (58.735)
Pctgrocery 326.338*** (28.401)
isin 3,863.954* (2,191.058)
college_nn5 -1,027,085.000*** (27,256.650)
library_nn2 -705,187.300*** (33,561.000)
hist_properties_nn5 -133,047.700*** (14,653.380)
capital_improve_nn2 814,010.000*** (53,984.940)
Constant 147,814.400*** (6,864.642)
Observations 32,799
R2 0.695
Adjusted R2 0.695
Residual Std. Error 109,970.800 (df = 32776)
F Statistic 3,392.057*** (df = 22; 32776)
Note: p<0.1; p<0.05; p<0.01

4.2 Training Set Summary Results

The results of the regression on the training set are in the following table. The current model embraced 18 independent features and reached 0.695 R-square, indicating that 69.5% of the variance for price can be explained by this model.

train_set.2<-model_set.train %>%
  dplyr::select(
  price,
  #Non-numeric Feature
  bldggrade.m,
  #Internal Feature
  shape_Area ,heatedarea,fullbaths,bedrooms , units,
  #Tract Feature
  pctRenterOccupied , pctWhite,MedHHInc,test.elementary,crime_sum,PctLongCommute,Pctgrocery,isin,
  #Amenities Feature
  college_nn5,library_nn2,hist_properties_nn5,capital_improve_nn2
 ) %>% st_drop_geometry()

M2 <- lm(price ~ ., data = train_set.2)

stargazer(M2, type = "html", 
          title = "Table 4.2.1 Summary Statistics of Model.2 ",
          header = FALSE,
          single.row = TRUE)
Table 4.2.1 Summary Statistics of Model.2
Dependent variable:
price
bldggrade.mCUSTOM 187,603.300*** (33,322.080)
bldggrade.mELSE -44,502.180*** (4,551.078)
bldggrade.mEXCELLENT 280,653.900*** (7,873.788)
bldggrade.mGOOD 80,274.340*** (1,695.223)
bldggrade.mVERY GOOD 199,805.200*** (2,779.488)
shape_Area 0.517*** (0.027)
heatedarea 80.575*** (1.179)
fullbaths 20,678.420*** (1,224.301)
bedrooms -6,848.791*** (791.729)
units -40,946.250*** (2,990.728)
pctRenterOccupied 57,237.990*** (3,487.492)
pctWhite 123,198.500*** (4,081.952)
MedHHInc 0.682*** (0.029)
test.elementary 1,110.838*** (61.527)
crime_sum -38.499** (17.298)
PctLongCommute -458.964*** (58.735)
Pctgrocery 326.338*** (28.401)
isin 3,863.954* (2,191.058)
college_nn5 -1,027,085.000*** (27,256.650)
library_nn2 -705,187.300*** (33,561.000)
hist_properties_nn5 -133,047.700*** (14,653.380)
capital_improve_nn2 814,010.000*** (53,984.940)
Constant 147,814.400*** (6,864.642)
Observations 32,799
R2 0.695
Adjusted R2 0.695
Residual Std. Error 109,970.800 (df = 32776)
F Statistic 3,392.057*** (df = 22; 32776)
Note: p<0.1; p<0.05; p<0.01
broom::glance(M2) %>%
  kable(caption = 'Table 4.2.2 Summary of Model.2 Evaluation Parameters') %>%
  kable_styling("striped", full_width = F)
Table 4.2.2 Summary of Model.2 Evaluation Parameters
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.6948266 0.6946218 109970.8 3392.057 0 22 -427258.1 854564.2 854765.7 396379127629018 32776 32799

4.3 MAE, MAPE on Test Set

The Mean Absolute Errors (MAE) and Mean Absolute Percent Errors (MAPE) are statistical measures of how accurate a forecast system is.Based on the following table, the MAE is 67903.84, the errors are accounted for 19.20%.

training_model<-M2

train_model_set.test<-model_set.test %>% 
  mutate(Regression = "Baseline Regression",
         price.Predict = predict(training_model, newdata = model_set.test),
         price.Error = price - price.Predict,
         price.AbsError = abs(price - price.Predict),
         price.APE = (abs(price - price.Predict)) / price) 
  
#%>% filter(price<=1500000)

test_GoodnessFit<-as.data.frame(cbind(mean(train_model_set.test$price.AbsError,na.rm = T),
                                      mean(train_model_set.test$price.APE,na.rm = T)))

colnames(test_GoodnessFit)<-c("Mean Absolute Errors (MAE)","MAPE")
kable(test_GoodnessFit,
      caption = 'Table 4.3.1. MAE and MAPE for Test Set') %>%
  kable_styling("striped", full_width = F)
Table 4.3.1. MAE and MAPE for Test Set
Mean Absolute Errors (MAE) MAPE
67903.84 0.1920308
grid.arrange(ncol=2,
ggplot()+
  geom_sf(data=st_union(zipcode), fill="transparent", color='grey90')+
  geom_sf(data=train_model_set.test,
          aes(color=q5(price)),show.legend="point",size=1.0)+
  scale_color_manual(values = palette2,labels = qBr(train_model_set.test, "price"),name = "Quintile Breaks")+
  labs(title="Test set sale price",subtitle = "Mecklenberg County,NC",caption="Figure 4.3.1")+
  mapTheme(),

ggplot()+
  geom_sf(data=st_union(zipcode), fill="transparent", color='grey90')+
  geom_sf(data=train_model_set.test,
          aes(color=q5(price.AbsError)),show.legend="point",size=1.0)+
  scale_color_manual(values = palette2,labels = qBr(train_model_set.test, "price.AbsError"),name = "Quintile Breaks")+
  labs(title="Test set absolute sale price errors",subtitle = "Mecklenberg County,NC",caption="Figure 4.3.2")+
  mapTheme())

4.4 Cross-Validation

This section conducted cross-validation to assess how our model would generalize to other independent data sets, in which we could flag problem of overfitting or feature selection bias.

Both a summary table and distribution plots of R-square, RMSE, and MAE in each of 100 fold are shown below.

The variation of 100 folds can also be visualized with a histogram of across-fold MAE. Although most of the errors are clustering tightly together, some errors are scattered, indicating the inconsistency of the model prediction.

# 4. Cross-Validation Test on traning model
##use caret package cross-validation method
fitControl <- caret:: trainControl(method = "cv", 
                                   number = 100,
                                   savePredictions = TRUE)

set.seed(2777)

model_set.train.cv <- 
  caret::train(price ~ ., data = st_drop_geometry(train_model_set.test) %>% 
                 dplyr::select(
  price,
  #Non-numeric Feature
  bldggrade.m,
  #Internal Feature
  shape_Area ,heatedarea,fullbaths,bedrooms , units,
  #Tract Feature
  pctRenterOccupied , pctWhite ,MedHHInc ,test.elementary,crime_sum,PctLongCommute,Pctgrocery,isin,
  #Amenities Feature
  college_nn5,library_nn2,hist_properties_nn5,capital_improve_nn2
                  ),
                  method = "lm", 
                  trControl = fitControl, 
                  na.action = na.pass)

#resample<-model_set.train.cv$resample

## show RMSE, MAE, R^2
kable(model_set.train.cv$resample,
          caption = 'Table 4.4.1 Cross-validation Test: Summary of RMSE, R Squared and MAE') %>%
  kable_styling("striped", full_width = F)
Table 4.4.1 Cross-validation Test: Summary of RMSE, R Squared and MAE
RMSE Rsquared MAE Resample
80912.91 0.7522926 61988.09 Fold001
121436.00 0.7463505 74018.64 Fold002
84759.52 0.8062709 58258.07 Fold003
106158.15 0.6756605 70056.23 Fold004
77961.65 0.7357602 57834.81 Fold005
73482.07 0.6995119 60563.83 Fold006
90359.49 0.8297619 61888.08 Fold007
123246.61 0.6600689 72162.88 Fold008
89536.26 0.7077044 68944.49 Fold009
79729.21 0.8622709 60546.61 Fold010
105376.36 0.8211884 74854.02 Fold011
85153.64 0.7593606 62359.78 Fold012
138457.91 0.7389223 76872.49 Fold013
117106.38 0.6930716 68160.00 Fold014
77425.56 0.7984826 58554.28 Fold015
81324.56 0.7790243 65255.28 Fold016
76080.88 0.7718194 59405.70 Fold017
77030.65 0.8606175 56063.12 Fold018
82696.65 0.6776215 61610.77 Fold019
74572.40 0.8444962 59572.49 Fold020
96769.75 0.7429385 74764.35 Fold021
81408.27 0.7492455 59731.98 Fold022
72124.31 0.7617940 57996.25 Fold023
93899.61 0.7655386 68447.70 Fold024
92350.77 0.7479819 69831.77 Fold025
79315.84 0.7646417 61775.05 Fold026
88734.00 0.6529469 68161.92 Fold027
85941.28 0.6139372 63653.07 Fold028
84634.52 0.7608822 61516.12 Fold029
157107.57 0.5184221 74765.82 Fold030
95249.51 0.7892243 69531.25 Fold031
229790.59 0.3267036 84076.76 Fold032
111974.08 0.8990451 71539.04 Fold033
75757.38 0.7395878 59932.93 Fold034
81072.31 0.7906971 60444.22 Fold035
81416.65 0.7585552 64237.95 Fold036
82397.53 0.7850780 60302.60 Fold037
76614.44 0.8437298 59364.15 Fold038
77155.37 0.6453661 61788.12 Fold039
88923.53 0.7384771 67362.11 Fold040
77696.50 0.6399515 57394.28 Fold041
96197.80 0.7283474 68342.53 Fold042
85975.44 0.7235721 66455.44 Fold043
82635.11 0.7420236 66598.59 Fold044
191299.54 0.8068155 76287.26 Fold045
77791.53 0.7771961 62149.63 Fold046
84071.75 0.7720400 55966.00 Fold047
128838.55 0.8780460 71954.15 Fold048
117833.10 0.6207811 77862.55 Fold049
88806.16 0.8260653 64398.22 Fold050
222750.62 0.3160261 93176.26 Fold051
67587.63 0.8609748 55261.32 Fold052
131942.65 0.5363726 67728.07 Fold053
78807.51 0.6802213 63149.29 Fold054
111583.45 0.7117481 71604.75 Fold055
75843.45 0.7808069 62636.57 Fold056
79765.84 0.8380014 52958.86 Fold057
152709.24 0.3062003 69524.11 Fold058
91941.43 0.6870728 66069.61 Fold059
83946.55 0.9090477 59549.83 Fold060
82260.87 0.8998816 54390.27 Fold061
94727.41 0.7267390 71415.28 Fold062
74954.90 0.7525417 59888.64 Fold063
84846.60 0.7388638 60976.16 Fold064
87286.07 0.6511729 61133.68 Fold065
73718.70 0.6295785 57520.68 Fold066
100651.76 0.7853676 67289.36 Fold067
92009.72 0.8420283 60763.85 Fold068
83402.51 0.6873405 63908.65 Fold069
78725.59 0.7958884 62264.96 Fold070
82053.26 0.7329766 62662.11 Fold071
117373.73 0.7646259 65532.60 Fold072
117038.68 0.8892069 68011.10 Fold073
78187.35 0.7468970 59884.77 Fold074
127063.38 0.8246603 77283.72 Fold075
82866.05 0.7934121 63257.25 Fold076
80706.16 0.7181297 64232.99 Fold077
100171.13 0.6949291 69780.16 Fold078
78924.85 0.7310373 60734.92 Fold079
209152.19 0.2315615 78269.54 Fold080
80822.99 0.7194550 59160.08 Fold081
93313.42 0.8006180 63157.04 Fold082
80407.82 0.6689253 65451.87 Fold083
81344.46 0.8013644 60615.17 Fold084
92029.97 0.7730008 66479.96 Fold085
93015.49 0.6081515 66038.53 Fold086
111518.41 0.7668593 74610.24 Fold087
88582.27 0.7614341 61794.83 Fold088
90290.17 0.7626602 66784.21 Fold089
90975.72 0.8030460 65394.76 Fold090
95298.47 0.8801259 64869.53 Fold091
100044.48 0.5777809 67635.81 Fold092
101148.57 0.7557638 72189.06 Fold093
82931.81 0.7227645 63470.03 Fold094
96950.00 0.5679764 65768.87 Fold095
97109.87 0.6834293 69545.44 Fold096
88167.20 0.7719486 58542.91 Fold097
73682.37 0.6315007 56274.40 Fold098
72719.75 0.7735177 57527.89 Fold099
127925.53 0.7875308 78286.84 Fold100
model_set.train.cv$resample %>% 
  pivot_longer(-Resample) %>% 
  mutate(name = as.factor(name)) %>% 
  ggplot(., aes(x = name, y = value, color = name)) +
  geom_jitter(width = 0.1) +
  facet_wrap(~name, ncol = 3, scales = "free") +
  theme_bw() +
  theme(legend.position = "none") +
  labs(title = 'Cross-validation Test: Distribution of MAE, RMSE, R Squared\n',
       caption = "Figure DATA 4.4.1") +
  plotTheme()

ggplot(data = model_set.train.cv$resample) +
  geom_histogram(aes(x = model_set.train.cv$resample$MAE), fill = '#fecc5c',color="white",bins=30) +
  labs(title="Distribution of Cross-validation MAE",
       subtitle = "K = 100\n",
       caption = "Figure DATA 4.4.2") +
  xlab('MAE of Training Model') +
  ylab('Count') + plotTheme()

4.5 Plot Predicted Prices

The Figure 5.5.1 below shows the predicted prices as a function of observed price. This plot indicates that our model can predict price but not so perfectly since the regression line (shown in dark-blue) and the reference line (shown in dark-red) are not completely overlapped.

# 5. Plot predicted prices as a function of observed prices
model_set.train0<-model_set[inTrain,] %>% filter(price<=3000000)
preds.train <- data.frame(pred   = predict(training_model, model_set.train0),
                          actual = model_set.train0$price,
                          source = "Training Set")
preds.test  <- data.frame(pred   = predict(training_model, train_model_set.test),
                          actual = train_model_set.test$price,
                          source = "Test Set")
preds <- rbind(preds.train, preds.test)

## Overall Plotting
ggplot(preds.test, aes(x = actual, y = pred, color = source)) +
  geom_point(color='#abd9e9') +
  geom_smooth(method = "lm", color = "darkblue") +
  geom_abline(color = "darkred") +
  coord_equal() +
  theme_bw() +
  labs(title = "Predicted Prices as a\nFunction of Observed Prices",
       subtitle = 'Blue line is model regression line\nwhile red line is reference line\n',
       caption = 'Figure DATA 4.5.1',
       x = "Observed Prices",
       y = "Predicted Prices") +
  theme(
    legend.position = "none"
  ) +
  plotTheme()

vs.plot<- ggplot(preds, aes(x = actual, y = pred,color=source)) +
  geom_point() +
  geom_smooth(method = "lm", color = "darkblue") +
  geom_abline(color = "darkred") +
  coord_equal() +
  theme_bw() +
  facet_wrap(~source, ncol = 2) +
  labs(title = "Predicted Prices as a Function of Observed Prices, by Sets",
       subtitle = 'Blue line is model regression line while red line is reference line\n',
       caption = 'Figure DATA 4.5.2',
       x = "Observed Prices",
       y = "Predicted Prices") +
  scale_color_manual(values = c('#abd9e9', '#fecc5c'))+
  theme(
    legend.position = "none"
  ) +
  plotTheme()
vs.plot + geom_vline(xintercept=3000000, color='orange') + 
  annotate("text", x = 3300000, y = 2000000, label = "Traning set\n upperbound",size=3)

4.6 Map of Test Set Residuals

In this section, we leveraged three different tools to examine if prediction errors display certain spatial patterns, and accordingly to determine whether we need to add more spatial features in the existing model.

  1. Mapping residuals: In the map of the residuals on the test set, we can see there is no clear pattern of the error directions.

  2. Plotting spatial lags: A spatial lag is a variable that averages the neighboring values of a location. The spatial lag of error plot shows that as home price errors increase, the nearby home price errors slightly increase, indicating that our model errors are rarely spatially autocorrelated.

  3. Moran’s I: Another approach for measuring spatial autocorrelation is Moran’s I. In this model Moran’s I is around 0.18, indicating a certain level of clustering on model errors. The frequency of all 999 randomly permuted I were plotted as a histogram with the Observed I indicated by the orange line (see Figure 4.6.3). That the observed I is higher than all of the 999 randomly generated I’s provides visual confirmation of spatial autocorrelation despite being slight. This suggests that variation in price is likely related to the spatial process has been omitted from our current model.

Therefore, we thought it was worthy to include neighborhood features in the model for the further test.

ggplot()+
  geom_sf(data=st_union(tract),fill="white",color='grey90')+
  geom_sf(data=train_model_set.test,aes(color=q5(price.Error)),show.legend = "point",size=1.2)+
  scale_color_manual(values=palette5,labels = qBr(train_model_set.test,'price.Error'),
                     name = "Residuals")+
  labs(title = 'Map of residuals on test set\n',subtitle = "Mecklenburg County, NC",
       caption = 'Figure 4.6.1') +
  mapTheme()

abs<-train_model_set.test %>% filter(is.na(price.AbsError)==F)
coords.test<-st_coordinates(abs)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")

abs$lagPrice.Error <- lag.listw(spatialWeights.test, abs$price.Error,NAOK = TRUE)

ggplot(abs, aes(x = lagPrice.Error, y = price.Error)) +
  geom_point(colour = '#abd9e9') +
  geom_smooth(method = "lm", se = FALSE, colour ='orange') +
  labs(title = "Error on Test Set as a function of the Spatial Lag of Price\n",
       caption = "Figure 4.6.2",
       x = "Spatial lag of errors (Mean error of 5 nearest neighbors)",
       y = "Errors of Sale Price") +
  plotTheme()

moranTest <- moran.mc(abs$price.AbsError, 
                      spatialWeights.test, nsim = 999)

moran.plot <- ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "orange",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange, I = 0.18",
       x="Moran's I",
       y="Count",
       caption="Figure4.6.3") +
  plotTheme()

moran.plot

4.7 Neighborhood Effects Model

Existing findings have indicated that neighborhood could also be a potential feature for model building. Hence, this section built a new model called Neighborhood Effects model based on the baseline model by adding neighborhood feature zip.

train_set.3<-model_set.train %>% filter(price<=1100000) %>% 
  dplyr::select(
  price,
  #Non-numeric Feature
  bldggrade.m,
  #Internal Feature
  shape_Area ,heatedarea,fullbaths,bedrooms , units,
  #Tract Feature
  pctRenterOccupied , pctWhite ,MedHHInc ,test.elementary,crime_sum,PctLongCommute,Pctgrocery,zip,isin,
  #Amenities Feature
  college_nn5,library_nn2,hist_properties_nn5,capital_improve_nn2) %>% st_drop_geometry()

Nhood <- lm(price ~ ., data = train_set.3)

stargazer(Nhood, type = "html", 
          title = "Table 4.7.1 Summary Statistics of Neighborhood Effects Model ",
          header = FALSE,
          single.row = TRUE)
Table 4.7.1 Summary Statistics of Neighborhood Effects Model
Dependent variable:
price
bldggrade.mCUSTOM 26,376.520 (38,667.860)
bldggrade.mELSE -31,650.480*** (4,282.240)
bldggrade.mEXCELLENT 231,077.900*** (8,117.218)
bldggrade.mGOOD 71,902.980*** (1,633.686)
bldggrade.mVERY GOOD 178,487.400*** (2,678.114)
shape_Area 0.548*** (0.026)
heatedarea 82.809*** (1.119)
fullbaths 19,921.830*** (1,154.489)
bedrooms -5,483.186*** (782.413)
units -43,767.990*** (2,808.664)
pctRenterOccupied 20,847.170*** (3,485.965)
pctWhite 66,434.720*** (4,796.226)
MedHHInc 0.477*** (0.032)
test.elementary 774.515*** (60.706)
crime_sum -166.230*** (18.298)
PctLongCommute -442.497*** (64.490)
Pctgrocery 155.427*** (28.238)
zip28031 78,749.000** (36,398.030)
zip28036 13,920.070 (36,363.820)
zip28078 4,354.406 (36,080.780)
zip28104 116,992.300*** (41,153.540)
zip28105 62,712.940* (36,552.890)
zip28107 40,377.610 (48,779.050)
zip28134 91,440.900** (36,903.350)
zip28202 244,731.500*** (46,631.940)
zip28203 237,280.200*** (37,020.310)
zip28204 206,283.400*** (37,722.830)
zip28205 157,953.300*** (36,437.010)
zip28206 100,655.400*** (37,004.960)
zip28207 233,806.900*** (38,290.860)
zip28208 84,233.620** (36,601.070)
zip28209 202,412.800*** (36,675.530)
zip28210 118,848.300*** (36,575.750)
zip28211 144,037.000*** (36,523.230)
zip28212 63,046.490* (36,583.670)
zip28213 51,518.250 (36,774.660)
zip28214 66,615.310* (36,485.170)
zip28215 74,792.990** (36,526.410)
zip28216 59,080.300 (36,311.460)
zip28217 41,638.530 (36,971.960)
zip28226 75,541.000** (36,581.770)
zip28227 66,664.740* (36,504.830)
zip28262 31,160.350 (36,683.380)
zip28269 38,916.500 (36,223.850)
zip28270 76,740.100** (36,547.930)
zip28273 88,377.940** (36,852.600)
zip28277 114,149.600*** (36,627.970)
zip28278 122,936.900*** (36,831.400)
isin -4,633.993** (2,169.319)
college_nn5 -148,562.200** (61,725.460)
library_nn2 -536,522.900*** (47,126.170)
hist_properties_nn5 -633,455.000*** (45,726.650)
capital_improve_nn2 995,714.400*** (66,391.810)
Constant 114,512.000*** (36,962.150)
Observations 32,574
R2 0.715
Adjusted R2 0.714
Residual Std. Error 101,761.100 (df = 32520)
F Statistic 1,536.509*** (df = 53; 32520)
Note: p<0.1; p<0.05; p<0.01

The Mean Absolute Errors (MAE) and Mean Absolute Percent Errors (MAPE) are statistical measures of how accurate a forecast system is.Based on the following table, the MAE is 65032.58, the errors are accounted for 18.24%. The Neighborhood Effects Model is slightly better than the Baseline Regression Model.

Nhood_model.test<-model_set.test %>% 
  mutate(Regression = "Neighborhood Effects",
         price.Predict = predict(Nhood, newdata = model_set.test),
         price.Error = price - price.Predict,
         price.AbsError = abs(price - price.Predict),
         price.APE = (abs(price - price.Predict)) / price) 
#%>% filter(price<=1500000)

test_GoodnessFit2<-as.data.frame(cbind(mean(Nhood_model.test$price.AbsError,na.rm = T),
                                      mean(Nhood_model.test$price.APE,na.rm = T)))

colnames(test_GoodnessFit2)<-c("Mean Absolute Errors (MAE)","MAPE")
kable(test_GoodnessFit2,
      caption = 'Table 4.7.2 MAE and MAPE for Neighborhood Effects') %>%
  kable_styling("striped", full_width = F)
Table 4.7.2 MAE and MAPE for Neighborhood Effects
Mean Absolute Errors (MAE) MAPE
64826.58 0.1808676

4.8 Map of Predicted Prices on the Whole Dataset

Applying the new Neighborhood Effects model on the whole dataset, the distribution of sales price is shown in the map below. Sale prices are generally higher in south and north Mecklenburg County. There is no significant difference between the pattern of the distribution of price predictions in the two datasets (CHALLENGE & MODELLING).

# $392500: Median house value in Charlotte, NC, 2022 August.
data$Predicted.Price<- predict(Nhood, newdata = data)
data1<-data %>% mutate(Predicted.Price=ifelse(is.na(Predicted.Price),392500,Predicted.Price))

ggplot() + 
  geom_sf(data = st_union(tract),fill = 'white', color='gray90')+
  geom_sf(data = data1,aes(color = q5(Predicted.Price)),size = .75) +
  scale_color_manual(values = palette5,
                     labels = qBr(data,'Predicted.Price'),
                     name = "Price") +
  facet_wrap(~toPredict) +
  labs(title = 'Map of Predicted Sale Price\n',
       subtitle = 'Mecklenburg, NC',
       caption = 'Figure 4.8') +
  mapTheme() 

4.9 Map of MAPE by Neighborhood

Indicated in the summary table as well as the choropleth map below, MAPE varies across neighborhoods (only 31 neighborhoods left in the test set).

nhood_sum <- Nhood_model.test %>% 
  group_by(zip) %>%
  summarize(meanPrice = mean(price, na.rm = T),
            meanPrediction = mean(price.Predict, na.rm = T),
            meanMAE = mean(price.AbsError, na.rm = T),
            meanMAPE = mean(price.APE, na.rm = T)) 

nhood_sum %>%
  st_drop_geometry ()%>%
  arrange(desc(meanMAPE)) %>% 
  kable(caption = 'Table 4.9.1 Map of MAPE by Neighborhood') %>% 
  kable_styling("striped", full_width = F)
Table 4.9.1 Map of MAPE by Neighborhood
zip meanPrice meanPrediction meanMAE meanMAPE
28206 302294.6 306900.2 65613.83 0.2898917
28217 268027.3 256764.5 62816.12 0.2834093
28202 1075000.0 781470.2 293529.85 0.2730510
28208 280820.1 276215.9 63010.26 0.2622166
28031 454742.3 474134.7 102210.32 0.2481083
28209 574234.4 568169.0 136564.86 0.2417120
28207 1380000.0 1010612.8 399460.15 0.2359640
28036 522035.3 522523.0 117511.25 0.2347435
28211 698479.6 626548.7 166087.37 0.2287594
28205 430583.1 441666.0 76832.06 0.2252741
28203 678025.6 693575.3 117119.47 0.2078613
28216 301299.0 299101.9 56398.98 0.2011882
28213 293816.8 294920.6 54584.41 0.1996086
28227 337765.6 333025.3 60205.70 0.1949460
28262 322175.6 313407.7 66999.60 0.1944832
28204 834266.7 756309.9 153203.80 0.1851236
28212 280618.4 277305.3 49801.35 0.1840066
28269 310648.3 305612.5 51840.69 0.1729488
28214 296309.4 294868.0 47349.03 0.1706755
28226 527570.8 503374.5 89843.24 0.1619056
28278 404804.8 414181.5 62373.07 0.1612400
28273 318289.1 319602.1 46263.07 0.1585278
28270 501356.1 503340.1 76674.33 0.1575808
28078 414550.2 423749.8 63269.58 0.1572238
28210 506077.4 482239.1 80324.70 0.1545924
28215 297685.4 299806.0 44688.39 0.1541592
28134 366493.8 353847.9 54851.67 0.1526134
28105 386259.7 377921.8 53941.96 0.1479396
28277 523426.3 511843.4 78758.56 0.1477522
28104 408500.0 424366.1 30888.99 0.0918594
NA 490000.0 NaN NaN NaN

The following Figure 4.9.2 shows the difference between the Baseline Regression model and Neighborhood Effects model on the mean MAPE on the test set by neighborhood, and we can see that the latter model performs better. Besides, in the test set, both model exhibited larger MAPE in Downtown Charlotte.

pal<-colorRampPalette(c("#fff7fb","#ece2f0","#d0d1e6","#a6bddb","#67a9cf","#3690c0","#02818a","#016c59","#014636"))
pal_1<-pal(90)

bothRegression<-bothRegression<-rbind(train_model_set.test,Nhood_model.test)

st_drop_geometry(bothRegression) %>%
  group_by(Regression, zip) %>%
  summarize(mean.MAPE = mean(price.APE, na.rm = T)) %>%
  ungroup() %>% 
  left_join(zipcode) %>%
    st_sf() %>%
    ggplot() + 
      geom_sf(aes(fill = mean.MAPE),color="white") +
      facet_wrap(~Regression) +
      scale_fill_gradientn(colors=palette2,
                          name = "MAPE") +
      labs(title = "Mean test set MAPE by neighborhood",subtitle = "Mean test set MAPE by neighborhood",caption = "Figure 4.9.1") +mapTheme()

4.10 Plot MAPE by Neighborhood

This section plotted MAPE by neighborhood as a function of mean price by neighborhood. Neighborhoods with mean sale price below or above 250,000 have higher MAPE, it has the lowest MAPE around 250,000. This finding indicates that most errors occurred in the neighborhoods where house price was relatively lower or relatively higher.

ggplot()+
  geom_point(data=nhood_sum,aes(x=meanPrice,y=meanMAPE), color='#abd9e9')+
  stat_smooth(data=nhood_sum,aes(x = meanPrice, y = meanMAPE), method = "loess", se = FALSE, size = 1, colour="orange") + 
  labs(title = "MAPE by Neighborhood as a Function of\nMean Price by Neighborhood\n",
       caption = 'Figure DATA 4.10') +
  xlab('Mean Price by Neighborhood') +
  ylab('Mean MAPE by Neighborhood') +
  plotTheme()

4.11 Generalizability Examination

This section gathered Census data to test how well baseline model generalizes to different group contexts, such as race group and income level.

4.11.1 Context Group Building

we selected four census variables to display the spatial pattern of each group: * Majority White vs. Majority Non-White: divided whether the proportion of white population is accounted for more than 50% of total population or not. * High Income vs. Low Income: divided whether the median household income in certain census tract is above or below the average level.

census<-model_set.test %>%   
  mutate(raceContext = ifelse(pctWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(MedHHInc > mean(MedHHInc,na.rm = T), "High Income", "Low Income"),
        vacantContext = ifelse(pctTotalVacant > .5, "Majority Vacant", "Majority Non-Vacant"),
        pctRenterContext = ifelse(pctPoverty > .5, "Majority Renter Occupied", "Majority Non-Renter Occupied")) %>%dplyr::select(raceContext,incomeContext,vacantContext,pctRenterContext)

census<-census %>% st_transform(crs=4326)

grid.arrange(ncol = 2,
             ggplot() + 
               geom_sf(data = st_union(tract), fill = 'white',color='grey90') + 
               geom_sf(data = na.omit(census), aes(color = raceContext),
                       show.legend = "point", size= .75) +
               scale_color_manual(values = c('#abd9e9', '#fecc5c'), name="Race Context") +
               labs(title = "Race Context of House\n",
                    caption = 'Figure 4.11.1') +
               mapTheme() + 
               theme(legend.position="bottom"),
             
             ggplot() + 
               geom_sf(data = st_union(tract), fill = 'white',color='grey90') +
               geom_sf(data = na.omit(census), aes(color = incomeContext),
                       show.legend = "point", size= .75) +
               scale_color_manual(values = c('#abd9e9', '#fecc5c'), name="Income Context") +
               labs(title = "Income Context of House\n",
                    caption = 'Figure 4.11.2') +
               mapTheme() + 
               theme(legend.position="bottom"))

4.11.2 Model Modifying with Neighborhood Effects

MAE, and MAPE of both regression models are shown below. The finding was not intuitive that the new model performed a little better in terms of MAE and MAPE.

Hence, it is imperative to further examine how well both models could be generalized in race context and income level context, and then to determine which model should be remained.

bothRegression<-rbind(train_model_set.test,Nhood_model.test) %>% dplyr::select(Regression,price.AbsError,price.APE)

st_drop_geometry(bothRegression) %>% gather(Variable,Value,-Regression) %>%  group_by(Regression, Variable) %>%
    summarize(meanValue = mean(Value, na.rm = T)) %>%
    spread(Variable, meanValue) %>%
    kable(caption = 'Table 4.11.1\nSummary of MAE, and MAPE by Regression Models') %>% kable_styling("striped", full_width = F)
Table 4.11.1 Summary of MAE, and MAPE by Regression Models
Regression price.AbsError price.APE
Baseline Regression 67903.84 0.1920308
Neighborhood Effects 64826.58 0.1808676

4.11.3 Generalizability on Context Groups

Considering context groups, our new model with neighborhood features performs better in terms of MAPE and MAE.

The Neighborhood Effects model generalizes well on both context groups with close MAPE to the one generated on the test group.

#scales::percent
bothRegression<-rbind(train_model_set.test,Nhood_model.test)

race.group <- st_join(bothRegression, census) %>% 
  group_by(Regression, raceContext) %>%
  summarize(mean.MAPE = (mean(price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(raceContext, mean.MAPE) %>%
  kable(caption = "Table 4.11.2\nTest set MAPE by Neighborhood Racial Context") %>% 
  kable_styling("striped", full_width = F)

race.group
Table 4.11.2 Test set MAPE by Neighborhood Racial Context
Regression Majority Non-White Majority White
Baseline Regression 0.1987990 0.1849117
Neighborhood Effects 0.1851025 0.1764149
#scales::percent
income.group <- st_join(bothRegression, census) %>% 
  group_by(Regression, incomeContext) %>%
  summarize(mean.MAPE = (mean(price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Table 4.11.3\nTest set MAPE by Neighborhood Income Context") %>% 
  kable_styling("striped", full_width = F)

income.group
Table 4.11.3 Test set MAPE by Neighborhood Income Context
Regression High Income Low Income
Baseline Regression 0.1764773 0.2057311
Neighborhood Effects 0.1668223 0.1932346

4.11.4 Final Model

Eventually, we settled down with a model built with the Neighborhood Effects Model, and conducted prediction on the secret data set. Summary statistics of predicted sale price is attached below.

MODEL<-Nhood

challenge_set.predict<-challenge_set %>% 
  mutate(challenge.predict = predict(MODEL, newdata = challenge_set)) %>% 
  mutate(challenge.predict=ifelse(is.na(challenge.predict),515144,challenge.predict)) %>% dplyr::select(musaID,challenge.predict)

write.csv(challenge_set.predict,"pred_challenge_set.csv")

V.Discussion

Accuracy and Generalizability: Overall, the predictive model is relatively effective in terms of predicting sale price in the Mecklenburg County, NC. The model captured around 70% of variation in prices we were required to predict. For the sake of accuracy, the MAE on the test set is around 64K with 18.08% MAPE. As for the generalizability, MAPEs are between 16% and 19% considering the context groups to which the model was applied. A potential weakness of the model should be addressed is if the generalizability test result was overestimated or manipulated since proportion of white population and median household income were both features in the model.

Feature Effect: The baseline model contained the following features.

  • 1 non-numeric feature: bldggrade.m;
  • 5 internal characteristic features: shape_Area,heatedarea,fullbaths,bedrooms, units;
  • 9 tract features: pctRenterOccupied,pctWhite,MedHHInc,test.elementary, crime_sum,PctLongCommute,Pctgrocery,isin,zip;
  • 4 amenity features: college_nn5,library_nn2,hist_properties_nn5,capital_improve_nn2.

The highlight of this model is including whether the house is in Business Investment Opportunity Zones (isin) as one of the features to predict sale price. Since Business Investment Opportunity Zones displays areas that are within a 1/4 mile of a commercial property that has a value 70% the County average and a median household income of <= 55K in Mecklenburg County, we regarded it as an interesting addition to display the spatial environment the house unit is located at.

Considering the neighborhood effects, the geographical feature finally added into the model is the zipcode(zip) instead of census tract(GEOID). The reason is that some indicators of house price (e.g. crime, transit), need scale to exist. Blocks in zipcode is slightly larger than that of census tract, therefore we think it may better help explain why the price of houses cluster.

The accessibility of education resources is an easy-ignored but powerful feature to show how amenities affect the local house price. Rather than only focus on the entertainment demand, we cared more about if the local residents’requirements on education have been refilled. So we added some education-related features, such as test.elementary, and distance to libraries(knn). Evidence has indicated that such a consideration is worthy.

Additionally, although we included far more amenity features in the initial stage, such as the proximity to churches, historical cemeteries, and the proximity to medical facilities, those features were eventually excluded from the model based on the series of rigorous model evaluation. However, we still believe it is worthy taking those features into consideration and they might be helpful when being used in other regression models.

Errors and Spatial Variations: Generally speaking, larger MAEs occurred in the Downtown Charlotte and north region of Mecklenburg County, while the model performed better in the rest of areas. One guess is potential spatial features that are directly related to the two areas have been somehow missed out. But we still feel confident about the model based on the examinations above.

VI. Conclusion

If all the model examination and interpretation have been conducted correctly as shown above, we would like to recommend the Neighborhood Effects model to Zillow as it incorporates some Mecklenburg County’s local intelligent features that were absent in the previous models and performed comparatively well on either accuracy or generalizability.

To further reduce the error rate, more amenity features such as local restaurants and bars (data unavailable so far), financial related agencies, public transportation and spatial features such as zoning should be taken into consideration. And more context studies on downtown Charlotte and north region are also highly needed since we would like to extract some similarities between the two regions that would be the magic ingredients of cooking models in the future.