Mecklenburg County is a county located in the southwestern region of the state of North Carolina, in the United States. As of the 2020 census, the population was 1,115,482, making it the second-most populous county in North Carolina. Its county seat is Charlotte, the state’s largest city. Between 2004 and 2014, Charlotte was ranked as the country’s fastest-growing metro area, with 888,000 new residents.Based on U.S. Census data from 2005 to 2015, Charlotte tops the U.S. in millennial population growth, which indicates a growing demand in the housing market for the next decade. From this point of view, it is important to make a model that is adapted to local indicators and can accurately predict house prices.
In August 2022, Mecklenburg County home prices were up 15.1% compared to last year, selling for a median price of $420K. There were 1,723 homes sold in August this year, down from 2,131 last year. Across the nation, 0.96% of homebuyers searched to move into Charlotte from outside metros. 76% of Charlotte homebuyers searched to stay within the Charlotte metropolitan area.
We can see that even though the up-going price holds people from buying houses a little bit, the demand for homes is still robust. Homebuyers in Mecklenburg County mainly aim to relocate and stay in the metropolitan area. This group of buyers, who are already familiar with the different local resources in Charlotte, will be very conscious of the availability of various public amenities in the area and the quality of the home. we bring in the amenities feature and the building features from their perspectives. There are also immigrants from big cities like New York willing to move to Charlotte. This group of people might consider the basic context of the region. Therefore, census tract data are also important to analysis.
We also refer to Zillow and Redfin’s analysis of the local market, fully consider the price stratification of the house price market as well as the the property itself. Different from zillow and redfin’s predicted valuation, our model focuses on the prediction specific on the local market. This forecast will not consider the impact of time series changes on prices, due to the already clear challenge set as well as time nodes. We will also discard the top houses as training data based on the location of the forecast points inferring the price stratum to which they belong.
In this section, we loaded necessary libraries, created plot theme options and map theme options, and identified functions of quantile breaks, and average nearest neighbor distance for further analysis.
# You can set some global options for knitting chunks
::opts_chunk$set(echo = TRUE)
knitr
# Load some libraries
library(tidyverse)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot) # plot correlation plot
library(corrr) # another way to plot correlation plot
library(kableExtra)
library(jtools) # for regression model plots
library(ggstance) # to support jtools plots
library(ggpubr) # plotting R^2 value on ggplot point scatter
library(broom.mixed) # needed for effects plots
library(stargazer)
library(tidycensus)
options(scipen = 999)
options(tigris_class = "sf")
# functions and data directory
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
<- c('#ffffd4','#fed98e','#fe9929','#d95f0e','#993404')
palette3 <- c('#4575b4', '#fecc5c')
palette4 <-c('#fc8d59','#fee090','#ffffbf','#e0f3f8','#91bfdb')
palette5<-c ('#eff3ff','#bdd7e7','#6baed6','#3182bd','#08519c')
palette2
<- function(base_size = 12, title_size = 16) {
plotTheme theme(
text = element_text( color = "black"),
plot.title = element_text(size = title_size, colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.minor = element_line("grey90", size = 0.01),
panel.grid.major = element_line("grey90", size = 0.01),
panel.border = element_rect(colour = "white" , fill=NA, size=0),
strip.background = element_rect(color = "white",fill='white'),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
<- function(base_size = 12, title_size = 16) {
mapTheme theme(
text = element_text( color = "black"),
plot.title = element_text(size = title_size,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
strip.text.x = element_text(size = 14),
strip.background = element_rect(fill = "white"))
}setwd("~/Github/508_houseprice_prediction")
census_api_key("c1c749a5ad57fa58fe19e1403d894e9acef41a97", install=TRUE,overwrite = TRUE)
In this section, we loaded Mecklenburg data and other complementary datasets, conducted feature selection and engineering. Besides, for analysis, we separate the whole dataset into challenge_set and model_set in advance.
Seven tract-related datasets were used in this project:
Long_commute: Percentage of auto commuters traveling 20 minutes or more to work
Crime: crime violent incidences in tracts.
Pctgrocery: Percentage of housing units within ½-mile of a full-service, chain grocery store.
Pcttest1\2\3: Percentage of Charlotte-Mecklenburg Schools (CMS) students in grades 3-5 6-8 9-12 proficient in End of Grade testing
Bachelor: Percent of Bachelor’s Degree
Age: Median Age of Residents
Census Data: demographic variables from the ACS 2020 for census tracts in Mecklenburg County. We specifically selected 8 variables in the analysis:
TotalPop
: Total population in each census tractWhites
: White population in each census tractMedHHInc
: Median household income in each census
tractMedRent
: Median Rent for properties in each census
tractTotalPoverty
: Population living under the level of
poverty in each census tractTotalVacant
: Total vacant unit in each census
tractTotalUnit
: Total housing units in each census
tractRenterOccpied
: Total householder lived in
renter-occupied housing unitsWith variables above, we created the four features to be used in the model building.
pctWhite
: white population proportion in each census
tractpctPoverty
: Poverty population proportion in each
census tractpctTotalVacant
: Vacant unites proportion in each census
tractpctRenterOccupied
: Proportion of householder living in
renter-occupied housing units in each census tractAmenity Data: compilatory data chosen for describing amenities and public service of Mecklenburg County. The datasets we chose encompasses Churches, Colleges and Universities, Libraries, Medical Facilities, Business Investment Opportunity Zones (Polygon),parks, daycare. See each dataset below:
Churches: Churches dataset defines locations classified as religious organizations in Mecklenburg County, data is queried from Master Address Table..
Colleges and Universities: Dataset contains public and private colleges and universities in Mecklenburg County.
Capital Improvement Projects: Dataset identifies capital improvement projects for; Central Piedmont Community College, Charlotte-Mecklenburg Schools, Libraries, Park and Recreation, and other government facilities in Mecklenburg County.
Daycare: Dataset provides NC DHHS licensed daycare locations in Mecklenburg County.
Historic Cemeteries: This dataset represents historic and abandoned burial locations in Mecklenburg County, NC.
Historic Property National: This dataset represents nationally-designated historic properties in Mecklenburg County, NC.
Libraries: A point coverage of the Public Libraries of Charlotte-Mecklenburg, North Carolina system.
Medical Facilities: This dataset was generated to publish a geospatial inventory of Medical Facilities in Mecklenburg County, NC. This data serves to support and assist the Charlotte Mecklenburg Planning Department, The City of Charlotte, and others in Urban Planning decisions.
Business Investment Opportunity Zones(Polygon): This dataset displays areas that are within a 1/4 mile of a commercial property that has a value 70% the County average and a median household income of <= 55K in Mecklenburg County.
Park Locations: Mecklenburg County Park and Recreation park location entrance points.
Notice: For the tract-related data, we get the relevant data by spatial joining the house price points with the track, and for the data based on FACILITIES distribution points, we assign values to the house price information points by the knn algorithm.
#Raw Data:
<-st_read("~/Github/508_houseprice_prediction/data/studentData.geojson")
data
#Add age,pricepersq,bathrooms
<-data %>% mutate(AGE = 2022-yearbuilt,
datapricepersq = price/heatedarea,
bathrooms = fullbaths+halfbaths)
#Census Tract:
<-st_read('~/Github/508_houseprice_prediction/data/CensusTract/Census_Tracts_2020.shp')
tract<-tract %>% st_transform(crs=4326)
tract
#Zipcode:
<-st_read('~/Github/508_houseprice_prediction/data/zipcode/Zipcode.shp')
zipcode<-zipcode %>% st_transform(crs=4326) %>% rename(shapeLeng = shape_Leng,shapeArea = shape_Area) %>% dplyr::select(zip,po_name,shapeLeng,shapeArea)
zipcode
# Amenities
#1.college:
<-st_read('./data/college/Colleges.shp')
college<-college %>% st_transform(crs = 4326) %>% dplyr::select(geometry) %>% na.omit()
college.sf#2.daycare:
<-st_read('./data/daycare/Daycare.shp')
daycare<-daycare %>% st_transform(crs=4326)%>% dplyr::select(geometry) %>% na.omit()
daycare.sf
#3.Medical Facilities:
<-st_read('./data/medical/MedicalFacilities.shp')
MedicalFacilities<-MedicalFacilities %>% st_transform(crs=4326)%>% dplyr::select(geometry) %>% na.omit()
MedicalFacilities.sf
#4.library:
<-st_read('./data/library/Library.shp')
library<-library %>% st_transform(crs=4326)%>% dplyr::select(geometry) %>% na.omit()
library.sf
#5.Historical cemeteries:
<-st_read('./data/historic_cemeteries/Historic_Cemeteries.shp')
hist_cemeteries<-hist_cemeteries %>% st_transform(crs=4326)%>% st_centroid() %>% dplyr::select(geometry) %>% na.omit()
hist_cemeteries.sf
#6.Historical Properties:
<-st_read('./data/historic_properties/HistoricProperty_National.shp')
hist_properties<-hist_properties%>% st_transform(crs=4326)%>%st_centroid() %>% dplyr::select(geometry) %>% na.omit()
hist_properties.sf
#7.church
<-st_read('./data/church/Churches.shp')
church<-church %>% st_transform(crs=4326)%>% dplyr::select(geometry) %>% na.omit()
church.sf
#8.park location:
<- st_read('./data/Park_Locations/Park_Locations.shp')
park_location <- park_location %>% st_transform(crs=4326)%>%st_centroid() %>% dplyr::select(geometry) %>% na.omit()
park_location.sf
#9.Capital Improvement Projects:
<-st_read('./data/Capital_Improvement_Projects/Capital_Improvement_Projects.shp')
capital_improve<-capital_improve %>% st_transform(crs=4326)%>% dplyr::select(geometry) %>% na.omit()
capital_improve.sf
#10.Business Investment Opportunity Zones:
<-st_read('./data/Business/Business_Investment_Opportunity_Zones.shp')
business_invest<-business_invest %>% st_transform(crs=4326)
business_invest<-business_invest %>% st_union() %>% st_sf()
business_invest_union<-st_within(data,business_invest_union) %>% lengths>0
isin<-data.frame(isin)
isin<-cbind(data,isin)
data
#spatial features collection
#1.Crime:
<-st_read('./data/Crime - Violent.geojson')
crime<-crime %>% st_transform(crs=4326)
crime<-crime %>% mutate(crime_sum = X2019.raw+X2020.raw+X2021.raw) %>% dplyr::select(crime_sum)
crime1
#2.Long Commute:
<-st_read('./data/LongCommute.geojson')
long_commute<-long_commute %>% st_transform(crs=4326) %>% mutate(PctLongCommute = X2020) %>% dplyr::select(PctLongCommute)
long_commute
#3.Percent of Grocery:
<-st_read('./data/Proximity to a Grocery Store.geojson')
grocery<-grocery %>% st_transform(crs=4326) %>% mutate(Pctgrocery = X2021) %>% dplyr::select(Pctgrocery)
Pctgrocery
#4.Percent of Test Proficiency in Elementary Schools:
<-st_read('./data/Test Proficiency - Elementary School.geojson')
test_elementary<-test_elementary%>% st_transform(crs=4326) %>%mutate( test.elementary = X2019) %>% dplyr::select(test.elementary)
Pcttest1
#5.Percent of Test Proficiency in Middle Schools:
<-st_read('./data/Test Proficiency - Middle School.geojson')
test_middle<-test_middle%>% st_transform(crs=4326) %>%mutate( test.middle = X2019) %>% dplyr::select(test.middle)
Pcttest2
#6.Percent of Test Proficiency in High Schools:
<-st_read('./data/Test Proficiency - High School.geojson')
test_high<-test_high%>% st_transform(crs=4326) %>% mutate(test.high = X2019) %>% dplyr::select(test.high)
Pcttest3
#7.Percent of Bachelor's Degree:
<-st_read('./data/Education Level - Bachelor Degree.geojson')
bachelor<-bachelor %>% st_transform(crs=4326) %>% mutate(PctBachelor = X2020) %>% dplyr::select(PctBachelor)
PctBachelor
#8.Median Age of Residents:
<-st_read('./data/Age of Residents.geojson')
age<-age %>% st_transform(crs=4326) %>% mutate(MedAge = X2020) %>% dplyr::select(MedAge) MedAge
options(tigris_use_cache = TRUE)
.2020 <- load_variables(2020,
acs_variable_list"acs5",
cache = TRUE)
<-
tracts20 get_acs(geography = "tract", variables = c("B25026_001E","B02001_002E", "B19013_001E","B25058_001E","B06012_002E",
"B25002_003E","B25004_002E","B25004_003E","B25004_004E","B25004_005E",
"B25001_001E","B08137_003E"),
year=2020, state= 37, county= 119, geometry=T, output="wide")%>% st_transform(crs=4326) %>%
rename(TotalPop = B25026_001E,
Whites = B02001_002E,
MedHHInc = B19013_001E,
MedRent = B25058_001E,
TotalPoverty = B06012_002E,
TotalVacant = B25002_003E,
ForRent = B25004_002E,
ForRentVac = B25004_003E,
ForSale = B25004_004E,
ForSaleVac = B25004_005E,
TotalUnit = B25001_001E,
RenterOccupied = B08137_003E
%>%
) ::select(-NAME, -starts_with("B")) %>% #-starts_with("B") awesome!
dplyrmutate(pctWhite = ifelse(TotalPop > 0, Whites / TotalPop,0),
pctPoverty = ifelse(TotalPop > 0, TotalPoverty / TotalPop, 0),
pctTotalVacant = ifelse(TotalUnit > 0, TotalVacant / TotalUnit * 100, 0),
TotalOccupied = TotalUnit - TotalVacant,
pctRenterOccupied = ifelse(TotalOccupied >0, RenterOccupied/TotalOccupied, 0)) %>%
::select(-Whites, -TotalPoverty) dplyr
In this section, we created some new features to describe the amenities and public services in Mecklenburg County, as well as to each home sale observation’s internal characteristics.
First of all, we were aimed at creating features to measure the extent to which buyers can assess multiple public services, including Churches, Colleges and Universities, Libraries, Medical Facilities, parks, daycares. To utmost avoid scale bias triggered by setting arbitrary areal unit or fixed buffer distance of each home sale observation, we chose to calculate the average nearest neighbor distance from each home sale to its k nearest neighbor public services. Accordingly, the smaller the value is, the more possibility that the observation can access specific service. To better compare each feature’s impact, we set the k from 3 to 5.
<-st_join(data,crime1)
data<-st_join(data,long_commute)
data<-st_join(data,Pctgrocery)
data<-st_join(data,Pcttest1)
data<-st_join(data,Pcttest2)
data<-st_join(data,Pcttest3)
data<-st_join(data,PctBachelor)
data<-st_join(data,MedAge)
data<-st_join(data,zipcode)
data<-st_join(data,tracts20)
data
<-st_coordinates
st_c
#1.college:
<-
data %>%
data mutate(
college_nn1 = nn_function(st_c(data),st_c(college.sf), 1),
college_nn2 = nn_function(st_c(data),st_c(college.sf), 2),
college_nn3 = nn_function(st_c(data),st_c(college.sf), 3),
college_nn4 = nn_function(st_c(data),st_c(college.sf), 4),
college_nn5 = nn_function(st_c(data),st_c(college.sf), 5),
)
#2.daycare:
<-
data %>%
data mutate(
daycare_nn1 = nn_function(st_c(data),st_c(daycare.sf), 1),
daycare_nn2 = nn_function(st_c(data),st_c(daycare.sf), 2),
daycare_nn3 = nn_function(st_c(data),st_c(daycare.sf), 3),
daycare_nn4 = nn_function(st_c(data),st_c(daycare.sf), 4),
daycare_nn5 = nn_function(st_c(data),st_c(daycare.sf), 5),
)
#3.Medical Facilities:
<-
data %>%
data mutate(
MedicalFacilities_nn1 = nn_function(st_c(data),st_c(MedicalFacilities.sf), 1),
MedicalFacilities_nn2 = nn_function(st_c(data),st_c(MedicalFacilities.sf), 2),
MedicalFacilities_nn3 = nn_function(st_c(data),st_c(MedicalFacilities.sf), 3),
MedicalFacilities_nn4 = nn_function(st_c(data),st_c(MedicalFacilities.sf), 4),
MedicalFacilities_nn5 = nn_function(st_c(data),st_c(MedicalFacilities.sf), 5),
)
#4.library:
<-
data %>%
data mutate(
library_nn1 = nn_function(st_c(data),st_c(library.sf), 1),
library_nn2 = nn_function(st_c(data),st_c(library.sf), 2),
library_nn3 = nn_function(st_c(data),st_c(library.sf), 3),
library_nn4 = nn_function(st_c(data),st_c(library.sf), 4),
library_nn5 = nn_function(st_c(data),st_c(library.sf), 5),
)
#5.Historical cemeteries:
<-
data %>%
data mutate(
hist_cemeteries_nn1 = nn_function(st_c(data),st_c(hist_cemeteries.sf), 1),
hist_cemeteries_nn2 = nn_function(st_c(data),st_c(hist_cemeteries.sf), 2),
hist_cemeteries_nn3 = nn_function(st_c(data),st_c(hist_cemeteries.sf), 3),
hist_cemeteries_nn4 = nn_function(st_c(data),st_c(hist_cemeteries.sf), 4),
hist_cemeteries_nn5 = nn_function(st_c(data),st_c(hist_cemeteries.sf), 5),
)
#6.Historical Properties:
<-
data %>%
data mutate(
hist_properties_nn1 = nn_function(st_c(data),st_c(hist_properties.sf), 1),
hist_properties_nn2 = nn_function(st_c(data),st_c(hist_properties.sf), 2),
hist_properties_nn3 = nn_function(st_c(data),st_c(hist_properties.sf), 3),
hist_properties_nn4 = nn_function(st_c(data),st_c(hist_properties.sf), 4),
hist_properties_nn5 = nn_function(st_c(data),st_c(hist_properties.sf), 5),
)
#7.church
<-
data %>%
data mutate(
church_nn1 = nn_function(st_c(data),st_c(church.sf), 1),
church_nn2 = nn_function(st_c(data),st_c(church.sf), 2),
church_nn3 = nn_function(st_c(data),st_c(church.sf), 3),
church_nn4 = nn_function(st_c(data),st_c(church.sf), 4),
church_nn5 = nn_function(st_c(data),st_c(church.sf), 5),
)
#9.park easement:
<-
data %>%
data mutate(
park_location_nn1 = nn_function(st_c(data),st_c(park_location.sf), 1),
park_location_nn2 = nn_function(st_c(data),st_c(park_location.sf), 2),
park_location_nn3 = nn_function(st_c(data),st_c(park_location.sf), 3),
park_location_nn4 = nn_function(st_c(data),st_c(park_location.sf), 4),
park_location_nn5 = nn_function(st_c(data),st_c(park_location.sf), 5),
)
#10.Capital Improvement Projects:
<-
data %>%
data mutate(
capital_improve_nn1 = nn_function(st_c(data),st_c(capital_improve.sf), 1),
capital_improve_nn2 = nn_function(st_c(data),st_c(capital_improve.sf), 2),
capital_improve_nn3 = nn_function(st_c(data),st_c(capital_improve.sf), 3),
capital_improve_nn4 = nn_function(st_c(data),st_c(capital_improve.sf), 4),
capital_improve_nn5 = nn_function(st_c(data),st_c(capital_improve.sf), 5),
)
Through the histogram comparison we can see that the distribution of housing prices is very uneven and needs to be removed outliers according to certain criteria.
#challenge_set:
<-data %>% filter(price==0)
challenge_set#model_set:
<-data %>% filter(price!=0)
model_set
<- ggplot(model_set)+
plot1geom_histogram(aes(price), position = "dodge",stat = "bin",color = "white",fill= "orange") +
labs(title = "Distribution of prices\n(modelsetwith outlier)\n",caption = 'Figure DATA 2.1.1')+
scale_color_grey() + theme_classic()
# define outliers
= quantile(model_set$price)[2]
Q1 = quantile(model_set$price)[4]
Q3 <- Q1 - 3*(Q3-Q1)
lowerouter <- Q3 + 3*(Q3-Q1)
upperouter
<-data %>% filter(price!=0 & price < upperouter)
model_set<- ggplot(model_set)+
plot2 geom_histogram(aes(price), position = "dodge",stat = "bin",color = "white",fill= "orange") +
labs(title = "Distribution of prices\n(modelsetwithout outlier)\n",caption = 'Figure DATA 2.1.2')+ theme_classic()
grid.arrange(plot1, plot2, ncol=2)
As required, we provided three tables of summary statistics with variable descriptions for sorting features based on their categories below. For better visualization and analysis, features were divided according to their sources and catergories.
Based on the hedonic home price model, features are put into three categories:
Internal Features, features related to the quality of house, like the number of bathrooms;
Amenities Features, like hospital accessibility;
Tract Features, features related to the spatial characteristics, like census tracts data
In total, we took 99 features into consideration in the initial stage. Detailed summary statistics are provided below.
The table and plots show that the difference between houses is very
large, especially in terms of price, area, and age.
Also out of instinct, we might consider
fullbaths
&halfbaths
units
&bedrooms
,
shape_area
&heated area
might correlate
with each other.
#Summary and selection of internal features:
<- c("price",
internal.variable "shape_Area",
"pricepersq",
"AGE",
"heatedarea",
"numfirepla",
"fullbaths",
"halfbaths",
"bedrooms",
"units")
<-model_set[internal.variable] %>%
internal.featuresst_drop_geometry()
#stargazer:
stargazer(internal.features, type = "html",
title = "Table DATA 2.2 Summary Statistics of Internal Characteristics\n(numeric features)\n ",
header = FALSE,
single.row = TRUE)
Statistic | N | Mean | St. Dev. | Min | Max |
price | 44,288 | 406,077.200 | 189,009.900 | 12,000 | 1,182,000 |
shape_Area | 44,288 | 14,878.390 | 21,314.750 | 1,139.637 | 1,032,970.000 |
pricepersq | 44,288 | Inf.000 | 4.464 | Inf.000 | |
AGE | 44,288 | 28.497 | 30.510 | 0 | 2,022 |
heatedarea | 44,288 | 2,260.081 | 897.891 | 0 | 9,183 |
numfirepla | 44,288 | 0.771 | 0.450 | 0 | 5 |
fullbaths | 44,288 | 2.217 | 0.739 | 0 | 8 |
halfbaths | 44,288 | 0.578 | 0.513 | 0 | 4 |
bedrooms | 44,288 | 3.470 | 0.919 | 0 | 65 |
units | 44,288 | 0.976 | 0.195 | 0 | 8 |
We can see that AGE do not have correlation with the
dependent variable, which we will consider to remove in the future. Also
we can notice that heatedarea has a strong positive
influence for price. In addition the relationship of this variable with
other variables is very similar to their relationship with price, which
suggests that it may be an important indicator in the future that can
represent these variables to predict prices. Also we check the
relationship between fullbaths
&halfbaths
units
&bedrooms
,
shape_area
&heated area
. Surprisingly,
fullbaths
and halfbaths
do not have relation
with each other. Bedrooms
and units
have
negative correlation. This suggests that perhaps the distribution of
data in this dataset is not what we would expect by common sense.
#Correlation matrix of internal features:
<-
numericVars select_if(internal.features, is.numeric) %>% na.omit()
#['#d73027','#fc8d59','#fee090','#ffffbf','#e0f3f8','#91bfdb','#4575b4']
ggcorrplot(
round(cor(numericVars), 1),
p.mat = cor_pmat(numericVars),
colors = c('#4575b4', "#ffffff",'#fecc5c'),
type="upper",
insig = "pch") +
labs(title = "Correlation across numeric variables\n(internal features)\n", caption = 'Figure DATA 2.2.1') +
plotTheme()
<- c("price",
sel.internal.list "shape_Area",
"heatedarea",
"numfirepla",
"fullbaths",
"halfbaths",
"bedrooms",
"units")
From the results of the correlation, different amenities features have different values of nn related to price. Park does not have much impact on the overall results and will be removed in the next step of the analysis. capital improvement related variables are meaningful only when n=2. Similarly, college features are related when nn=2. Medical facilities are more meaningful when n>2.
<-model_set %>% dplyr::select(
amenities_feature.liststarts_with('college_'),
starts_with('daycare_'),
starts_with('MedicalFacilities_'),
starts_with('library_'),
starts_with('hist_'),
starts_with('church'),
starts_with('park_'),
starts_with('capital_'),
%>%
price) st_drop_geometry()
#stargazer:
stargazer(amenities_feature.list, type = "html",
title = "Table DATA 2.1 Summary Statistics of Internal Characteristics ",
out.header = TRUE,
single.row = TRUE,out="amenities_feature.htm")
Statistic | N | Mean | St. Dev. | Min | Max |
college_nn1 | 44,288 | 0.052 | 0.028 | 0.001 | 0.172 |
college_nn2 | 44,288 | 0.066 | 0.027 | 0.004 | 0.175 |
college_nn3 | 44,288 | 0.076 | 0.030 | 0.006 | 0.184 |
college_nn4 | 44,288 | 0.085 | 0.033 | 0.006 | 0.191 |
college_nn5 | 44,288 | 0.093 | 0.036 | 0.007 | 0.206 |
daycare_nn1 | 44,288 | 0.010 | 0.008 | 0.00001 | 0.067 |
daycare_nn2 | 44,288 | 0.012 | 0.009 | 0.0004 | 0.082 |
daycare_nn3 | 44,288 | 0.014 | 0.010 | 0.001 | 0.089 |
daycare_nn4 | 44,288 | 0.016 | 0.010 | 0.001 | 0.093 |
daycare_nn5 | 44,288 | 0.017 | 0.011 | 0.001 | 0.096 |
MedicalFacilities_nn1 | 44,288 | 0.018 | 0.014 | 0.0001 | 0.077 |
MedicalFacilities_nn2 | 44,288 | 0.019 | 0.014 | 0.0004 | 0.081 |
MedicalFacilities_nn3 | 44,288 | 0.021 | 0.014 | 0.0004 | 0.085 |
MedicalFacilities_nn4 | 44,288 | 0.022 | 0.014 | 0.0005 | 0.086 |
MedicalFacilities_nn5 | 44,288 | 0.023 | 0.014 | 0.0005 | 0.088 |
library_nn1 | 44,288 | 0.040 | 0.022 | 0.001 | 0.111 |
library_nn2 | 44,288 | 0.055 | 0.025 | 0.003 | 0.138 |
library_nn3 | 44,288 | 0.064 | 0.028 | 0.006 | 0.164 |
library_nn4 | 44,288 | 0.071 | 0.030 | 0.008 | 0.177 |
library_nn5 | 44,288 | 0.078 | 0.032 | 0.010 | 0.191 |
hist_cemeteries_nn1 | 44,288 | 0.016 | 0.008 | 0.0004 | 0.044 |
hist_cemeteries_nn2 | 44,288 | 0.019 | 0.007 | 0.001 | 0.046 |
hist_cemeteries_nn3 | 44,288 | 0.022 | 0.007 | 0.001 | 0.048 |
hist_cemeteries_nn4 | 44,288 | 0.024 | 0.008 | 0.004 | 0.049 |
hist_cemeteries_nn5 | 44,288 | 0.027 | 0.008 | 0.004 | 0.053 |
hist_properties_nn1 | 44,288 | 0.092 | 0.056 | 0.0004 | 0.254 |
hist_properties_nn2 | 44,288 | 0.095 | 0.055 | 0.001 | 0.260 |
hist_properties_nn3 | 44,288 | 0.101 | 0.056 | 0.002 | 0.263 |
hist_properties_nn4 | 44,288 | 0.105 | 0.056 | 0.008 | 0.266 |
hist_properties_nn5 | 44,288 | 0.109 | 0.056 | 0.014 | 0.270 |
church_nn1 | 44,288 | 0.009 | 0.006 | 0.0001 | 0.046 |
church_nn2 | 44,288 | 0.011 | 0.007 | 0.0001 | 0.046 |
church_nn3 | 44,288 | 0.012 | 0.007 | 0.0003 | 0.047 |
church_nn4 | 44,288 | 0.013 | 0.007 | 0.001 | 0.050 |
church_nn5 | 44,288 | 0.014 | 0.008 | 0.001 | 0.054 |
park_location_nn1 | 44,288 | 0.012 | 0.008 | 0.0002 | 0.052 |
park_location_nn2 | 44,288 | 0.015 | 0.008 | 0.001 | 0.052 |
park_location_nn3 | 44,288 | 0.017 | 0.008 | 0.001 | 0.059 |
park_location_nn4 | 44,288 | 0.020 | 0.009 | 0.001 | 0.065 |
park_location_nn5 | 44,288 | 0.021 | 0.009 | 0.002 | 0.072 |
capital_improve_nn1 | 44,288 | 0.020 | 0.015 | 0.00004 | 0.114 |
capital_improve_nn2 | 44,288 | 0.024 | 0.015 | 0.0004 | 0.121 |
capital_improve_nn3 | 44,288 | 0.028 | 0.015 | 0.001 | 0.127 |
capital_improve_nn4 | 44,288 | 0.031 | 0.016 | 0.004 | 0.132 |
capital_improve_nn5 | 44,288 | 0.034 | 0.017 | 0.005 | 0.136 |
price | 44,288 | 406,077.200 | 189,009.900 | 12,000 | 1,182,000 |
ggcorrplot(
round(cor(amenities_feature.list), 1),
p.mat = cor_pmat(amenities_feature.list),
colors = c('#4575b4', "#ffffff",'#fecc5c'),
type="upper",
insig = "pch") +
labs(title = "Correlation across numeric variables\n(amenity features)\n",
caption = 'Figure DATA 2.2.2')+plotTheme()+
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=7),
axis.text.y = element_text(size=7))
#Amenities
<-model_set%>% dplyr::select(
sel.amenitiestarts_with('college_'),
starts_with('daycare_'),
starts_with('MedicalFacilities_'),
starts_with('library_'),
starts_with('hist_'),
starts_with('church'),
starts_with('capital_'),
%>%
price) st_drop_geometry()
From the result of correlation, Most tract features are
price-dependent. pctTotalVacant
has nothing to do with
price and can be deleted.
pctBachelor
,test.high
,
test.middle
, test.elementary
are correlated
can keep one, MedAge
, MedHHInc
,
MedRent
highly relevant, can keep one.
#Tract Feature
<- c("pctTotalVacant", "pctRenterOccupied", "pctWhite","pctPoverty","MedRent" ,"MedHHInc","MedAge","crime_sum", "PctLongCommute", "Pctgrocery", "test.elementary","test.middle", "test.high","PctBachelor",'price')
tract.list
<-model_set[tract.list]%>%
tract_feature.listst_drop_geometry()
<-
numeric.tractselect_if(tract_feature.list, is.numeric) %>% na.omit()
ggcorrplot(
round(cor(numeric.tract), 1),
p.mat = cor_pmat(numeric.tract),
colors = c('#4575b4', "#ffffff",'#fecc5c'),
type="upper",
insig = "pch") +
labs(title = "Correlation across numeric variables\n public service features\n",
caption = 'Figure DATA 2.2.3')+plotTheme()+
theme(axis.title.x = element_text(size=3),axis.title.y = element_text(size=3))
<- c( "pctRenterOccupied", "pctWhite","pctPoverty","MedRent" ,"MedHHInc","MedAge","crime_sum", "PctLongCommute", "Pctgrocery", "test.elementary","PctBachelor",'price') sel.tract.list
From the histograms we can see that most of the indicators are correlated with prices, and that different indicators affect different ranges of prices: - bldggrade(building grade) and extwall related to the high price as well as the price of the low price range. - aheatingty and storyheigh relate to the central and high price range - heatedfuel relates to the central price range
Based on their distribution in different price ranges, we reclassify the nonnumeric feature to facilitate better training learning.
<- select_if(data, is.numeric)%>% na.omit()%>%
numericst_drop_geometry()
<- select_if(data, !(colnames(data)%in% colnames(numeric)))
nonnumeric # 1.Modify extwall:
<-c("STONE")
extwall.stone<-
data %>%
data mutate(extwall.m = ifelse(extwall %in% extwall.stone,"STONE","Non-STONE"))
# 2.Modify Aheatingty:
<-c("HOT WATER")
hotwater<-data %>% mutate(aheatingty.m = ifelse(aheatingty %in% hotwater,"HOT WATER","ELSE" ))
data
# 3.Modify heatedfuel:
<-c("GAS")
gas<-data %>% mutate(heatedfuel.m = ifelse(heatedfuel %in% gas,"GAS","ELSE" ))
data
# 4.Modify bldggrade
<-c("FAIR","MINIMUM","NA")
fairplus<-data %>% mutate(bldggrade.m = ifelse(bldggrade %in% fairplus,"ELSE",bldggrade))
data
# 5.Modify storyheigh
<-c("NA","SPLIT LEVEL","BI-LEVEL","CAPE COD","RANCH W/BSMT")
storyelse<-data %>% mutate(storyheigh.m = ifelse(storyheigh %in% storyelse,"ELSE",storyheigh ))
data
#非数字的列
<- c("descbuildi", "storyheigh.m", "aheatingty.m", "heatedfuel.m", "extwall.m", "foundation", "bldggrade.m","isin","price") nonnumeric.feature1
%>%
model_set st_drop_geometry() %>%
::select(price,extwall,storyheigh,heatedfuel,aheatingty,bldggrade) %>%
dplyrfilter(price <= upperouter) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_bar(position = "dodge", stat = "summary", fun = mean, fill= '#abd9e9') +
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Price as a function of\nnonnumeric variables", y = "Mean_Price",
caption = 'Figure DATA 2.3') +plotTheme()+
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=6), strip.background = element_rect(fill = "white"),panel.grid.minor =element_blank() ,panel.grid.major =element_blank())
In this section, we specifically explored the correlation between sale price and four interesting and meaningful features: Heated Area, Crime Summary, Distance to capital improvement Projects, Distance to libraries.
Through graphs below, it is obvious that the first two of four features have strong correlation with sale price, while for the distance to amenities, there is no particularly strong correlation with price. But in general distance and price are inversely correlated, which is consistent with our intuition. Meanwhile, by comparing the scatter plot and correlation of capital improvement and library, we can see that there is a specific distance boundary for capital improvement, while the distance to library has no specific boundary although it is related to price.
# DATA - 4 Scatter plot
## 1. Two House Internal characteristics
<- c('Heated Area', 'Crime Summary')
features names(features) <- c("heatedarea", "crime_sum")
<-
price.house.plot st_drop_geometry(model_set) %>%
::select(price, heatedarea, crime_sum) %>%
dplyrfilter(price <= upperouter) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5, colour='#abd9e9',alpha=0.5) +
geom_smooth(method = "lm", se=F, colour = 'orange') +
facet_wrap(~Variable, ncol = 2, scales = "free",
labeller = labeller(Variable = features)) +
plotTheme() +
labs(title = "Price as a Function of House Internal Characteristics Variables:\nCrime Summary, Heated Area\n",
caption = 'Figure DATA 2.4.1')
price.house.plot
## 2. Average Distance of Nearest 2 capital improvement projects
<-
price.capital_improve_nn2.plot st_drop_geometry(model_set) %>%
::select(price, capital_improve_nn2) %>%
dplyrfilter(price <= upperouter) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5,colour='#abd9e9',alpha=0.5) +
geom_smooth(method = "lm", se=F, colour = 'orange') +
labs(title = "Price as a Function of Average Distance of\nNearest Two Capital Improvement Projects\n",subtitle="Mecklenburg, NC",
caption = 'Figure DATA 2.4.2') +
plotTheme()
price.capital_improve_nn2.plot
## 3. Average Distance of Nearest 4 libraries
<-
price.library_nn4.plot st_drop_geometry(model_set) %>%
::select(price, library_nn4) %>%
dplyrfilter(price <= upperouter) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5, colour='#abd9e9',alpha=0.5) +
geom_smooth(method = "lm", se=F, colour = 'orange') +
labs(title = "Price as a Function of Average Distance of\nNearest Four Libraries\n",subtitle="Mecklenburg, NC",
caption = 'Figure DATA 2.4.3') +
plotTheme()
price.library_nn4.plot
The sale price per SqFt in Mecklenberg shows a clear spatial pattern. The sale price in Belk, Lofton, Harris is significantly higher than other areas.
#Plot of home price per square feet:
ggplot()+
geom_sf(data=st_union(tract),fill="white", color = "grey")+
geom_sf(data=model_set,aes(color=q5(pricepersq)),show.legend="point",size=.75)+
scale_color_manual(values = palette2,labels = qBr(model_set, "pricepersq"),name = "House Price\nPer Square Feet\n(Quintile Breaks)")+
labs(title="House Price",subtitle = "Mecklenberg County,NC",caption="Figure2.5")+
mapTheme()
After having a preliminary idea of the correlation between variables and the distribution of house prices, we would like to map the correlation with the age of the house, which is not particularly correlated with house prices, to see if the age of the house is completely irrelevant to the overall price distribution.The map results are unexpected, although in some parts of age and the distribution of housing prices are not related, but for some areas, the areas with high housing age but higher prices, but this may be due to these areas are located in the old city, housing construction time but other resources in the district is more excellent.
ggplot()+
geom_sf(data=model_set,aes(color=q5(AGE)),alpha= 0.5,show.legend = "point",size=.75)+
geom_sf(data=st_union(tract),fill= "transparent", color = "grey")+
scale_color_manual(values = palette2,labels = qBr(model_set, "AGE"),name = "Quintile Breaks\n(Year)")+
labs(title="Distribution of the House Age",subtitle = "Based on Each Home Sale Observation\nMecklenburg, NC",caption="Figure2.6.1")+mapTheme()
According to the results of correlation we know that the accessibility of daycare facilities, testscore and price are positively correlated, but it is not clear how much overlap they have in geographic distribution, here we would like to compare and observe.
<- ggplot()+
pcttest1.plot geom_sf(data=st_union(tract),fill="white", color = "grey")+
geom_sf(data=model_set,aes(color=q5(test.elementary)),show.legend = "point",size=.75)+
scale_color_manual(values = c('#eff3ff','#bdd7e7','#6baed6','#3182bd','#08519c'),labels = qBr(model_set, "test.elementary"),name = "Quintile Breaks\n(Percent)")+mapTheme()+
labs(title="Test Proficiency - Elementary School",subtitle = "Percentage of Charlotte-Mecklenburg Schools (CMS) students \nin grades 3-5 proficient in End of Grade testing",caption="Figure2.6.2")
<- model_set %>%
daycare_nn4 mutate(daycare_nn4.expanded = daycare_nn4 * 1000)
<- ggplot() +
daycare.plot geom_sf(data = st_union(tract),fill ="white", color = "grey") +
geom_sf(data = daycare_nn4,aes(color = q5(daycare_nn4.expanded)),size = .75) +
scale_color_manual(values = palette2,
labels = qBr(daycare_nn4,'daycare_nn4.expanded'),name = "Quintile Breaks\n(Percent)") + mapTheme() +
labs(title = 'Average Distrance to Nearest 4 Daycare Center',
subtitle = "Average Distrance to Nearest Daycare\n(timed by 1000 for better visualization)",
caption = 'Figure2.6.3')
grid.arrange(pcttest1.plot, daycare.plot, ncol=2)
To further understand the data set to reduce the error, we counted the NA values of each variable and selected as few as possible with NA values.
#challenge_set
<-data %>% filter(price==0)
challenge_set#model_set
<-data %>% filter(price!=0)
model_set# na
<- colSums(is.na(model_set))
na .100 <-as.data.frame(subset(na, na>100))
na.100<-tibble::rownames_to_column(na.100, "col_names")
na
ggplot(na.100) +
geom_col(aes(x=col_names, y=subset(na, na>100)),fill = '#fecc5c',color="white") +
labs(title="NA counts in DataFrame",
subtitle = "Na > 100\n",
caption = "Figure DATA 2.7") +
xlab('Variable Column Names') +
ylab('Count') + plotTheme()+theme(axis.text.x = element_text(angle = 45, hjust = 1, size=8))
The primary goal of this analysis is to train a robust model to predict house price as accurately as possible. Accordingly, this section started to provide a detailed interpretation on the process of the model building.
Data and Sample:In order to accurately build and
test models, we randomly split the whole model_set
into two
parts: model_set.train
with 77% of observations for model
training and model_set.test
with 23% of observations for
model testing.
Dependent Variable: The dependent variable is
home sale price: price
.
Independent Variables: Based on the hedonic
home price model, features are put into three catogories.
After creating features and engineering and checking correlation
coefficients, 88 features are eligible for the model and obviously or
slightly correlated with price
that were the initial
independent variables for model building. We gradually screened out
variables and eventually selected only 26 robust
features included in the final model
Outliers: For the determination of outlier we refer to zillow’s price tier classification: The ZHVI generally applies to three market tiers - bottom, middle and top - roughly equivalent to one-third of each market. The mid-tier ZHVI includes all homes between 35% and 65% of the value of the home. This mid-market range is closest to the standard definition of a “typical” home in a region, and because of the geographic distribution of this challenge we primarily considered a model with a price range in the middle of the market for training purposes.
Procedure: The primitive approach used in this
analysis for model building is Ordinary Least Squares Regression
(OLS) which is aimed at predicting the price
of
houses as a function of several statistical parameters such as
intercept, slope and selected features. In
order to determine which features were able to significantly increase
the accuracy and generalizability of the model, we systematically
compared, plotted, and mapping the mean absolute errors
(MAE), mean absolute percent errors (MAPE), as
well as root mean square errors (RMSE, the standard
deviation of the residuals) between original SalePrice
and
predicted price
of both training set and test set,
conducted the cross-validation on 100 folds, explored the underlying
spatial pattern of price and errors caused by problematic prediction. To
address the problem of neglecting the underlying spatial correlation
caused by neighborhood, we then compared, plotted, and mapped the mean
absolute errors (MAE), and mean absolute percent errors (MAPE) on the
test set between baseline model without neighborhood features and
updated model with neighborhood features, and then
determined our final model based on series model examinations.
Initially, as indicated in the III. Methods,
the dataset was divided into training set and test set, with the ratio
of 6:4. Here we use paste()
command to make sure to include
all levels of our specified variables in the training set, therefore, in
the end, the raio of training set and test set goes up to 77:23.
#1.Split Model Set:
set.seed(7278)
# get index for training sample:
<-caret::createDataPartition(
inTrainy = paste(model_set$descbuildi,model_set$storyheigh,
$extwall,model_set$bldggrade,
model_set$price),
model_setp = .60,
list=FALSE)
#Find the number of outliers:
<-quantile(model_set$price)[2]
Q1<-quantile(model_set$price)[4]
Q3
= Q3+3*(Q3-Q1)
number_out
# split model_set into train and test
<-model_set[inTrain,]
model_set.train0<-model_set[inTrain,] %>%
model_set.trainfilter(price<=number_out)
<-model_set[-inTrain,]
model_set.test
<-model_set[inTrain,] %>%
outlierfilter(price>Q3+3*(Q3-Q1))
# Feature Collection
##Non-numeric Features
<- model_set.train %>%
nonnumeric ::select(all_of(nonnumeric.feature1))%>%
dplyrst_drop_geometry()
##Internal Features
<-model_set.train %>%
sel.internal::select(all_of(sel.internal.list)) %>%
dplyrst_drop_geometry()
##Tract Features
<-model_set.train %>%
sel.tract::select(all_of(sel.tract.list)) %>%
dplyrst_drop_geometry()
##AmenitY Features
<-model_set.train%>%
sel.amenitie1::select(
dplyrstarts_with('college_'),
starts_with('daycare_'),
starts_with('MedicalFacilities_'),
starts_with('library_'),
starts_with('hist_'),
starts_with('church'),
starts_with('capital_'),
starts_with('park_l'),
%>%
price) st_drop_geometry()
Secondly, we built three models with different features to examine which features should be included.
price
in Model 1.The baseline model contained the following features.
bldggrade.m
;shape_Area
,heatedarea
,fullbaths
,bedrooms
,
units
;pctRenterOccupied
,pctWhite
,MedHHInc
,test.elementary
,crime_sum
,PctLongCommute
,Pctgrocery
,isin
;college_nn5
,`library_nn2
,hist_properties_nn5
,
capital_improve_nn2
.Model summaries were all shown below.
<-model_set.train %>% dplyr::select(price,
train_set.allall_of(nonnumeric.feature1),all_of(sel.internal.list),all_of(sel.tract.list),
starts_with('college_'),
starts_with('daycare_'),
starts_with('MedicalFacilities_'),
starts_with('library_'),
starts_with('hist_'),
starts_with('church'),
starts_with('capital_'),
starts_with('park_l')
%>% st_drop_geometry()
)
<- lm(price ~ ., data = train_set.all)
M0
stargazer(M0, type = "html",
title = "Table 4.1.1 Summary Statistics of Model.1 ",
header = FALSE,
single.row = TRUE)
Dependent variable: | |
price | |
descbuildiGROUP HOME | -41,288.880 (106,072.300) |
descbuildiPATIO HOME | -61,280.650** (25,644.670) |
descbuildiRES | -9,986.222 (7,808.843) |
descbuildiSFR MODULAR | -29,127.580 (29,352.890) |
descbuildiSFR TINY | -43,556.940 (75,228.800) |
storyheigh.m1 STORY | -27,773.900 (74,947.140) |
storyheigh.m1.5 STORY | -20,701.790 (74,958.720) |
storyheigh.m2.0 STORY | -42,492.290 (74,930.780) |
storyheigh.m2.5 STORY | -64,654.710 (75,016.770) |
storyheigh.m3.0 STORY | -85,106.020 (76,397.190) |
storyheigh.mELSE | -49,622.480 (74,995.120) |
aheatingty.mHOT WATER | 32,527.840 (29,398.270) |
heatedfuel.mGAS | 7,524.349*** (1,856.781) |
extwall.mSTONE | 8,385.035 (26,489.620) |
foundationCRAWL SPACE | 23,712.870*** (4,803.560) |
foundationPIER-NO FOUND WALL | 4,674.889 (75,058.990) |
foundationSLAB-ABV GRD | -16,376.690 (21,325.410) |
foundationSLAB-COM | 9,560.336 (53,156.830) |
foundationSLAB-RES | 15,284.410*** (4,905.522) |
bldggrade.mCUSTOM | 133,211.000*** (32,134.720) |
bldggrade.mELSE | -40,230.530*** (4,468.043) |
bldggrade.mEXCELLENT | 254,977.500*** (7,768.602) |
bldggrade.mGOOD | 64,956.860*** (1,812.582) |
bldggrade.mVERY GOOD | 176,920.800*** (2,932.553) |
isin | -8,576.152*** (2,232.975) |
shape_Area | 0.499*** (0.029) |
heatedarea | 82.809*** (1.352) |
numfirepla | 8,501.768*** (1,524.920) |
fullbaths | 24,863.760*** (1,386.495) |
halfbaths | 9,889.747*** (1,564.081) |
bedrooms | -3,807.025*** (801.045) |
units | -42,573.770*** (3,357.588) |
pctRenterOccupied | 38,936.350*** (3,818.520) |
pctWhite | 90,014.570*** (5,055.748) |
pctPoverty | 71,316.520*** (11,335.610) |
MedRent | 2.583 (2.971) |
MedHHInc | 0.402*** (0.040) |
MedAge | 729.766*** (122.373) |
crime_sum | -158.630*** (18.376) |
PctLongCommute | -542.353*** (66.427) |
Pctgrocery | 293.203*** (29.541) |
test.elementary | 770.958*** (67.549) |
PctBachelor | 803.744*** (66.697) |
college_nn1 | 232,770.200*** (87,164.980) |
college_nn2 | -1,676,351.000*** (294,463.700) |
college_nn3 | 1,721,574.000*** (388,118.200) |
college_nn4 | 4,087,610.000*** (656,997.100) |
college_nn5 | -5,557,639.000*** (482,525.800) |
daycare_nn1 | -866,608.500** (346,049.700) |
daycare_nn2 | 2,660,691.000*** (910,155.500) |
daycare_nn3 | -2,323,108.000 (1,590,074.000) |
daycare_nn4 | -2,786,717.000 (2,036,884.000) |
daycare_nn5 | 4,087,477.000*** (1,168,428.000) |
MedicalFacilities_nn1 | -1,636,967.000*** (321,208.000) |
MedicalFacilities_nn2 | 2,554,779.000** (999,597.600) |
MedicalFacilities_nn3 | 1,203,957.000 (1,958,851.000) |
MedicalFacilities_nn4 | -4,606,285.000* (2,636,509.000) |
MedicalFacilities_nn5 | 2,079,076.000 (1,448,033.000) |
library_nn1 | -312,612.700*** (99,276.840) |
library_nn2 | -1,966,036.000*** (311,835.800) |
library_nn3 | 5,982,407.000*** (553,155.900) |
library_nn4 | -4,744,998.000*** (500,061.800) |
library_nn5 | 245,433.500 (276,283.900) |
hist_cemeteries_nn1 | -32,462.860 (224,511.400) |
hist_cemeteries_nn2 | -503,041.600 (574,705.000) |
hist_cemeteries_nn3 | 2,288,197.000** (926,490.600) |
hist_cemeteries_nn4 | -413,975.500 (1,139,783.000) |
hist_cemeteries_nn5 | -1,026,067.000 (688,260.800) |
hist_properties_nn1 | 344,364.800* (206,602.700) |
hist_properties_nn2 | -301,629.400 (381,469.600) |
hist_properties_nn3 | 260,879.700 (717,215.700) |
hist_properties_nn4 | 1,527,876.000* (876,978.600) |
hist_properties_nn5 | -2,154,458.000*** (518,231.600) |
church_nn1 | 288,215.100 (457,448.300) |
church_nn2 | -353,314.400 (1,029,434.000) |
church_nn3 | -1,838,581.000 (1,703,750.000) |
church_nn4 | 8,361,436.000*** (2,664,523.000) |
church_nn5 | -6,927,791.000*** (1,630,967.000) |
capital_improve_nn1 | 412,495.100** (186,382.100) |
capital_improve_nn2 | -2,059,174.000*** (462,101.700) |
capital_improve_nn3 | 5,866,141.000*** (800,195.700) |
capital_improve_nn4 | -7,140,197.000*** (973,193.400) |
capital_improve_nn5 | 5,039,698.000*** (595,806.800) |
park_location_nn1 | 510,160.900* (278,551.900) |
park_location_nn2 | 3,010,701.000*** (672,624.800) |
park_location_nn3 | -3,473,263.000*** (1,057,989.000) |
park_location_nn4 | -2,605,383.000 (1,619,162.000) |
park_location_nn5 | 2,376,481.000** (1,020,605.000) |
Constant | 180,448.100** (76,328.620) |
Observations | 31,287 |
R2 | 0.712 |
Adjusted R2 | 0.711 |
Residual Std. Error | 105,738.300 (df = 31198) |
F Statistic | 876.185*** (df = 88; 31198) |
Note: | p<0.1; p<0.05; p<0.01 |
.1<-model_set.train %>% dplyr::select(
train_set
price,#Non-numeric Feature
bldggrade.m,#Internal Feature
shape_Area ,heatedarea,fullbaths,bedrooms, units,#Tract Feature
pctRenterOccupied , pctWhite ,pctPoverty,MedHHInc,
test.elementary,crime_sum,PctLongCommute,Pctgrocery,MedAge,isin,#Amenities Feature
college_nn4,college_nn5,library_nn3,library_nn4,
library_nn2,hist_properties_nn5,capital_improve_nn4,
capital_improve_nn5,park_location_nn2,%>% st_drop_geometry()
park_location_nn3)
<- lm(price ~ ., data = train_set.1)
M1
stargazer(M1, type = "html",
title = "Table 4.1.2 Summary Statistics of Model.2 ",
header = FALSE,
single.row = TRUE)
Dependent variable: | |
price | |
bldggrade.mCUSTOM | 158,633.500*** (32,722.530) |
bldggrade.mELSE | -39,671.670*** (4,475.939) |
bldggrade.mEXCELLENT | 268,849.500*** (7,754.282) |
bldggrade.mGOOD | 73,074.650*** (1,690.832) |
bldggrade.mVERY GOOD | 190,765.300*** (2,771.332) |
shape_Area | 0.533*** (0.027) |
heatedarea | 82.655*** (1.169) |
fullbaths | 19,889.830*** (1,205.370) |
bedrooms | -5,792.850*** (779.885) |
units | -41,056.630*** (2,945.156) |
pctRenterOccupied | 49,516.610*** (3,591.019) |
pctWhite | 116,152.900*** (4,193.958) |
pctPoverty | 48,076.890*** (10,545.090) |
MedHHInc | 0.623*** (0.032) |
test.elementary | 1,166.981*** (61.624) |
crime_sum | -156.560*** (17.586) |
PctLongCommute | -645.459*** (60.158) |
Pctgrocery | 378.763*** (28.141) |
MedAge | 386.205*** (112.264) |
isin | -6,736.273*** (2,197.607) |
college_nn4 | 2,993,805.000*** (175,618.500) |
college_nn5 | -3,925,922.000*** (162,503.400) |
library_nn3 | 8,355,500.000*** (461,825.900) |
library_nn4 | -6,547,574.000*** (330,246.800) |
library_nn2 | -2,814,707.000*** (185,514.300) |
hist_properties_nn5 | -258,243.100*** (21,605.490) |
capital_improve_nn4 | -2,784,548.000*** (384,867.600) |
capital_improve_nn5 | 4,769,145.000*** (400,984.800) |
park_location_nn2 | 2,742,174.000*** (371,010.500) |
park_location_nn3 | -3,218,861.000*** (384,560.700) |
Constant | 177,698.700*** (8,424.407) |
Observations | 32,671 |
R2 | 0.705 |
Adjusted R2 | 0.705 |
Residual Std. Error | 107,947.200 (df = 32640) |
F Statistic | 2,605.773*** (df = 30; 32640) |
Note: | p<0.1; p<0.05; p<0.01 |
.2<-model_set.train %>%
train_set::select(
dplyr
price,#Non-numeric Feature
bldggrade.m,#Internal Feature
shape_Area ,heatedarea,fullbaths,bedrooms , units,#Tract Feature
pctRenterOccupied , pctWhite,MedHHInc,test.elementary,crime_sum,PctLongCommute,Pctgrocery,isin,#Amenities Feature
college_nn5,library_nn2,hist_properties_nn5,capital_improve_nn2%>% st_drop_geometry()
)
<- lm(price ~ ., data = train_set.2)
M2
stargazer(M2, type = "html",
title = "Table 4.1.3 Summary Statistics of Model.2 ",
header = FALSE,
single.row = TRUE)
Dependent variable: | |
price | |
bldggrade.mCUSTOM | 187,603.300*** (33,322.080) |
bldggrade.mELSE | -44,502.180*** (4,551.078) |
bldggrade.mEXCELLENT | 280,653.900*** (7,873.788) |
bldggrade.mGOOD | 80,274.340*** (1,695.223) |
bldggrade.mVERY GOOD | 199,805.200*** (2,779.488) |
shape_Area | 0.517*** (0.027) |
heatedarea | 80.575*** (1.179) |
fullbaths | 20,678.420*** (1,224.301) |
bedrooms | -6,848.791*** (791.729) |
units | -40,946.250*** (2,990.728) |
pctRenterOccupied | 57,237.990*** (3,487.492) |
pctWhite | 123,198.500*** (4,081.952) |
MedHHInc | 0.682*** (0.029) |
test.elementary | 1,110.838*** (61.527) |
crime_sum | -38.499** (17.298) |
PctLongCommute | -458.964*** (58.735) |
Pctgrocery | 326.338*** (28.401) |
isin | 3,863.954* (2,191.058) |
college_nn5 | -1,027,085.000*** (27,256.650) |
library_nn2 | -705,187.300*** (33,561.000) |
hist_properties_nn5 | -133,047.700*** (14,653.380) |
capital_improve_nn2 | 814,010.000*** (53,984.940) |
Constant | 147,814.400*** (6,864.642) |
Observations | 32,799 |
R2 | 0.695 |
Adjusted R2 | 0.695 |
Residual Std. Error | 109,970.800 (df = 32776) |
F Statistic | 3,392.057*** (df = 22; 32776) |
Note: | p<0.1; p<0.05; p<0.01 |
The results of the regression on the training set are in the
following table. The current model embraced 18 independent features and
reached 0.695 R-square, indicating that 69.5% of the variance for
price
can be explained by this model.
.2<-model_set.train %>%
train_set::select(
dplyr
price,#Non-numeric Feature
bldggrade.m,#Internal Feature
shape_Area ,heatedarea,fullbaths,bedrooms , units,#Tract Feature
pctRenterOccupied , pctWhite,MedHHInc,test.elementary,crime_sum,PctLongCommute,Pctgrocery,isin,#Amenities Feature
college_nn5,library_nn2,hist_properties_nn5,capital_improve_nn2%>% st_drop_geometry()
)
<- lm(price ~ ., data = train_set.2)
M2
stargazer(M2, type = "html",
title = "Table 4.2.1 Summary Statistics of Model.2 ",
header = FALSE,
single.row = TRUE)
Dependent variable: | |
price | |
bldggrade.mCUSTOM | 187,603.300*** (33,322.080) |
bldggrade.mELSE | -44,502.180*** (4,551.078) |
bldggrade.mEXCELLENT | 280,653.900*** (7,873.788) |
bldggrade.mGOOD | 80,274.340*** (1,695.223) |
bldggrade.mVERY GOOD | 199,805.200*** (2,779.488) |
shape_Area | 0.517*** (0.027) |
heatedarea | 80.575*** (1.179) |
fullbaths | 20,678.420*** (1,224.301) |
bedrooms | -6,848.791*** (791.729) |
units | -40,946.250*** (2,990.728) |
pctRenterOccupied | 57,237.990*** (3,487.492) |
pctWhite | 123,198.500*** (4,081.952) |
MedHHInc | 0.682*** (0.029) |
test.elementary | 1,110.838*** (61.527) |
crime_sum | -38.499** (17.298) |
PctLongCommute | -458.964*** (58.735) |
Pctgrocery | 326.338*** (28.401) |
isin | 3,863.954* (2,191.058) |
college_nn5 | -1,027,085.000*** (27,256.650) |
library_nn2 | -705,187.300*** (33,561.000) |
hist_properties_nn5 | -133,047.700*** (14,653.380) |
capital_improve_nn2 | 814,010.000*** (53,984.940) |
Constant | 147,814.400*** (6,864.642) |
Observations | 32,799 |
R2 | 0.695 |
Adjusted R2 | 0.695 |
Residual Std. Error | 109,970.800 (df = 32776) |
F Statistic | 3,392.057*** (df = 22; 32776) |
Note: | p<0.1; p<0.05; p<0.01 |
::glance(M2) %>%
broomkable(caption = 'Table 4.2.2 Summary of Model.2 Evaluation Parameters') %>%
kable_styling("striped", full_width = F)
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.6948266 | 0.6946218 | 109970.8 | 3392.057 | 0 | 22 | -427258.1 | 854564.2 | 854765.7 | 396379127629018 | 32776 | 32799 |
The Mean Absolute Errors (MAE) and Mean Absolute Percent Errors (MAPE) are statistical measures of how accurate a forecast system is.Based on the following table, the MAE is 67903.84, the errors are accounted for 19.20%.
<-M2
training_model
<-model_set.test %>%
train_model_set.testmutate(Regression = "Baseline Regression",
price.Predict = predict(training_model, newdata = model_set.test),
price.Error = price - price.Predict,
price.AbsError = abs(price - price.Predict),
price.APE = (abs(price - price.Predict)) / price)
#%>% filter(price<=1500000)
<-as.data.frame(cbind(mean(train_model_set.test$price.AbsError,na.rm = T),
test_GoodnessFitmean(train_model_set.test$price.APE,na.rm = T)))
colnames(test_GoodnessFit)<-c("Mean Absolute Errors (MAE)","MAPE")
kable(test_GoodnessFit,
caption = 'Table 4.3.1. MAE and MAPE for Test Set') %>%
kable_styling("striped", full_width = F)
Mean Absolute Errors (MAE) | MAPE |
---|---|
67903.84 | 0.1920308 |
grid.arrange(ncol=2,
ggplot()+
geom_sf(data=st_union(zipcode), fill="transparent", color='grey90')+
geom_sf(data=train_model_set.test,
aes(color=q5(price)),show.legend="point",size=1.0)+
scale_color_manual(values = palette2,labels = qBr(train_model_set.test, "price"),name = "Quintile Breaks")+
labs(title="Test set sale price",subtitle = "Mecklenberg County,NC",caption="Figure 4.3.1")+
mapTheme(),
ggplot()+
geom_sf(data=st_union(zipcode), fill="transparent", color='grey90')+
geom_sf(data=train_model_set.test,
aes(color=q5(price.AbsError)),show.legend="point",size=1.0)+
scale_color_manual(values = palette2,labels = qBr(train_model_set.test, "price.AbsError"),name = "Quintile Breaks")+
labs(title="Test set absolute sale price errors",subtitle = "Mecklenberg County,NC",caption="Figure 4.3.2")+
mapTheme())
This section conducted cross-validation to assess how our model would generalize to other independent data sets, in which we could flag problem of overfitting or feature selection bias.
Both a summary table and distribution plots of R-square, RMSE, and MAE in each of 100 fold are shown below.
The variation of 100 folds can also be visualized with a histogram of across-fold MAE. Although most of the errors are clustering tightly together, some errors are scattered, indicating the inconsistency of the model prediction.
# 4. Cross-Validation Test on traning model
##use caret package cross-validation method
<- caret:: trainControl(method = "cv",
fitControl number = 100,
savePredictions = TRUE)
set.seed(2777)
<-
model_set.train.cv ::train(price ~ ., data = st_drop_geometry(train_model_set.test) %>%
caret::select(
dplyr
price,#Non-numeric Feature
bldggrade.m,#Internal Feature
shape_Area ,heatedarea,fullbaths,bedrooms , units,#Tract Feature
pctRenterOccupied , pctWhite ,MedHHInc ,test.elementary,crime_sum,PctLongCommute,Pctgrocery,isin,#Amenities Feature
college_nn5,library_nn2,hist_properties_nn5,capital_improve_nn2
),method = "lm",
trControl = fitControl,
na.action = na.pass)
#resample<-model_set.train.cv$resample
## show RMSE, MAE, R^2
kable(model_set.train.cv$resample,
caption = 'Table 4.4.1 Cross-validation Test: Summary of RMSE, R Squared and MAE') %>%
kable_styling("striped", full_width = F)
RMSE | Rsquared | MAE | Resample |
---|---|---|---|
80912.91 | 0.7522926 | 61988.09 | Fold001 |
121436.00 | 0.7463505 | 74018.64 | Fold002 |
84759.52 | 0.8062709 | 58258.07 | Fold003 |
106158.15 | 0.6756605 | 70056.23 | Fold004 |
77961.65 | 0.7357602 | 57834.81 | Fold005 |
73482.07 | 0.6995119 | 60563.83 | Fold006 |
90359.49 | 0.8297619 | 61888.08 | Fold007 |
123246.61 | 0.6600689 | 72162.88 | Fold008 |
89536.26 | 0.7077044 | 68944.49 | Fold009 |
79729.21 | 0.8622709 | 60546.61 | Fold010 |
105376.36 | 0.8211884 | 74854.02 | Fold011 |
85153.64 | 0.7593606 | 62359.78 | Fold012 |
138457.91 | 0.7389223 | 76872.49 | Fold013 |
117106.38 | 0.6930716 | 68160.00 | Fold014 |
77425.56 | 0.7984826 | 58554.28 | Fold015 |
81324.56 | 0.7790243 | 65255.28 | Fold016 |
76080.88 | 0.7718194 | 59405.70 | Fold017 |
77030.65 | 0.8606175 | 56063.12 | Fold018 |
82696.65 | 0.6776215 | 61610.77 | Fold019 |
74572.40 | 0.8444962 | 59572.49 | Fold020 |
96769.75 | 0.7429385 | 74764.35 | Fold021 |
81408.27 | 0.7492455 | 59731.98 | Fold022 |
72124.31 | 0.7617940 | 57996.25 | Fold023 |
93899.61 | 0.7655386 | 68447.70 | Fold024 |
92350.77 | 0.7479819 | 69831.77 | Fold025 |
79315.84 | 0.7646417 | 61775.05 | Fold026 |
88734.00 | 0.6529469 | 68161.92 | Fold027 |
85941.28 | 0.6139372 | 63653.07 | Fold028 |
84634.52 | 0.7608822 | 61516.12 | Fold029 |
157107.57 | 0.5184221 | 74765.82 | Fold030 |
95249.51 | 0.7892243 | 69531.25 | Fold031 |
229790.59 | 0.3267036 | 84076.76 | Fold032 |
111974.08 | 0.8990451 | 71539.04 | Fold033 |
75757.38 | 0.7395878 | 59932.93 | Fold034 |
81072.31 | 0.7906971 | 60444.22 | Fold035 |
81416.65 | 0.7585552 | 64237.95 | Fold036 |
82397.53 | 0.7850780 | 60302.60 | Fold037 |
76614.44 | 0.8437298 | 59364.15 | Fold038 |
77155.37 | 0.6453661 | 61788.12 | Fold039 |
88923.53 | 0.7384771 | 67362.11 | Fold040 |
77696.50 | 0.6399515 | 57394.28 | Fold041 |
96197.80 | 0.7283474 | 68342.53 | Fold042 |
85975.44 | 0.7235721 | 66455.44 | Fold043 |
82635.11 | 0.7420236 | 66598.59 | Fold044 |
191299.54 | 0.8068155 | 76287.26 | Fold045 |
77791.53 | 0.7771961 | 62149.63 | Fold046 |
84071.75 | 0.7720400 | 55966.00 | Fold047 |
128838.55 | 0.8780460 | 71954.15 | Fold048 |
117833.10 | 0.6207811 | 77862.55 | Fold049 |
88806.16 | 0.8260653 | 64398.22 | Fold050 |
222750.62 | 0.3160261 | 93176.26 | Fold051 |
67587.63 | 0.8609748 | 55261.32 | Fold052 |
131942.65 | 0.5363726 | 67728.07 | Fold053 |
78807.51 | 0.6802213 | 63149.29 | Fold054 |
111583.45 | 0.7117481 | 71604.75 | Fold055 |
75843.45 | 0.7808069 | 62636.57 | Fold056 |
79765.84 | 0.8380014 | 52958.86 | Fold057 |
152709.24 | 0.3062003 | 69524.11 | Fold058 |
91941.43 | 0.6870728 | 66069.61 | Fold059 |
83946.55 | 0.9090477 | 59549.83 | Fold060 |
82260.87 | 0.8998816 | 54390.27 | Fold061 |
94727.41 | 0.7267390 | 71415.28 | Fold062 |
74954.90 | 0.7525417 | 59888.64 | Fold063 |
84846.60 | 0.7388638 | 60976.16 | Fold064 |
87286.07 | 0.6511729 | 61133.68 | Fold065 |
73718.70 | 0.6295785 | 57520.68 | Fold066 |
100651.76 | 0.7853676 | 67289.36 | Fold067 |
92009.72 | 0.8420283 | 60763.85 | Fold068 |
83402.51 | 0.6873405 | 63908.65 | Fold069 |
78725.59 | 0.7958884 | 62264.96 | Fold070 |
82053.26 | 0.7329766 | 62662.11 | Fold071 |
117373.73 | 0.7646259 | 65532.60 | Fold072 |
117038.68 | 0.8892069 | 68011.10 | Fold073 |
78187.35 | 0.7468970 | 59884.77 | Fold074 |
127063.38 | 0.8246603 | 77283.72 | Fold075 |
82866.05 | 0.7934121 | 63257.25 | Fold076 |
80706.16 | 0.7181297 | 64232.99 | Fold077 |
100171.13 | 0.6949291 | 69780.16 | Fold078 |
78924.85 | 0.7310373 | 60734.92 | Fold079 |
209152.19 | 0.2315615 | 78269.54 | Fold080 |
80822.99 | 0.7194550 | 59160.08 | Fold081 |
93313.42 | 0.8006180 | 63157.04 | Fold082 |
80407.82 | 0.6689253 | 65451.87 | Fold083 |
81344.46 | 0.8013644 | 60615.17 | Fold084 |
92029.97 | 0.7730008 | 66479.96 | Fold085 |
93015.49 | 0.6081515 | 66038.53 | Fold086 |
111518.41 | 0.7668593 | 74610.24 | Fold087 |
88582.27 | 0.7614341 | 61794.83 | Fold088 |
90290.17 | 0.7626602 | 66784.21 | Fold089 |
90975.72 | 0.8030460 | 65394.76 | Fold090 |
95298.47 | 0.8801259 | 64869.53 | Fold091 |
100044.48 | 0.5777809 | 67635.81 | Fold092 |
101148.57 | 0.7557638 | 72189.06 | Fold093 |
82931.81 | 0.7227645 | 63470.03 | Fold094 |
96950.00 | 0.5679764 | 65768.87 | Fold095 |
97109.87 | 0.6834293 | 69545.44 | Fold096 |
88167.20 | 0.7719486 | 58542.91 | Fold097 |
73682.37 | 0.6315007 | 56274.40 | Fold098 |
72719.75 | 0.7735177 | 57527.89 | Fold099 |
127925.53 | 0.7875308 | 78286.84 | Fold100 |
$resample %>%
model_set.train.cvpivot_longer(-Resample) %>%
mutate(name = as.factor(name)) %>%
ggplot(., aes(x = name, y = value, color = name)) +
geom_jitter(width = 0.1) +
facet_wrap(~name, ncol = 3, scales = "free") +
theme_bw() +
theme(legend.position = "none") +
labs(title = 'Cross-validation Test: Distribution of MAE, RMSE, R Squared\n',
caption = "Figure DATA 4.4.1") +
plotTheme()
ggplot(data = model_set.train.cv$resample) +
geom_histogram(aes(x = model_set.train.cv$resample$MAE), fill = '#fecc5c',color="white",bins=30) +
labs(title="Distribution of Cross-validation MAE",
subtitle = "K = 100\n",
caption = "Figure DATA 4.4.2") +
xlab('MAE of Training Model') +
ylab('Count') + plotTheme()
The Figure 5.5.1 below shows the predicted prices as a function of observed price. This plot indicates that our model can predict price but not so perfectly since the regression line (shown in dark-blue) and the reference line (shown in dark-red) are not completely overlapped.
# 5. Plot predicted prices as a function of observed prices
<-model_set[inTrain,] %>% filter(price<=3000000)
model_set.train0<- data.frame(pred = predict(training_model, model_set.train0),
preds.train actual = model_set.train0$price,
source = "Training Set")
<- data.frame(pred = predict(training_model, train_model_set.test),
preds.test actual = train_model_set.test$price,
source = "Test Set")
<- rbind(preds.train, preds.test)
preds
## Overall Plotting
ggplot(preds.test, aes(x = actual, y = pred, color = source)) +
geom_point(color='#abd9e9') +
geom_smooth(method = "lm", color = "darkblue") +
geom_abline(color = "darkred") +
coord_equal() +
theme_bw() +
labs(title = "Predicted Prices as a\nFunction of Observed Prices",
subtitle = 'Blue line is model regression line\nwhile red line is reference line\n',
caption = 'Figure DATA 4.5.1',
x = "Observed Prices",
y = "Predicted Prices") +
theme(
legend.position = "none"
+
) plotTheme()
<- ggplot(preds, aes(x = actual, y = pred,color=source)) +
vs.plotgeom_point() +
geom_smooth(method = "lm", color = "darkblue") +
geom_abline(color = "darkred") +
coord_equal() +
theme_bw() +
facet_wrap(~source, ncol = 2) +
labs(title = "Predicted Prices as a Function of Observed Prices, by Sets",
subtitle = 'Blue line is model regression line while red line is reference line\n',
caption = 'Figure DATA 4.5.2',
x = "Observed Prices",
y = "Predicted Prices") +
scale_color_manual(values = c('#abd9e9', '#fecc5c'))+
theme(
legend.position = "none"
+
) plotTheme()
+ geom_vline(xintercept=3000000, color='orange') +
vs.plot annotate("text", x = 3300000, y = 2000000, label = "Traning set\n upperbound",size=3)
In this section, we leveraged three different tools to examine if prediction errors display certain spatial patterns, and accordingly to determine whether we need to add more spatial features in the existing model.
Mapping residuals: In the map of the residuals on the test set, we can see there is no clear pattern of the error directions.
Plotting spatial lags: A spatial lag is a variable that averages the neighboring values of a location. The spatial lag of error plot shows that as home price errors increase, the nearby home price errors slightly increase, indicating that our model errors are rarely spatially autocorrelated.
Moran’s I: Another approach for measuring spatial autocorrelation is Moran’s I. In this model Moran’s I is around 0.18, indicating a certain level of clustering on model errors. The frequency of all 999 randomly permuted I were plotted as a histogram with the Observed I indicated by the orange line (see Figure 4.6.3). That the observed I is higher than all of the 999 randomly generated I’s provides visual confirmation of spatial autocorrelation despite being slight. This suggests that variation in price is likely related to the spatial process has been omitted from our current model.
Therefore, we thought it was worthy to include neighborhood features in the model for the further test.
ggplot()+
geom_sf(data=st_union(tract),fill="white",color='grey90')+
geom_sf(data=train_model_set.test,aes(color=q5(price.Error)),show.legend = "point",size=1.2)+
scale_color_manual(values=palette5,labels = qBr(train_model_set.test,'price.Error'),
name = "Residuals")+
labs(title = 'Map of residuals on test set\n',subtitle = "Mecklenburg County, NC",
caption = 'Figure 4.6.1') +
mapTheme()
<-train_model_set.test %>% filter(is.na(price.AbsError)==F)
abs<-st_coordinates(abs)
coords.test<- knn2nb(knearneigh(coords.test, 5))
neighborList.test
<- nb2listw(neighborList.test, style="W")
spatialWeights.test
$lagPrice.Error <- lag.listw(spatialWeights.test, abs$price.Error,NAOK = TRUE)
abs
ggplot(abs, aes(x = lagPrice.Error, y = price.Error)) +
geom_point(colour = '#abd9e9') +
geom_smooth(method = "lm", se = FALSE, colour ='orange') +
labs(title = "Error on Test Set as a function of the Spatial Lag of Price\n",
caption = "Figure 4.6.2",
x = "Spatial lag of errors (Mean error of 5 nearest neighbors)",
y = "Errors of Sale Price") +
plotTheme()
<- moran.mc(abs$price.AbsError,
moranTest nsim = 999)
spatialWeights.test,
<- ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
moran.plot geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "orange",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in orange, I = 0.18",
x="Moran's I",
y="Count",
caption="Figure4.6.3") +
plotTheme()
moran.plot
Existing findings have indicated that neighborhood could also be a
potential feature for model building. Hence, this section built a new
model called Neighborhood Effects model based on the
baseline model by adding neighborhood feature zip
.
.3<-model_set.train %>% filter(price<=1100000) %>%
train_set::select(
dplyr
price,#Non-numeric Feature
bldggrade.m,#Internal Feature
shape_Area ,heatedarea,fullbaths,bedrooms , units,#Tract Feature
pctRenterOccupied , pctWhite ,MedHHInc ,test.elementary,crime_sum,PctLongCommute,Pctgrocery,zip,isin,#Amenities Feature
%>% st_drop_geometry()
college_nn5,library_nn2,hist_properties_nn5,capital_improve_nn2)
<- lm(price ~ ., data = train_set.3)
Nhood
stargazer(Nhood, type = "html",
title = "Table 4.7.1 Summary Statistics of Neighborhood Effects Model ",
header = FALSE,
single.row = TRUE)
Dependent variable: | |
price | |
bldggrade.mCUSTOM | 26,376.520 (38,667.860) |
bldggrade.mELSE | -31,650.480*** (4,282.240) |
bldggrade.mEXCELLENT | 231,077.900*** (8,117.218) |
bldggrade.mGOOD | 71,902.980*** (1,633.686) |
bldggrade.mVERY GOOD | 178,487.400*** (2,678.114) |
shape_Area | 0.548*** (0.026) |
heatedarea | 82.809*** (1.119) |
fullbaths | 19,921.830*** (1,154.489) |
bedrooms | -5,483.186*** (782.413) |
units | -43,767.990*** (2,808.664) |
pctRenterOccupied | 20,847.170*** (3,485.965) |
pctWhite | 66,434.720*** (4,796.226) |
MedHHInc | 0.477*** (0.032) |
test.elementary | 774.515*** (60.706) |
crime_sum | -166.230*** (18.298) |
PctLongCommute | -442.497*** (64.490) |
Pctgrocery | 155.427*** (28.238) |
zip28031 | 78,749.000** (36,398.030) |
zip28036 | 13,920.070 (36,363.820) |
zip28078 | 4,354.406 (36,080.780) |
zip28104 | 116,992.300*** (41,153.540) |
zip28105 | 62,712.940* (36,552.890) |
zip28107 | 40,377.610 (48,779.050) |
zip28134 | 91,440.900** (36,903.350) |
zip28202 | 244,731.500*** (46,631.940) |
zip28203 | 237,280.200*** (37,020.310) |
zip28204 | 206,283.400*** (37,722.830) |
zip28205 | 157,953.300*** (36,437.010) |
zip28206 | 100,655.400*** (37,004.960) |
zip28207 | 233,806.900*** (38,290.860) |
zip28208 | 84,233.620** (36,601.070) |
zip28209 | 202,412.800*** (36,675.530) |
zip28210 | 118,848.300*** (36,575.750) |
zip28211 | 144,037.000*** (36,523.230) |
zip28212 | 63,046.490* (36,583.670) |
zip28213 | 51,518.250 (36,774.660) |
zip28214 | 66,615.310* (36,485.170) |
zip28215 | 74,792.990** (36,526.410) |
zip28216 | 59,080.300 (36,311.460) |
zip28217 | 41,638.530 (36,971.960) |
zip28226 | 75,541.000** (36,581.770) |
zip28227 | 66,664.740* (36,504.830) |
zip28262 | 31,160.350 (36,683.380) |
zip28269 | 38,916.500 (36,223.850) |
zip28270 | 76,740.100** (36,547.930) |
zip28273 | 88,377.940** (36,852.600) |
zip28277 | 114,149.600*** (36,627.970) |
zip28278 | 122,936.900*** (36,831.400) |
isin | -4,633.993** (2,169.319) |
college_nn5 | -148,562.200** (61,725.460) |
library_nn2 | -536,522.900*** (47,126.170) |
hist_properties_nn5 | -633,455.000*** (45,726.650) |
capital_improve_nn2 | 995,714.400*** (66,391.810) |
Constant | 114,512.000*** (36,962.150) |
Observations | 32,574 |
R2 | 0.715 |
Adjusted R2 | 0.714 |
Residual Std. Error | 101,761.100 (df = 32520) |
F Statistic | 1,536.509*** (df = 53; 32520) |
Note: | p<0.1; p<0.05; p<0.01 |
The Mean Absolute Errors (MAE) and Mean Absolute Percent Errors (MAPE) are statistical measures of how accurate a forecast system is.Based on the following table, the MAE is 65032.58, the errors are accounted for 18.24%. The Neighborhood Effects Model is slightly better than the Baseline Regression Model.
<-model_set.test %>%
Nhood_model.testmutate(Regression = "Neighborhood Effects",
price.Predict = predict(Nhood, newdata = model_set.test),
price.Error = price - price.Predict,
price.AbsError = abs(price - price.Predict),
price.APE = (abs(price - price.Predict)) / price)
#%>% filter(price<=1500000)
<-as.data.frame(cbind(mean(Nhood_model.test$price.AbsError,na.rm = T),
test_GoodnessFit2mean(Nhood_model.test$price.APE,na.rm = T)))
colnames(test_GoodnessFit2)<-c("Mean Absolute Errors (MAE)","MAPE")
kable(test_GoodnessFit2,
caption = 'Table 4.7.2 MAE and MAPE for Neighborhood Effects') %>%
kable_styling("striped", full_width = F)
Mean Absolute Errors (MAE) | MAPE |
---|---|
64826.58 | 0.1808676 |
Applying the new Neighborhood Effects model on the
whole dataset, the distribution of sales price is shown in the map
below. Sale prices are generally higher in south and north Mecklenburg
County. There is no significant difference between the pattern of the
distribution of price predictions in the two datasets
(CHALLENGE
& MODELLING
).
# $392500: Median house value in Charlotte, NC, 2022 August.
$Predicted.Price<- predict(Nhood, newdata = data)
data<-data %>% mutate(Predicted.Price=ifelse(is.na(Predicted.Price),392500,Predicted.Price))
data1
ggplot() +
geom_sf(data = st_union(tract),fill = 'white', color='gray90')+
geom_sf(data = data1,aes(color = q5(Predicted.Price)),size = .75) +
scale_color_manual(values = palette5,
labels = qBr(data,'Predicted.Price'),
name = "Price") +
facet_wrap(~toPredict) +
labs(title = 'Map of Predicted Sale Price\n',
subtitle = 'Mecklenburg, NC',
caption = 'Figure 4.8') +
mapTheme()
Indicated in the summary table as well as the choropleth map below, MAPE varies across neighborhoods (only 31 neighborhoods left in the test set).
<- Nhood_model.test %>%
nhood_sum group_by(zip) %>%
summarize(meanPrice = mean(price, na.rm = T),
meanPrediction = mean(price.Predict, na.rm = T),
meanMAE = mean(price.AbsError, na.rm = T),
meanMAPE = mean(price.APE, na.rm = T))
%>%
nhood_sum st_drop_geometry ()%>%
arrange(desc(meanMAPE)) %>%
kable(caption = 'Table 4.9.1 Map of MAPE by Neighborhood') %>%
kable_styling("striped", full_width = F)
zip | meanPrice | meanPrediction | meanMAE | meanMAPE |
---|---|---|---|---|
28206 | 302294.6 | 306900.2 | 65613.83 | 0.2898917 |
28217 | 268027.3 | 256764.5 | 62816.12 | 0.2834093 |
28202 | 1075000.0 | 781470.2 | 293529.85 | 0.2730510 |
28208 | 280820.1 | 276215.9 | 63010.26 | 0.2622166 |
28031 | 454742.3 | 474134.7 | 102210.32 | 0.2481083 |
28209 | 574234.4 | 568169.0 | 136564.86 | 0.2417120 |
28207 | 1380000.0 | 1010612.8 | 399460.15 | 0.2359640 |
28036 | 522035.3 | 522523.0 | 117511.25 | 0.2347435 |
28211 | 698479.6 | 626548.7 | 166087.37 | 0.2287594 |
28205 | 430583.1 | 441666.0 | 76832.06 | 0.2252741 |
28203 | 678025.6 | 693575.3 | 117119.47 | 0.2078613 |
28216 | 301299.0 | 299101.9 | 56398.98 | 0.2011882 |
28213 | 293816.8 | 294920.6 | 54584.41 | 0.1996086 |
28227 | 337765.6 | 333025.3 | 60205.70 | 0.1949460 |
28262 | 322175.6 | 313407.7 | 66999.60 | 0.1944832 |
28204 | 834266.7 | 756309.9 | 153203.80 | 0.1851236 |
28212 | 280618.4 | 277305.3 | 49801.35 | 0.1840066 |
28269 | 310648.3 | 305612.5 | 51840.69 | 0.1729488 |
28214 | 296309.4 | 294868.0 | 47349.03 | 0.1706755 |
28226 | 527570.8 | 503374.5 | 89843.24 | 0.1619056 |
28278 | 404804.8 | 414181.5 | 62373.07 | 0.1612400 |
28273 | 318289.1 | 319602.1 | 46263.07 | 0.1585278 |
28270 | 501356.1 | 503340.1 | 76674.33 | 0.1575808 |
28078 | 414550.2 | 423749.8 | 63269.58 | 0.1572238 |
28210 | 506077.4 | 482239.1 | 80324.70 | 0.1545924 |
28215 | 297685.4 | 299806.0 | 44688.39 | 0.1541592 |
28134 | 366493.8 | 353847.9 | 54851.67 | 0.1526134 |
28105 | 386259.7 | 377921.8 | 53941.96 | 0.1479396 |
28277 | 523426.3 | 511843.4 | 78758.56 | 0.1477522 |
28104 | 408500.0 | 424366.1 | 30888.99 | 0.0918594 |
NA | 490000.0 | NaN | NaN | NaN |
The following Figure 4.9.2 shows the difference between the Baseline Regression model and Neighborhood Effects model on the mean MAPE on the test set by neighborhood, and we can see that the latter model performs better. Besides, in the test set, both model exhibited larger MAPE in Downtown Charlotte.
<-colorRampPalette(c("#fff7fb","#ece2f0","#d0d1e6","#a6bddb","#67a9cf","#3690c0","#02818a","#016c59","#014636"))
pal<-pal(90)
pal_1
<-bothRegression<-rbind(train_model_set.test,Nhood_model.test)
bothRegression
st_drop_geometry(bothRegression) %>%
group_by(Regression, zip) %>%
summarize(mean.MAPE = mean(price.APE, na.rm = T)) %>%
ungroup() %>%
left_join(zipcode) %>%
st_sf() %>%
ggplot() +
geom_sf(aes(fill = mean.MAPE),color="white") +
facet_wrap(~Regression) +
scale_fill_gradientn(colors=palette2,
name = "MAPE") +
labs(title = "Mean test set MAPE by neighborhood",subtitle = "Mean test set MAPE by neighborhood",caption = "Figure 4.9.1") +mapTheme()
This section plotted MAPE by neighborhood as a function of mean price by neighborhood. Neighborhoods with mean sale price below or above 250,000 have higher MAPE, it has the lowest MAPE around 250,000. This finding indicates that most errors occurred in the neighborhoods where house price was relatively lower or relatively higher.
ggplot()+
geom_point(data=nhood_sum,aes(x=meanPrice,y=meanMAPE), color='#abd9e9')+
stat_smooth(data=nhood_sum,aes(x = meanPrice, y = meanMAPE), method = "loess", se = FALSE, size = 1, colour="orange") +
labs(title = "MAPE by Neighborhood as a Function of\nMean Price by Neighborhood\n",
caption = 'Figure DATA 4.10') +
xlab('Mean Price by Neighborhood') +
ylab('Mean MAPE by Neighborhood') +
plotTheme()
This section gathered Census data to test how well baseline model generalizes to different group contexts, such as race group and income level.
we selected four census variables to display the spatial pattern of each group: * Majority White vs. Majority Non-White: divided whether the proportion of white population is accounted for more than 50% of total population or not. * High Income vs. Low Income: divided whether the median household income in certain census tract is above or below the average level.
<-model_set.test %>%
censusmutate(raceContext = ifelse(pctWhite > .5, "Majority White", "Majority Non-White"),
incomeContext = ifelse(MedHHInc > mean(MedHHInc,na.rm = T), "High Income", "Low Income"),
vacantContext = ifelse(pctTotalVacant > .5, "Majority Vacant", "Majority Non-Vacant"),
pctRenterContext = ifelse(pctPoverty > .5, "Majority Renter Occupied", "Majority Non-Renter Occupied")) %>%dplyr::select(raceContext,incomeContext,vacantContext,pctRenterContext)
<-census %>% st_transform(crs=4326)
census
grid.arrange(ncol = 2,
ggplot() +
geom_sf(data = st_union(tract), fill = 'white',color='grey90') +
geom_sf(data = na.omit(census), aes(color = raceContext),
show.legend = "point", size= .75) +
scale_color_manual(values = c('#abd9e9', '#fecc5c'), name="Race Context") +
labs(title = "Race Context of House\n",
caption = 'Figure 4.11.1') +
mapTheme() +
theme(legend.position="bottom"),
ggplot() +
geom_sf(data = st_union(tract), fill = 'white',color='grey90') +
geom_sf(data = na.omit(census), aes(color = incomeContext),
show.legend = "point", size= .75) +
scale_color_manual(values = c('#abd9e9', '#fecc5c'), name="Income Context") +
labs(title = "Income Context of House\n",
caption = 'Figure 4.11.2') +
mapTheme() +
theme(legend.position="bottom"))
MAE, and MAPE of both regression models are shown below. The finding was not intuitive that the new model performed a little better in terms of MAE and MAPE.
Hence, it is imperative to further examine how well both models could be generalized in race context and income level context, and then to determine which model should be remained.
<-rbind(train_model_set.test,Nhood_model.test) %>% dplyr::select(Regression,price.AbsError,price.APE)
bothRegression
st_drop_geometry(bothRegression) %>% gather(Variable,Value,-Regression) %>% group_by(Regression, Variable) %>%
summarize(meanValue = mean(Value, na.rm = T)) %>%
spread(Variable, meanValue) %>%
kable(caption = 'Table 4.11.1\nSummary of MAE, and MAPE by Regression Models') %>% kable_styling("striped", full_width = F)
Regression | price.AbsError | price.APE |
---|---|---|
Baseline Regression | 67903.84 | 0.1920308 |
Neighborhood Effects | 64826.58 | 0.1808676 |
Considering context groups, our new model with neighborhood features performs better in terms of MAPE and MAE.
The Neighborhood Effects model generalizes well on both context groups with close MAPE to the one generated on the test group.
#scales::percent
<-rbind(train_model_set.test,Nhood_model.test)
bothRegression
<- st_join(bothRegression, census) %>%
race.group group_by(Regression, raceContext) %>%
summarize(mean.MAPE = (mean(price.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(raceContext, mean.MAPE) %>%
kable(caption = "Table 4.11.2\nTest set MAPE by Neighborhood Racial Context") %>%
kable_styling("striped", full_width = F)
race.group
Regression | Majority Non-White | Majority White |
---|---|---|
Baseline Regression | 0.1987990 | 0.1849117 |
Neighborhood Effects | 0.1851025 | 0.1764149 |
#scales::percent
<- st_join(bothRegression, census) %>%
income.group group_by(Regression, incomeContext) %>%
summarize(mean.MAPE = (mean(price.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(incomeContext, mean.MAPE) %>%
kable(caption = "Table 4.11.3\nTest set MAPE by Neighborhood Income Context") %>%
kable_styling("striped", full_width = F)
income.group
Regression | High Income | Low Income |
---|---|---|
Baseline Regression | 0.1764773 | 0.2057311 |
Neighborhood Effects | 0.1668223 | 0.1932346 |
Eventually, we settled down with a model built with the Neighborhood Effects Model, and conducted prediction on the secret data set. Summary statistics of predicted sale price is attached below.
<-Nhood
MODEL
<-challenge_set %>%
challenge_set.predictmutate(challenge.predict = predict(MODEL, newdata = challenge_set)) %>%
mutate(challenge.predict=ifelse(is.na(challenge.predict),515144,challenge.predict)) %>% dplyr::select(musaID,challenge.predict)
write.csv(challenge_set.predict,"pred_challenge_set.csv")
Accuracy and Generalizability: Overall, the predictive model is relatively effective in terms of predicting sale price in the Mecklenburg County, NC. The model captured around 70% of variation in prices we were required to predict. For the sake of accuracy, the MAE on the test set is around 64K with 18.08% MAPE. As for the generalizability, MAPEs are between 16% and 19% considering the context groups to which the model was applied. A potential weakness of the model should be addressed is if the generalizability test result was overestimated or manipulated since proportion of white population and median household income were both features in the model.
Feature Effect: The baseline model contained the following features.
bldggrade.m
;shape_Area
,heatedarea
,fullbaths
,bedrooms
,
units
;pctRenterOccupied
,pctWhite
,MedHHInc
,test.elementary
,
crime_sum
,PctLongCommute
,Pctgrocery
,isin
,zip
;college_nn5
,library_nn2
,hist_properties_nn5
,capital_improve_nn2
.The highlight of this model is including whether the house is
in Business Investment Opportunity Zones (isin
) as
one of the features to predict sale price. Since Business
Investment Opportunity Zones displays areas that are within a
1/4 mile of a commercial property that has a value 70% the County
average and a median household income of <= 55K in Mecklenburg
County, we regarded it as an interesting addition to display the spatial
environment the house unit is located at.
Considering the neighborhood effects, the geographical feature
finally added into the model is the
zipcode(zip
) instead of census
tract(GEOID
). The reason is that some indicators
of house price (e.g. crime, transit), need scale to exist. Blocks in
zipcode is slightly larger than that of census
tract, therefore we think it may better help explain why the
price of houses cluster.
The accessibility of education resources is an
easy-ignored but powerful feature to show how amenities affect the local
house price. Rather than only focus on the entertainment demand, we
cared more about if the local residents’requirements on education have
been refilled. So we added some education-related features, such as
test.elementary
, and distance to libraries(knn). Evidence
has indicated that such a consideration is worthy.
Additionally, although we included far more amenity features in the initial stage, such as the proximity to churches, historical cemeteries, and the proximity to medical facilities, those features were eventually excluded from the model based on the series of rigorous model evaluation. However, we still believe it is worthy taking those features into consideration and they might be helpful when being used in other regression models.
Errors and Spatial Variations: Generally speaking, larger MAEs occurred in the Downtown Charlotte and north region of Mecklenburg County, while the model performed better in the rest of areas. One guess is potential spatial features that are directly related to the two areas have been somehow missed out. But we still feel confident about the model based on the examinations above.
If all the model examination and interpretation have been conducted correctly as shown above, we would like to recommend the Neighborhood Effects model to Zillow as it incorporates some Mecklenburg County’s local intelligent features that were absent in the previous models and performed comparatively well on either accuracy or generalizability.
To further reduce the error rate, more amenity features such as local restaurants and bars (data unavailable so far), financial related agencies, public transportation and spatial features such as zoning should be taken into consideration. And more context studies on downtown Charlotte and north region are also highly needed since we would like to extract some similarities between the two regions that would be the magic ingredients of cooking models in the future.