Beginning in 1992 with the publication of the Bike 2000 Plan, the City of Chicago has consistently demonstrated its commitment to enabling and encouraging bicycling as a safe, healthy, environmentally friendly, and fun way to travel. However, from the pricing of bike-sharing to the location of its stations, bike-sharing has faced persistent issues of EQUITY and SAFETY.
From 2021 to 2022 alone, the number of bike-sharing stations in Chicago increased from 828 to 1,481. According to Divvy, Chicago's largest bike-sharing provider, its site selection takes full account of the principles of EQUITY as well as SAFETY. The purpose of this study is to test the reasonableness of that site selection. I use 2021 bicycle-use data to predict bicycle use in 2022 (given that travel in 2021 was still affected by COVID-19, the study focuses more on the impact of spatial distribution).
We apply two different spatial forecasting approaches to predict bicycle use: a regression forecast incorporating spatial process features and a kernel density forecast based on a single factor, the density of past usage. We then compare the predictions to actual usage in 2022 to see whether the increase in stations reduces the inequity of bicycle use. We also obtain an intuitive view of the spatial influence of the station distribution through a simple kernel density calculation.
In this section, we load the necessary libraries, create plot and map theme options, and define helper functions for quintile breaks and average nearest-neighbor distance used throughout the analysis.
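The quintile-break and nearest-neighbor helpers themselves are not reproduced below; a minimal sketch of what they might look like, assuming dplyr and the FNN package (the actual implementations used in this project may differ):

library(dplyr) # ntile()
library(FNN)   # get.knnx()

# q5: cut a numeric vector into quintile classes for mapping
q5 <- function(variable) {
  as.factor(ntile(variable, 5))
}

# nn_function: for each point in measureFrom, the average distance
# to its k nearest neighbors among the points in measureTo
nn_function <- function(measureFrom, measureTo, k) {
  nn.dist <- get.knnx(measureTo, measureFrom, k)$nn.dist
  rowMeans(nn.dist)
}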
1. Base Map Datasets
Chicago Boundary: A base and outline of the City of Chicago.
Chicago Communities: A base and outline of the neighborhoods of Chicago.
Tax Increment Districts: GeoJSON file containing Tax Increment District information.
2. Traffic Data: This analysis looks at the use of shared bicycles, which also suffers from selection bias, such as income bias, when it comes to shared-bike prediction and station allocation.
3. Risk Factors Data: This analysis selects five point/line level features to build the model. See each dataset below:
Traffic Crashes: A point feature class of each traffic crash on city streets within the City of Chicago limits and under the jurisdiction of Chicago Police Department (CPD).
CTA - Bus Routes: Shapefile containing line data for CTA bus routes.
Grocery Stores: A point feature class of the location of Grocery Stores.
Bike Routes: Shapefile containing line data for bike routes.
Speed Camera: A point feature class of the location, first operational date, and approaches of the speed cameras in the City of Chicago.
Census Data: Demographic variables from the 2020 ACS for census tracts in Chicago. This analysis mainly focuses on median household income, racial context, poverty level, and unemployment rate.
# polygon
chicagoBoundary <-
  st_read(file.path(root.dir, "/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271')

# 2021 Bike Usage
BikeUsages <-
  st_read('~/Github/predictive-policing/data/21_trip') %>%
  na.omit() %>%
  st_transform('ESRI:102271') %>%
  st_intersection(., chicagoBoundary)

# Creating a fishnet grid
chicago_fishnet <-
  st_make_grid(chicagoBoundary,
               cellsize = 500,
               square = TRUE) %>%
  .[chicagoBoundary] %>%
  st_sf() %>%
  mutate(uniqueID = 1:n())
chicagoCoummunity <-
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(chicago_fishnet))

# import risk factors
TaxIncrement <-
  st_read("https://data.cityofchicago.org/resource/i7gi-vcr9.geojson") %>%
  st_transform('ESRI:102271') %>%
  st_transform(st_crs(chicago_fishnet)) %>%
  st_join(chicagoBoundary) %>%
  dplyr::select(name) %>%
  na.omit()
Traffic_crash <-
  st_read('https://data.cityofchicago.org/resource/85ca-t3if.geojson') %>%
  dplyr::select(geometry) %>%
  filter(!st_is_empty(.)) %>%
  st_transform('ESRI:102271') %>%
  st_transform(st_crs(chicago_fishnet)) %>%
  mutate(Legend = "Traffic Crashes") %>%
  dplyr::select(geometry, Legend)

Bike_route <-
  st_read('https://data.cityofchicago.org/resource/hvv9-38ut.geojson') %>%
  na.omit() %>%
  st_transform('ESRI:102271') %>%
  st_transform(st_crs(chicago_fishnet)) %>%
  mutate(Legend = "Bike Route") %>%
  dplyr::select(geometry, Legend)

Bus_route <-
  st_read('https://data.cityofchicago.org/resource/pfsx-4n4m.geojson') %>%
  st_transform('ESRI:102271') %>%
  st_transform(st_crs(chicago_fishnet)) %>%
  mutate(Legend = "bus route") %>%
  dplyr::select(geometry, Legend) %>%
  na.omit()

Grocery_Stores <-
  st_read('https://data.cityofchicago.org/resource/53t8-wyrc.geojson') %>%
  st_transform('ESRI:102271') %>%
  st_transform(st_crs(chicago_fishnet)) %>%
  mutate(Legend = "Grocery_Stores") %>%
  dplyr::select(geometry, Legend) %>%
  na.omit()

speed_camera <-
  st_read('https://data.cityofchicago.org/resource/4i42-qv3h.geojson') %>%
  st_transform('ESRI:102271') %>%
  st_transform(st_crs(chicago_fishnet)) %>%
  mutate(Legend = "speed_camera") %>%
  dplyr::select(geometry, Legend) %>%
  na.omit()
# 2022 Biking Data
bikings22 <-
  st_read('~/Github/predictive-policing/data/22_trip') %>%
  na.omit() %>%
  st_transform('ESRI:102271') %>%
  distinct() %>%
  .[chicago_fishnet, ]
# Census data
# View(load_variables(2020, 'acs5', cache = TRUE))
tracts20 <-
  get_acs(geography = "tract",
          variables = c("B01003_001E", "B02001_002E", "B19013_001E",
                        "B25002_001E", "B06012_002E", "B27011_008E"),
          year = 2020, state = "IL", county = "Cook", geometry = T, output = "wide") %>%
  st_transform('ESRI:102271') %>%
  rename(TotalPop = B01003_001E,
         Whites = B02001_002E,
         MedHHInc = B19013_001E,
         TotalUnit = B25002_001E,
         TotalPoverty = B06012_002E,
         TotalUnemployment = B27011_008E) %>%
  dplyr::select(-NAME, -starts_with("B")) %>% # drop the remaining raw ACS columns
  mutate(pctWhite = ifelse(TotalPop > 0, Whites / TotalPop * 100, 0),
         pctPoverty = ifelse(TotalPop > 0, TotalPoverty / TotalPop * 100, 0),
         pctUnemploy = ifelse(TotalPop > 0, TotalUnemployment / TotalPop * 100, 0)) %>%
  dplyr::select(-Whites, -TotalPoverty, -TotalUnemployment, -GEOID) %>%
  st_transform(st_crs(chicago_fishnet))

tracts20.MedHHInc <- tracts20 %>%
  dplyr::select(MedHHInc) %>%
  rename(Legend = MedHHInc)

tracts20.pctWhite <- tracts20 %>%
  dplyr::select(pctWhite) %>%
  rename(Legend = pctWhite)

tracts20.pctPoverty <- tracts20 %>%
  dplyr::select(pctPoverty) %>%
  rename(Legend = pctPoverty)

tracts20.pctUnemploy <- tracts20 %>%
  dplyr::select(pctUnemploy) %>%
  rename(Legend = pctUnemploy)
METHOD
Because of the different scales and forms of the data, this analysis is based on a fishnet grid: the city is broken into square cells, and a model is designed to predict the intensity of biking in each cell from its risk factor data.
The figures below present the biking data in the fishnet, the basic scale at which this analysis works. The frequency of bicycle use in 2022 is significantly higher than in 2021, but the basic distribution is similar, and shared-bike use has become more concentrated.
# Aggregate points to the fishnet
bike_net <-
  dplyr::select(BikeUsages) %>%
  mutate(countbikings = BikeUsages$count) %>%
  aggregate(., chicago_fishnet, sum) %>%
  mutate(countbikings = replace_na(countbikings, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(chicago_fishnet) / 24),
                       size = nrow(chicago_fishnet), replace = TRUE))

bike_net.2 <-
  dplyr::select(bikings22) %>%
  mutate(countbikings = bikings22$count) %>%
  aggregate(., chicago_fishnet, sum) %>%
  mutate(countbikings = replace_na(countbikings, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(chicago_fishnet) / 24),
                       size = nrow(chicago_fishnet), replace = TRUE))
bike_net.plot <-
  ggplot() +
  geom_sf(data = bike_net, aes(fill = log2(countbikings)), color = NA) +
  scale_fill_viridis(option = "C", name = 'Biking Counts', na.value = "#f5f5f5", direction = -1) +
  scale_colour_hue(l = 50, c = 30) +
  labs(title = "Observed Bike Usage Joined to Fishnet",
       subtitle = 'Chicago, IL, 2021\n(log transformed for visualization)',
       caption = 'Figure 2.1') +
  mapTheme()

bike_net.plot2 <-
  ggplot() +
  geom_sf(data = bike_net.2, aes(fill = log2(countbikings)), color = NA) +
  scale_fill_viridis(option = "C", name = 'Biking Counts', na.value = "#f5f5f5", direction = -1) +
  scale_colour_hue(l = 50, c = 30) +
  labs(title = "Observed Bike Usage Joined to Fishnet",
       subtitle = 'Chicago, IL, 2022\n(log transformed for visualization)',
       caption = 'Figure 2.2') +
  mapTheme()
grid.arrange(ncol=2,bike_net.plot,bike_net.plot2)
To build a model that captures more characteristics of the biking pattern, this analysis selects risk factors that, based on common transportation studies, are more or less correlated with biking. These risk factors relate to transportation infrastructure within the City of Chicago (Traffic Crashes, CTA Bus Routes, Grocery Stores, Bike Routes, and Speed Cameras), along with demographic factors (household income, poverty level, unemployment rate, and racial context) fetched from census data.
# All variables in fishnet
vars_net <-
  rbind(Bus_route, Bike_route, speed_camera, Traffic_crash, Grocery_Stores) %>%
  st_join(., chicago_fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
  full_join(chicago_fishnet, by = "uniqueID") %>%
  spread(Legend, count, fill = 0) %>%
  st_sf() %>%
  na.omit() %>%
  dplyr::select(-`<NA>`) %>%
  ungroup()
### Multiple map for feature counts in fishnet
vars_net.long <-
  gather(vars_net, Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <-
    ggplot() +
    geom_sf(data = filter(vars_net.long, Variable == i), aes(fill = value), colour = NA) +
    scale_fill_viridis(option = "C", name = '', na.value = "#f5f5f5") +
    labs(title = i) +
    mapTheme()}

do.call(grid.arrange, c(mapList, ncol = 3, top = "Figure 3.1 Risk Factors by Fishnet\n"))
Most of the variables are relatively evenly distributed in space, except for the bus routes and bike routes, which are concentrated in the northeastern part of the city.
## 3.2 Nearest Neighbor Feature
# convenience to reduce length of function names
st_c <- st_coordinates
st_coid <- st_centroid

## create NN from point data, k = 3
vars_net <- na.omit(vars_net)

vars_net$bikeRoute.nn <-
  nn_function(st_c(st_coid(vars_net)),
              st_c(Bike_route)[, c(1:2)], k = 3)

vars_net$BusRoute.nn <-
  nn_function(st_c(st_coid(vars_net)),
              st_c(Bus_route)[, c(1:2)], # keep only the X/Y columns of the line coordinates
              k = 3)

vars_net$GroceryStore.nn <-
  nn_function(st_c(st_coid(vars_net)),
              st_c(Grocery_Stores), k = 3)

vars_net$SpeedCamera.nn <-
  nn_function(st_c(st_coid(vars_net)),
              st_c(speed_camera),
              k = 3)

vars_net$Traffic_crash.nn <-
  nn_function(st_c(st_coid(vars_net)), st_c(Traffic_crash), k = 3)
## Visualize the nearest three features
vars_net.long.nn <-
  dplyr::select(vars_net, ends_with(".nn")) %>%
  gather(Variable, value, -geometry)

vars <- unique(vars_net.long.nn$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <-
    ggplot() +
    geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill = value), colour = NA) +
    scale_fill_viridis(option = "C", name = '', na.value = "#f5f5f5") +
    labs(title = i) +
    mapTheme()}

do.call(grid.arrange, c(mapList, ncol = 3, top = "Figure 3.2 Nearest Neighbor Risk Factors by Fishnet\n"))
# IV and DVs all in fishnet
chicago_final_net <-
  full_join(bike_net, st_drop_geometry(vars_net), by = "uniqueID")

chicago_final_net <-
  st_centroid(chicago_final_net) %>%
  st_join(dplyr::select(chicagoCoummunity, name), by = "uniqueID", left = TRUE) %>%
  st_join(dplyr::select(TaxIncrement, name), by = "uniqueID", left = TRUE) %>%
  st_drop_geometry() %>%
  left_join(dplyr::select(chicago_final_net, geometry, uniqueID)) %>%
  st_sf()
Now, with the complete set of risk factor features, we explore the correlations between the features and observed biking usage. Figure 5 below visualizes some of these correlations. All of the selected features show some correlation with shared bike usage counts.
# 5. A small multiple scatterplot with correlations
chicago.correlation.long <-
  st_drop_geometry(chicago_final_net) %>%
  dplyr::select(-uniqueID, -cvID, -starts_with('name'), -starts_with('community')) %>%
  gather(Variable, Value, -countbikings)

chicago.correlation.cor <-
  chicago.correlation.long %>%
  group_by(Variable) %>%
  summarize(chicago.correlation = cor(Value, countbikings, use = "complete.obs"))

ggplot(chicago.correlation.long, aes(Value, countbikings)) +
  geom_point(size = 0.1, color = 'darkblue') +
  geom_text(data = chicago.correlation.cor,
            aes(label = paste("r =", round(chicago.correlation, 2))),
            x = -Inf, y = Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "darkred") +
  facet_wrap(~Variable, ncol = 4, scales = "free") +
  labs(title = "Observed Shared Biking Usage as a Function of Risk Factors",
       subtitle = "Chicago, IL\n",
       caption = "Figure 5") +
  plotTheme()
Given that biking occurs only along the road network, it is reasonable that most grid cells contain no trips. The histogram below shows that the dependent variable follows a Poisson-like distribution, indicating a Poisson model should be considered for model building.
# 6. A histogram of dependent variable
ggplot(data = chicago_final_net) +
geom_histogram(aes(x = countbikings), fill = 'darkblue') +
scale_x_continuous(breaks = seq(0, 8, by = 1)) +
labs(title="Histogram of Dependent Variable: Count of Observed Shared Bikes Usage",
subtitle = "Chicago, IL\n",
caption = "Figure 6") +
xlab('Count of Bikings') +
ylab('Count') +
plotTheme()
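As a quick sanity check on the Poisson specification, a sketch of a single citywide Poisson GLM on the nearest-neighbor features (the object name poisson_check is hypothetical; the models reported below are fit inside the cross-validation routine with the same family):

# Sketch: citywide Poisson GLM on the nearest-neighbor features
poisson_check <-
  glm(countbikings ~ bikeRoute.nn + BusRoute.nn + GroceryStore.nn +
        SpeedCamera.nn + Traffic_crash.nn,
      family = "poisson",
      data = st_drop_geometry(chicago_final_net))
summary(poisson_check)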
For each risk factor, I select the feature count in each fishnet cell, along with the distance to biking hot spots and the distance to significant biking hot spots, as model features, chosen to avoid collinearity. I also add Local Moran's I spatial process features based on the result in Question 4. Therefore, we have eleven features in total.
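The code that constructs these Local Moran's I features is not reproduced in this section. A minimal sketch of how they could be built, assuming the spdep package and queen-contiguity weights on the fishnet (the column names biking.isSig and biking.isSig.dist match those referenced in reg.ss.vars below, but the exact implementation in this project may differ):

library(spdep)

# queen-contiguity neighbors and row-standardized weights on the fishnet cells
final_net.nb <- poly2nb(as_Spatial(chicago_final_net), queen = TRUE)
final_net.weights <- nb2listw(final_net.nb, style = "W", zero.policy = TRUE)

# Local Moran's I of the biking counts; column 5 holds the p-value
local_morans <- localmoran(chicago_final_net$countbikings,
                           final_net.weights, zero.policy = TRUE) %>%
  as.data.frame()

# flag cells in statistically significant local clusters (p <= 0.05)
chicago_final_net <- chicago_final_net %>%
  mutate(biking.isSig = ifelse(local_morans[, 5] <= 0.05, 1, 0))

# distance from every cell centroid to the nearest significant cell centroid
chicago_final_net <- chicago_final_net %>%
  mutate(biking.isSig.dist =
           nn_function(st_coordinates(st_centroid(chicago_final_net)),
                       st_coordinates(st_centroid(
                         filter(chicago_final_net, biking.isSig == 1))),
                       k = 1))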
Goodness of fit metrics are generated for four regressions: two that include Just Risk Factors (reg.vars), and two that include the risk factors plus the Local Moran's I spatial process features (reg.ss.vars).
A well-generalized predictive model learns the risk 'experience' at both the citywide and local spatial scales. The best way to test for this is to hold out one local area, train the model on the remaining n - 1 areas, predict for the hold-out, and record the goodness of fit; this cross-validation approach, LEAVE ONE GROUP OUT (LOGO-CV), is used in the analysis. reg.ss.cv performs random k-fold cross-validation using the spatial process features, while reg.ss.spatialCV performs LOGO-CV (spatial cross-validation on neighborhood name) using the same features. The same procedures are also run on the Just Risk Factors specification.
crossValidate <- function(dataset, id, dependentVariable, indVariables) {

  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])

  for (i in cvID_list) {
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")

    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
      dplyr::select(id, geometry, indVariables, dependentVariable)
    fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
      dplyr::select(id, geometry, indVariables, dependentVariable)

    regression <-
      glm(countbikings ~ ., family = "poisson",
          data = fold.train %>%
            dplyr::select(-geometry, -id))

    thisPrediction <-
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))

    allPredictions <-
      rbind(allPredictions, thisPrediction)
  }
  return(st_sf(allPredictions))
}
## define the variables
reg.vars <- c("Bike Route", "bus route", "speed_camera", "Traffic Crashes", "Grocery_Stores")

reg.ss.vars <- c("Bike Route", "bus route", "speed_camera", "Traffic Crashes", "Grocery_Stores",
                 "biking.isSig", "biking.isSig.dist")

### create random k-fold cv
reg.cv <- crossValidate(
  dataset = chicago_final_net,
  id = "cvID",
  dependentVariable = "countbikings",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = cvID, countbikings, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = chicago_final_net,
  id = "cvID",
  dependentVariable = "countbikings",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = cvID, countbikings, Prediction, geometry)

chicago_final_net$name.x <- ifelse(is.na(chicago_final_net$name.x), 0, chicago_final_net$name.x)
### create spatial cross validation
reg.spatialCV <- crossValidate(
  dataset = chicago_final_net,
  id = "name.x", ### !!! <- really important line
  dependentVariable = "countbikings",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = name.x, countbikings, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = chicago_final_net,
  id = "name.x", ### !!! <- really important line
  dependentVariable = "countbikings",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = name.x, countbikings, Prediction, geometry)
# Bind four CVs together
reg.summary <-
  rbind(
    mutate(reg.cv, Error = Prediction - countbikings,
           Regression = "Random k-fold CV: Just Risk Factors"),
    mutate(reg.ss.cv, Error = Prediction - countbikings,
           Regression = "Random k-fold CV: Spatial Process"),
    mutate(reg.spatialCV, Error = Prediction - countbikings,
           Regression = "Spatial LOGO-CV: Just Risk Factors"),
    mutate(reg.ss.spatialCV, Error = Prediction - countbikings,
           Regression = "Spatial LOGO-CV: Spatial Process")) %>%
  st_sf()

# calculate and visualize MAE for each fold
error_by_reg_and_fold <-
  reg.summary %>%
  group_by(Regression, cvID) %>%
  summarize(Mean_Error = mean(Prediction - countbikings),
            MAE = mean(abs(Mean_Error)),
            SD_MAE = mean(abs(Mean_Error))) %>%
  ungroup()
Figure 7.1 below displays the distribution of MAE across all models. Some clustering of small errors can be seen, indicating the models generalize reasonably well across the city. We can also clearly see that adding the Moran's I spatial process features greatly reduces model error.
#### 7.1 Distribution of MAE
error_by_reg_and_fold %>%
  ggplot(aes(MAE)) +
geom_histogram(bins = 30, colour="white", fill = "#810f7c") +
facet_wrap(~Regression) +
geom_vline(xintercept = 0) +
scale_x_continuous(breaks = seq(0, 8, by = 1)) +
labs(title="Distribution of MAE",
subtitle = "k-fold cross validation vs. LOGO-CV\n",
caption = "Figure 7.1",
x="Mean Absolute Error", y="Count") +
plotTheme()
Figure 7.2 provides additional information on the errors across models. It is clear that the error of random k-fold CV is more evenly distributed in space, while the error of LOGO-CV shows stronger spatial structure (large errors in some neighborhoods and near zero in others). We can also see that the k-fold errors are mapped on the original fishnet cells, whereas the LOGO-CV errors are mapped on the neighborhood units. Therefore, the error interpretation of LOGO-CV is more practical.
error_by_reg_and_fold %>%
  ggplot() +
geom_sf(aes(fill = MAE),color = 'transparent',size=0.001) +
facet_wrap(~Regression,ncol=4) +
scale_fill_viridis_c(option = "plasma",alpha = 1) +
labs(title = "Biking Errors by Rondom k-fold and LOGO-CV Regression\n",
caption = 'Figure 7.2') +
mapTheme() +
theme(legend.position="bottom",strip.text.x = element_text(size = 10))
Interestingly, Figure 7.3 below shows that all models over-predict in low biking-rate areas and under-predict in hot spot areas. Under-prediction in higher biking areas may reflect difficulty predicting the hot spots.
st_drop_geometry(reg.summary) %>%
group_by(Regression) %>%
mutate(biking_Decile = ntile(countbikings, 10)) %>%
group_by(Regression, biking_Decile) %>%
summarize(meanObserved = mean(countbikings, na.rm=T),
meanPrediction = mean(Prediction, na.rm=T)) %>%
gather(Variable, Value, -Regression, -biking_Decile) %>%
ggplot(aes(biking_Decile, Value, shape = Variable)) +
geom_point(size = 2) +
geom_path(aes(group = biking_Decile), colour = "black") +
scale_shape_manual(values = c(2, 17)) +
facet_wrap(~Regression) + xlim(0,10) +
labs(title = "Predicted and Observed Bikings by Observed Biking Decile\n",
subtitle = "Chicago, IL\n",
caption = 'Figure 7.3',
x = 'Biking Decile') +
plotTheme()
Table 8 below confirms my assumption that the spatial process features improve the model, and the model appears reasonably robust even under the more conservative LOGO-CV. For reference, the average count of observed biking cases in each cell is 1962.103, while the average count of predicted biking cases per cell is 2008.001 and the average count of spatially predicted biking cases per cell is 2055.131.
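The cell-level averages quoted above are not computed in the code shown here; a minimal sketch of how they could be derived from reg.summary (an assumption about the grouping used):

# Sketch: average observed and predicted counts per cell, by regression type
st_drop_geometry(reg.summary) %>%
  group_by(Regression) %>%
  summarize(mean_observed  = mean(countbikings, na.rm = TRUE),
            mean_predicted = mean(Prediction, na.rm = TRUE))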
st_drop_geometry(error_by_reg_and_fold) %>%
group_by(Regression) %>%
summarize(Mean_MAE = round(mean(MAE, na.rm=T), 2),
SD_MAE = round(sd(MAE, na.rm=T), 2)) %>%
kable(caption = "Table 8. MAE by regression") %>%
kable_styling("striped", full_width = F)
Regression | Mean_MAE | SD_MAE |
---|---|---|
Random k-fold CV: Just Risk Factors | 1365.50 | 1697.67 |
Random k-fold CV: Spatial Process | 810.04 | 1109.45 |
Spatial LOGO-CV: Just Risk Factors | 5051.35 | 9142.35 |
Spatial LOGO-CV: Spatial Process | 2001.91 | 4186.14 |
To be generalizable, the models must produce errors of similar magnitude in both majority white and majority non-white neighborhoods. However, as shown in Table 9, there is a large difference between the errors of the models with and without the spatial process features across white and non-white neighborhoods. Without the spatial process features, the absolute mean error is lower in Majority_White neighborhoods than in Majority_Non_White neighborhoods; with the spatial process features, the absolute mean error is higher in Majority_White neighborhoods than in Majority_Non_White neighborhoods. This points to substantial spatial bias in shared bike usage.
Besides, the mean error is negative in Majority_White neighborhoods and positive in Majority_Non_White neighborhoods. This means that each model, on average, over-predicts in Majority_Non_White neighborhoods and under-predicts in Majority_White neighborhoods. It appears this algorithm does not generalize well with respect to race.
## 9.1 Fetch census data
tracts20.race <-
  get_acs(geography = "tract", variables = c("B01003_001E", "B02001_002E"),
          year = 2020, state = "IL", county = "Cook", geometry = T, output = "wide") %>%
  st_transform('ESRI:102271') %>%
  rename(TotalPop = B01003_001E,
         Whites = B02001_002E) %>%
  mutate(percentWhite = Whites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
  .[chicagoCoummunity, ]
ggplot() +
geom_sf(data = na.omit(tracts20.race), aes(fill = raceContext),color = 'white') +
scale_fill_manual(values = c("#810f7c", "#e0ecf4"), name="Race Context") +
labs(title = "Race Context",
subtitle = "chicago, IL\n",
caption = 'Figure 9') +
mapTheme() +
theme(legend.position="bottom")
reg.summary %>%
  filter(str_detect(Regression, "LOGO")) %>%
  st_centroid() %>%
  st_join(tracts20.race) %>%
  na.omit() %>%
  st_drop_geometry() %>%
  group_by(Regression, raceContext) %>%
  summarize(mean.Error = mean(Error, na.rm = T)) %>%
  spread(raceContext, mean.Error) %>%
  kable(caption = "Table 9. Mean Error by Neighborhood Racial Context") %>%
  kable_styling("striped", full_width = F)
Regression | Majority_Non_White | Majority_White |
---|---|---|
Spatial LOGO-CV: Just Risk Factors | 1294.770 | -1115.539 |
Spatial LOGO-CV: Spatial Process | 176.069 | -323.101 |
In this last section, the analysis examines whether the risk prediction model predicts shared bike usage better than the traditional kernel density model.
After making kernel density maps with three search radii (1000 ft, 1500 ft, and 2000 ft), this analysis selects the 1000 ft radius, since it best displays the density of shared bike usage (see Figure 10.1).
### 10.1 Make Kernel Density Map
bike_ppp <- as.ppp(st_coordinates(BikeUsages), W = st_bbox(chicago_final_net))
bike_KD.1000 <- spatstat.core::density.ppp(bike_ppp, 1000)
bike_KD.1500 <- spatstat.core::density.ppp(bike_ppp, 1500)
bike_KD.2000 <- spatstat.core::density.ppp(bike_ppp, 2000)

bike_KD.df <- rbind(
  mutate(data.frame(rasterToPoints(mask(raster(bike_KD.1000), as(chicagoCoummunity, 'Spatial')))), Legend = "1000 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(bike_KD.1500), as(chicagoCoummunity, 'Spatial')))), Legend = "1500 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(bike_KD.2000), as(chicagoCoummunity, 'Spatial')))), Legend = "2000 Ft."))

bike_KD.df$Legend <- factor(bike_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))
ggplot(data = bike_KD.df, aes(x = x, y = y)) +
geom_raster(aes(fill=layer)) +
facet_wrap(~Legend) +
coord_sf(crs=st_crs(chicago_final_net)) +
scale_fill_viridis_c(option = "plasma",
name = "Density") +
labs(title = "Kernel Density with 3 Different Search Radius Scales",
subtitle = "Biking Cases, 2022, Chicago, IL\n",
caption = 'Figure 10.1') +
mapTheme() + theme(legend.position="bottom")
Next, a comparison map of the risk categories from both model types is generated, with a sample of 2022 shared bike usage points overlaid. A well-fit model should show the highest risk category targeted to the places with the highest density of shared bike usage points. As Figure 10.2 indicates, partly because of the construction of new shared bike stations, both the traditional kernel density model and the risk prediction model capture the 2022 shared biking pattern to some degree. However, neither model is spatially precise.
bike_KDE_sf <- as.data.frame(bike_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(chicago_final_net)) %>%
  aggregate(., chicago_final_net, mean) %>%
  mutate(label = "Kernel Density",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(bikings22) %>% mutate(bikingCount = bikings22$count), ., sum) %>%
      mutate(bikingCount = replace_na(bikingCount, 0))) %>%
  dplyr::select(label, Risk_Category, bikingCount)
##### 10.4 Prediction by Risk Prediction Model
bike_risk_sf <-
  reg.ss.spatialCV %>%
  mutate(label = "Risk Predictions",
         Risk_Category = ntile(Prediction, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(bikings22) %>% mutate(bikingCount = bikings22$count), ., sum) %>%
      mutate(bikingCount = replace_na(bikingCount, 0))) %>%
  dplyr::select(label, Risk_Category, bikingCount)
rbind(bike_KDE_sf, bike_risk_sf) %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(bikings22, 200), size = 0.9, colour = "white") +
facet_wrap(~label) +
scale_fill_viridis_d(option = "plasma",
name = 'Risk Category') +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="Bottom layer is 2022 predicted biking counts.\nDot is observed 2022 biking counts.\n",
caption = 'Figure 10.2') +
mapTheme() + theme(legend.position="bottom")
Therefore, as a further step, I map the counts of shared bike usage in Chicago in 2022. The figure shows that the spatial concentration of shared bike use is more pronounced in 2022, but we can also see some new usage hotspots emerging, driven by the construction of new stations.
grid.arrange(ncol=2,
ggplot() +
geom_sf(data = chicagoBoundary, fill = "#f5f5f5", color = "#f5f5f5") +
geom_sf(data = bikings22, aes(colour = q5(bikings22$count)), size = 1, show.legend = "point") +
scale_colour_brewer(palette ='RdPu')+
geom_sf(data = bikings22, aes(colour = 'white', size=(bikings22$count)), alpha=0.05, show.legend = "point") +
scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
labs(title= "Observed Shared Bike Useages Points",
subtitle = 'Chicago, IL, 2022\n',
caption = 'Figure 10.3.1') +
mapTheme() +
theme(legend.position = "none"),
ggplot() +
geom_sf(data = chicagoBoundary, fill = "#f5f5f5", color = "#f5f5f5") +
stat_density2d(data = data.frame(st_coordinates(bikings22)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_viridis(option = "C", alpha = 0.7) +
scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
labs(title = "Density of Observed Shared Bike Useages Points",
subtitle = 'Chicago, IL, 2022\n',
caption = 'Figure 10.3.2') +
mapTheme() +
theme(legend.position = "none"))
As Figure 11 indicates, the advantage of the risk prediction model is not consistent across risk categories. For risk categories below 70%, the traditional kernel density model is actually more accurate; for the highest category, the risk prediction model is more accurate. This finding implies that the risk prediction model built in this analysis is better at locating hot spots.
rbind(bike_KDE_sf, bike_risk_sf) %>%
st_set_geometry(NULL) %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countBikings = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_bikes = countBikings / sum(countBikings)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_bikes)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_manual(values = c("darkblue", "#e0ecf4"), name="Model")+
labs(title = "Risk Prediction vs. Kernel Density, 2022 Biking Cases",
subtitle = 'Chicago, IL',
caption = 'Figure 11',
x = 'Risk Category',
y = 'Rate of Test Set Bikings') +
plotTheme()
Generally speaking, this model achieves the goal of identifying shared bike usage hotspots in Chicago: it predicts hotspots and identifies areas with no observed shared bike usage at present that have latent shared biking potential. But it is hard to say that the model as a whole works better than the traditional kernel density model, since Figure 11 shows that the two approaches perform better in different risk categories.
Also, one limitation of this model is that it fails to avoid selection bias: it systematically over-predicts shared bike usage in majority non-white neighborhoods, which could mask the inequity in shared bike usage across racial groups.
Another limitation lies in the choice of dependent variable, shared bike usage. In fact, shared bike usage is strongly influenced by the supply of shared bikes (as shown in Figure 10.2), but we can compare the risk prediction model's and the kernel density model's predictions with actual observations to detect the construction of new bicycle stations and the effectiveness of the increased supply.