Geospatial Risk Prediction

I. Introduction

Beginning in 1992 with the publication of the Bike 2000 Plan, the City of Chicago has consistently demonstrated its commitment to enabling and encouraging bicycling as a safe, healthy, environmentally-friendly and fun way to travel. However, from the pricing of bike-sharing to the location of the stations provided, bike-sharing has been facing the issues of EQUITY and SAFETY.

From 2021-2022 alone, the number of bike-sharing stations in Chicago has increased from 828 to 1,481. According to divvy– Chicago’s largest bike-sharing provider, their site selection takes full account of the principles of EQUITY as well as SAFETY. The purpose of this study is to test the reasonableness of site selection. I will use 2021 bicycle use data to predict bicycle use in 2022 (given that people traveling in 2021 receive COVID19, we focus more on the impact of spatial distribution in the study).

We will apply two different spatial forecasting approaches to predict bicycle use a regression forecast incorporating spatial sharpness coefficients and a forecast based on a single risk factor. And then we will compare it to the actual usage in 2022 to see if the increase in stations improves the inequity of bicycle use. We will also obtain the intuitive effect of the spatial influence of the site distribution by a simple calculation of the kernel density.

II. Setup and Data Loading

1. Setup

In this section, we loaded necessary libraries, created plot theme options and map theme options, and identified functions of quintile breaks, and average nearest neighbor distance for further analysis.

2. Data Preparation

1. Base Map Datasets

Chicago Boundary: A base and outline of the City of Chicago.
Chicago Communities: A base and outline of the neighborhoods of Chicago.
Tax Increment Districts: Geojson file contains Tax Increment Districts information.

2. Traffic Data: This analysis looks at the use of shared bicycles that also suffers from selection bias, such as income bias when it comes to shared bike prediction and station allocation.

SharedBike Usage: City-wide bike usages per trip provided by divvy in Chicago from 2015 to 2022, including trip time period and trip start and end points. I aggregated the data by the points and the model will be built based on bike usages in 2021 and be tested by usages in 2022.

3. Risk Factors Data: This analysis selects five point/line level features to build the model. See each dataset below:

Traffic Crashes: A point feature class of each traffic crash on city streets within the City of Chicago limits and under the jurisdiction of Chicago Police Department (CPD).
CTA - Bus Routes : Shapefile containing line data for CTA bus routes.
Grocery Stores: A point feature class of the location of Grocery Stores.
Bike Routes:Shapefile containing line data for Bike Routes.
Speed Camera: A point feature class of the location, first operational date, and approaches of the speed cameras in the City of Chicago.
Census Data: demographic variables from the ACS 2020 for census tracts in Chicago. This analysis mainly focuses on median household income, race context, poverty level, and umployment rate.

# polygon

chicagoBoundary <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271') 

# 2021 Bike Usage
BikeUsages <- 
  st_read('~/Github/predictive-policing/data/21_trip') %>% 
  na.omit() %>%
  st_transform('ESRI:102271') %>%
  st_intersection(.,chicagoBoundary)

# Creating a fishnet grid

chicago_fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>% 
  st_sf() %>%
  mutate(uniqueID = 1:n())

chicagoCoummunity <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(chicago_fishnet)) 

# import risk factors

TaxIncrement <- 
  st_read("https://data.cityofchicago.org/resource/i7gi-vcr9.geojson")  %>%
  st_transform('ESRI:102271') %>%
  st_transform(st_crs(chicago_fishnet)) %>%
  st_join(chicagoBoundary) %>% 
  dplyr::select(name)%>%
  na.omit() 

Traffic_crash <-
  st_read('https://data.cityofchicago.org/resource/85ca-t3if.geojson') %>% 
  dplyr::select(geometry)%>%
  filter(!st_is_empty(.)) %>%
  st_transform('ESRI:102271') %>%
  st_transform(st_crs(chicago_fishnet)) %>%
  mutate(Legend = "Traffic Crashes") %>%
  dplyr::select(geometry, Legend)

Bike_route <-
  st_read('https://data.cityofchicago.org/resource/hvv9-38ut.geojson') %>% 
  na.omit() %>%
  st_transform('ESRI:102271') %>%
  st_transform(st_crs(chicago_fishnet)) %>%
  mutate(Legend = "Bike Route") %>%
  dplyr::select(geometry, Legend)

Bus_route <-
  st_read('https://data.cityofchicago.org/resource/pfsx-4n4m.geojson') %>% 
  st_transform('ESRI:102271') %>%
  st_transform(st_crs(chicago_fishnet)) %>%
  mutate(Legend = "bus route") %>%
  dplyr::select(geometry, Legend) %>%
  na.omit() 

Grocery_Stores <-
  st_read('https://data.cityofchicago.org/resource/53t8-wyrc.geojson') %>% 
  st_transform('ESRI:102271') %>%
  st_transform(st_crs(chicago_fishnet)) %>%
  mutate(Legend = "Grocery_Stores") %>%
  dplyr::select(geometry, Legend)%>%
  na.omit()

speed_camera <-
  st_read('https://data.cityofchicago.org/resource/4i42-qv3h.geojson') %>% 
  st_transform('ESRI:102271') %>%
  st_transform(st_crs(chicago_fishnet)) %>%
  mutate(Legend = "speed_camera") %>%
  dplyr::select(geometry, Legend)%>%
  na.omit()

# 2022 Biking Data
bikings22 <- 
  st_read('~/Github/predictive-policing/data/22_trip') %>% 
  na.omit() %>%
  st_transform('ESRI:102271') %>%
  distinct() %>%
  .[chicago_fishnet,] 

# Census data
# View(load_variables(2020,'acs5',cache = TRUE))
tracts20 <- 
  get_acs(geography = "tract", variables = c("B01003_001E","B02001_002E","B19013_001E","B25002_001E","B06012_002E","B27011_008E"), 
          year=2020, state="IL", county="Cook", geometry=T, output="wide") %>%
  st_transform('ESRI:102271') %>%
  rename(TotalPop = B01003_001E, 
         Whites = B02001_002E,
         MedHHInc = B19013_001E,
         TotalUnit = B25002_001E,
         TotalPoverty = B06012_002E,
         TotalUnemployment =    B27011_008E) %>%
dplyr::select(-NAME, -starts_with("B")) %>% #-starts_with("B") awesome!
  mutate(pctWhite = ifelse(TotalPop > 0, Whites / TotalPop * 100,0),
         pctPoverty = ifelse(TotalPop > 0, TotalPoverty / TotalPop *100, 0),
         pctUnemploy = ifelse(TotalPop > 0, TotalUnemployment / TotalPop *100, 0)
         ) %>%
  dplyr::select(-Whites, -TotalPoverty ,-TotalUnemployment,-GEOID) %>%
  st_transform(st_crs(chicago_fishnet)) 

tracts20.MedHHInc <- tracts20 %>%
  dplyr::select(MedHHInc) %>%
  rename(Legend = MedHHInc)

tracts20.pctWhite <- tracts20 %>%
  dplyr::select(pctWhite)%>%
  rename(Legend = pctWhite)

tracts20.pctPoverty <- tracts20 %>%
  dplyr::select(pctPoverty)%>%
  rename(Legend = pctPoverty)

tracts20.pctUnemploy <- tracts20 %>%
  dplyr::select(pctUnemploy)%>%
  rename(Legend = pctUnemploy)

III. Visualizaion and Analysis

METHOD Because of the different scale and form of the data, this analysis is based on a fishnet grid, breaking the city into square cells to design a model which predicts the probability of biking in each cell based on its risk factor data.

(I) Exploratory Analysis

1. A map of shared biking in 2021, Chicago

Figure 1.1 and 1.2 below show the distribution of Chicago observed biking points in 2021. Obviously, Upper North, North, East have more share bike trips compared to other neighborhoods. Considering those neighborhoods are also known for more tourism resources and commercial land. Potential model might identify areas in a neighborhood where shared bikes are more likely to be used because its social environment such as better-developed economy, gather of commercial spaces, and etc where the selection bias ignoring the actual demand from low-income and minorities might occur.

# 1. A map of biking in 2021, Chicago

grid.arrange(ncol=2,
            plot1<- ggplot() + 
               geom_sf(data = chicagoBoundary, fill = "#f5f5f5", color = '#f5f5f5') +
               geom_sf(data = BikeUsages, aes(colour = q5(BikeUsages$count)), size = 1, show.legend = "point") +
               scale_colour_brewer(palette ='RdPu')+
               scale_alpha(range = c(0.00, 0.35), guide = TRUE) +
               geom_sf(data = BikeUsages, aes(colour = 'white',  size=(BikeUsages$count)), alpha=0.1,  show.legend = "point") +  
               labs(title= "Observed Shared Bike Points",
                    subtitle = 'Chicago, IL, 2021\n',
                    caption = 'Figure 1.1') +
               mapTheme() +
               theme(legend.position = "none") +
               mapTheme(),
             
             ggplot() + 
               geom_sf(data = chicagoBoundary, fill = "#f5f5f5", color = '#f5f5f5') +
               stat_density2d(data = data.frame(st_coordinates(BikeUsages)), 
                              aes(X, Y, fill = ..level.., alpha = ..level..),
                              size = 0.01, bins = 40, geom = 'polygon') +
              scale_fill_viridis(option = "C", alpha = 0.7) +
               scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
               labs(title = "Density of Observed Shared Bike Points",
                    subtitle = 'Chicago, IL, 2021\n',
                    caption = 'Figure 1.2') +
               mapTheme() + 
               theme(legend.position = "none") +
               mapTheme())

2. Comparison between maps of biking joined to the fishnet

Figures presents the biking data in a fishnet which is the basic scale this analysis mainly works at. It can be seen that the frequency of bicycle use in 2022 is significantly higher than in 2021, but the basic distribution is similar.And the use of share bikes has become more centralized.

# Aggregate points to the fishnet
bike_net <- 
  dplyr::select(BikeUsages) %>% 
  mutate(countbikings = BikeUsages$count) %>% 
  aggregate(., chicago_fishnet, sum) %>%
  mutate(countbikings = replace_na(countbikings, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(chicago_fishnet) / 24), 
                       size=nrow(chicago_fishnet), replace = TRUE))

bike_net.2 <- 
  dplyr::select(bikings22) %>% 
  mutate(countbikings = bikings22$count) %>% 
  aggregate(., chicago_fishnet, sum) %>%
  mutate(countbikings = replace_na(countbikings, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(chicago_fishnet) / 24), 
                       size=nrow(chicago_fishnet), replace = TRUE))

bike_net.plot <-
  ggplot() +
  geom_sf(data = bike_net, aes(fill =log2(countbikings)), color = NA) +
  scale_fill_viridis(option = "C",name = 'Biking Counts',na.value = "#f5f5f5", direction = -1)+
  scale_colour_hue(l = 50, c = 30)+
  labs(title = "Observed Shootings Joined to Fishnet",
       subtitle = 'Chicago, IL, 2021\n(log transformed for visualization)',
       caption = 'Figure 2.1') +
  mapTheme() 

bike_net.plot2 <-
  ggplot() +
  geom_sf(data = bike_net.2, aes(fill =log2(countbikings)), color = NA) +
  scale_fill_viridis(option = "C",name = 'Biking Counts',na.value = "#f5f5f5",direction = -1)+
  scale_colour_hue(l = 50, c = 30)+
  labs(title = "Observed Shootings Joined to Fishnet",
       subtitle = 'Chicago, IL, 2022\n(log transformed for visualization)',
       caption = 'Figure 2.2') +
  mapTheme() 
grid.arrange(ncol=2,bike_net.plot,bike_net.plot2)

(II) Risk Factor Exploratory Analysis

To build a model which can capture more characteristics of biking pattern, this analysis selects risk factors that are correlated with biking more or less based on common transportation studies. These risk factors are related to the transportation infrastructures within the City of Chicago, Traffic Crashes, CTA - Bus Routes, Grocery Stores, Bike Routes, Speed Camera and other demographic factors (household income, poverty level, unemployment rate, and racial group) fetched from census data.

3. Map set of risk factors in the fishnet

# All variables in fishnet 

vars_net <- 
 rbind(Bus_route, Bike_route, speed_camera, Traffic_crash, Grocery_Stores) %>%
  st_join(., chicago_fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
  full_join(chicago_fishnet, by = "uniqueID") %>%
  spread(Legend, count, fill=0) %>%
  st_sf() %>%
  na.omit() %>% 
  dplyr::select(-`<NA>`) %>%
  ungroup()

### Multiple map for feature counts in fishnet
vars_net.long <- 
  gather(vars_net, Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
    geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
    scale_fill_viridis(option = "C",name = '',na.value = "#f5f5f5")+
    labs(title=i) +
    mapTheme()}

do.call(grid.arrange,c(mapList, ncol =3, top = "Figure. 3.1 Risk Factors by Fishnet\n"))

It can be seen that most of the variables are relatively evenly distributed spatially, except for bus route and bike route, which accumulate in the northeastern part of the city.

## 3.2 Nearest Neighbor Feature
# convenience to reduce length of function names.
st_c <- st_coordinates
st_coid <- st_centroid

## create NN from point data, k = 3
vars_net <- na.omit(vars_net)

vars_net$bikeRoute.nn <- 
  nn_function(st_c(st_coid(vars_net)), 
              st_c(Bike_route)[,c(1:2)], k = 3)

vars_net$BusRoute.nn <-
           nn_function(st_c(st_coid(vars_net)), 
                                           st_c(Bus_route),
                                           k = 3)

vars_net$GroceryStore.nn <- 
           nn_function(st_c(st_coid(vars_net)), 
                                         st_c(Grocery_Stores), k = 3)
vars_net$SpeedCamera.nn <- 
           nn_function(st_c(st_coid(vars_net)), 
                                 st_c(speed_camera),
                                 k = 3)
vars_net$Traffic_crash.nn <-
           nn_function(st_c(st_coid(vars_net)), st_c(Traffic_crash),k = 3)

## Visualize the nearest three features
vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
  gather(Variable, value, -geometry)

vars <- unique(vars_net.long.nn$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
    geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
    scale_fill_viridis(option = "C",name = '',na.value = "#f5f5f5")+
    labs(title=i) +
    mapTheme() }

do.call(grid.arrange,c(mapList, ncol = 3, top = "Figure 3.2 Nearest Neighbor risk Factors by Fishnet\n"))

# IV and DVs all in fishnet
chicago_final_net <-
  full_join(bike_net, st_drop_geometry(vars_net), by="uniqueID") 

chicago_final_net <-
  st_centroid(chicago_final_net) %>%
  st_join(dplyr::select(chicagoCoummunity, name), by = "uniqueID",  left = TRUE) %>%
  st_join(dplyr::select(TaxIncrement,name), by = "uniqueID",  left = TRUE) %>%
  st_drop_geometry() %>%
  left_join(dplyr::select(chicago_final_net, geometry, uniqueID)) %>%
  st_sf()

(III) Hotpots and Correlation Explorary Analysis

4. Local Moran’s I-related small multiple map of outcome

As a critical part before build a robust model, it is imperative to figure out whether, where and how observed shared bike usages cluster spatially. By leveraging the Local Moran’s I statistic, I locate areas in Chicago where the clustering of biking usages is greater than would be expected with random distribution. In Figure 4 we see that only some areas with high observed biking usages have high clustering (high Moran’s I), such Northeast Chicago only have a very high level of confidence (low P Value) in their Moran’s I. Areas with high shared bike usage counts, clustering, and confidence are likely to be identified as risk hotpot.

With the hotspot identified, I continue engineer two new features for our model: biking.isSig tells whether a cell is in a certain hotspot, and biking.isSig.dist is the distance from a cell to its nearest hotspot. These two features allow us to factor in the spatial clustering of shared bike usage.

## 4.1 spatial weights matrix

chicago_final_net.nb <- poly2nb(as_Spatial(chicago_final_net), queen=TRUE)
chicago_final_net.weights <- nb2listw(chicago_final_net.nb, style="W", zero.policy=TRUE)

chicago_final_net.localMorans <- 
  cbind(
    as.data.frame(localmoran(chicago_final_net$countbikings, chicago_final_net.weights, zero.policy=TRUE)),
    as.data.frame(chicago_final_net)) %>% 
  st_sf() %>%
  dplyr::select(Bike_Count = countbikings, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z != E(Ii))`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
  gather(Variable, Value, -geometry)

vars <- unique(chicago_final_net.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
    geom_sf(data = filter(chicago_final_net.localMorans, Variable == i), 
            aes(fill = Value), colour=NA) +
    scale_fill_viridis(option = "C",name = '',  na.value = "#f5f5f5")+
    labs(title=i,
         caption = "Figure 4. Local Moran's I Statistics, Observed Shooting Cases") +
    mapTheme() + theme(legend.position="bottom") }

do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics, Burglary"))

### Distance to highly significant shooting hotpot
chicago_final_net <-
  chicago_final_net %>% 
  mutate(biking.isSig = 
           ifelse(localmoran(chicago_final_net$countbikings, 
                             chicago_final_net.weights,zero.policy=TRUE)[,5] <= 0.0000001, 1, 0)) %>%
  mutate(biking.isSig.dist = 
           nn_function(st_coordinates(st_centroid(chicago_final_net)),
                       st_coordinates(st_centroid(
                         filter(chicago_final_net, biking.isSig == 1))), 1))

5. A small multiple scatterplot with correlations

Now with our complete set of risk factor features, we explore the correlation between the features and observed biking usages Figure 5 below visualizes some of these correlations. It is intuitive that all the features selected are correlated to shared bike usage case counts.

# 5. A small multiple scatterplot with correlations
chicago.correlation.long <-
  st_drop_geometry(chicago_final_net) %>%
  dplyr::select(-uniqueID, -cvID,-starts_with('name'),-starts_with('community')) %>%
  gather(Variable, Value, -countbikings)

chicago.correlation.cor <-
  chicago.correlation.long %>%
  group_by(Variable) %>%
  summarize(chicago.correlation = cor(Value, countbikings, use = "complete.obs"))

ggplot(chicago.correlation.long, aes(Value, countbikings)) +
  geom_point(size = 0.1, color='darkblue') +
  geom_text(data = chicago.correlation.cor, aes(label = paste("r =", round(chicago.correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "darkred") +
  facet_wrap(~Variable, ncol = 4, scales = "free") +
  labs(title = "Observed Shared Biking Usages as a Function of Risk Factors",
       subtitle = "Chicago, IL\n",
       caption = "Figure 5")+
  plotTheme()

6. A histogram of dependent variable

Given biking occurs in space only on the road, it is reasonable for most grid cells to contain no biking. And based on the histogram below, the distribution of the dependent variable is Poisson distribution, indicating a Poisson model should be considered for model building.

# 6. A histogram of dependent variable

ggplot(data = chicago_final_net) +
  geom_histogram(aes(x = countbikings), fill = 'darkblue') +
  scale_x_continuous(breaks = seq(0, 8, by = 1)) + 
  labs(title="Histogram of Dependent Variable: Count of Observed Shared Bikes Usage",
       subtitle = "Chicago, IL\n",
       caption = "Figure 6") +
  xlab('Count of Bikings') +
  ylab('Count') +
  plotTheme()

(IV) Regression Model Building and Evaluation

For a given risk factor, I selected feature counts in each fishnet and distance to biking hotpots as well as distance to significant biking hotpots as features used in the model to avoid colinearity. I also added Local Moran’s I spatial process features based on the result in Question 4. Therefore, we have eleven features in total.

Goodness of fit metrics are generated for four regressions - two including Just Risk Factors (reg.vars), and a second (reg.ss.vars) includes risk factors plus the Local Moran’s I Spatial Process features created

A well generalized predictive model learns the risk ‘experience’ at both citywide and local spatial scales. The best way to test for this is to hold out one local area, train the model on the remaining n - 1 areas, predict for the hold out, and record the goodness of fit, which is the cross-validation approach called LEAVE ONE GROUP OUT (LOGO-CV) used in the analysis. reg.ss.cv performs random k-fold cross validation using spatial process features, while reg.ss.spatialCV performs LOGO-CV, spatial cross-validation on neighborhood name, using the same features. Same processes are also conducted on Just Risk Factors.

crossValidate<-
function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  for (i in cvID_list) {
    
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    
    regression <-
      glm(countbikings ~ ., family = "poisson", 
          data = fold.train %>% 
            dplyr::select(-geometry, -id))
    
    thisPrediction <- 
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}

## define the variables 
reg.vars <- c("Bike Route","bus route","speed_camera","Traffic Crashes","Grocery_Stores" )

reg.ss.vars <- c("Bike Route","bus route","speed_camera","Traffic Crashes","Grocery_Stores",
                 "biking.isSig" ,"biking.isSig.dist")


###  create random k-fold cv
reg.cv <- crossValidate(
  dataset = chicago_final_net,
  id = "cvID",
  dependentVariable = "countbikings",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = cvID, countbikings, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = chicago_final_net,
  id = "cvID",
  dependentVariable = "countbikings",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = cvID, countbikings, Prediction, geometry)

chicago_final_net$name.x <- ifelse(is.na(chicago_final_net$name.x ), 0, chicago_final_net$name.x)
  
###  create spatial cross validation
reg.spatialCV <- crossValidate(
  dataset = chicago_final_net,
  id = "name.x",                            ### !!! <- really important line
  dependentVariable = "countbikings",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = name.x, countbikings, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = chicago_final_net,
  id = "name.x",                            ### !!! <- really important line
  dependentVariable = "countbikings",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = name.x, countbikings, Prediction, geometry)

# Bind four CVs together

reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = Prediction - countbikings,
                             Regression = "Random k-fold CV: Just Risk Factors"),
                             
    mutate(reg.ss.cv,        Error = Prediction - countbikings,
                             Regression = "Random k-fold CV: Spatial Process"),
    
    mutate(reg.spatialCV,    Error = Prediction - countbikings,
                             Regression = "Spatial LOGO-CV: Just Risk Factors"),
                             
    mutate(reg.ss.spatialCV, Error = Prediction - countbikings,
                             Regression = "Spatial LOGO-CV: Spatial Process")) %>%
    st_sf() 


# calculate and visualize MAE for each fold 
error_by_reg_and_fold <- 
  reg.summary %>%
  group_by(Regression, cvID) %>% 
  summarize(Mean_Error = mean(Prediction - countbikings),
            MAE = mean(abs(Mean_Error)),
            SD_MAE = mean(abs(Mean_Error))) %>%
  ungroup()

7. Model errors by random k-fold and spatial cross validation

7.1 Distribution of MAE

Figure 7.1 below displays the distribution of MAE across all models. Some small errors clustering can be captured, indicating the models perform well in terms of generalizability across the city. Also, we can clearly see that by adding the spatial process of Moran’s I can greatly reduce the error of the model.

#### 7.1 Distribution of MAE
error_by_reg_and_fold %>%
  ggplot(aes(MAE)) + 
  geom_histogram(bins = 30, colour="white", fill = "#810f7c") +
  facet_wrap(~Regression) +  
  geom_vline(xintercept = 0) + 
  scale_x_continuous(breaks = seq(0, 8, by = 1)) + 
  labs(title="Distribution of MAE", 
       subtitle = "k-fold cross validation vs. LOGO-CV\n",
       caption = "Figure 7.1",
       x="Mean Absolute Error", y="Count") +
  plotTheme()

7.2 Visualizes the random k-fold and LOGO-CV errors spatially.

Figure 7.2 provides additional information on the errors across models. It is clearly that the error of random k-fold is more evenly distributed in space while the error of LOGO-CV has stronger spatial characteristics (large error in some neighborhoods and near zero in others). Also, we can see that the error cell of k-fold is based on the original fishnet cell, however the error cell of LOGO-cv is based on the neighborhood’s cell. Therefore, the error interpretation of LOGO-CV is more practical.

error_by_reg_and_fold %>%
  ggplot() +
  geom_sf(aes(fill = MAE),color = 'transparent',size=0.001) +
  facet_wrap(~Regression,ncol=4) +
  scale_fill_viridis_c(option = "plasma",alpha = 1) +
  labs(title = "Biking Errors by Rondom k-fold and LOGO-CV Regression\n",
       caption = 'Figure 7.2') +
  mapTheme() + 
  theme(legend.position="bottom",strip.text.x = element_text(size = 10))

7.3 Predicted biking and observed biking

Interestingly, Figure 7.3 below shows that all models over-predict in low biking-rate areas and under-predict in hot spot areas. Under-prediction in higher biking areas may reflect difficulty predicting the hot spots.

st_drop_geometry(reg.summary) %>%
  group_by(Regression) %>%
  mutate(biking_Decile = ntile(countbikings, 10)) %>%
  group_by(Regression, biking_Decile) %>%
  summarize(meanObserved = mean(countbikings, na.rm=T),
            meanPrediction = mean(Prediction, na.rm=T)) %>%
  gather(Variable, Value, -Regression, -biking_Decile) %>%          
  ggplot(aes(biking_Decile, Value, shape = Variable)) +
  geom_point(size = 2) + 
  geom_path(aes(group = biking_Decile), colour = "black") +
  scale_shape_manual(values = c(2, 17)) +
  facet_wrap(~Regression) + xlim(0,10) +
  labs(title = "Predicted and Observed Bikings by Observed Biking Decile\n",
       subtitle = "Chicago, IL\n",
       caption = 'Figure 7.3',
       x = 'Biking Decile') +
  plotTheme()

8. A table of MAE and standard deviation MAE by regression.

Table 8 below confirms my assumption that the Spatial Process features improve the model. The model appears consistently robust even in conservative LOGO-CV. For your reference, the average count of observed biking cases in each cell is 1962.103, while the average count of predicted biking cases in each cell is 2008.001 and the average count of spatial predicted biking cases in each cell is 2055.131.

st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
  summarize(Mean_MAE = round(mean(MAE, na.rm=T), 2),
            SD_MAE = round(sd(MAE, na.rm=T), 2)) %>%
  kable(caption = "Table 8. MAE by regression") %>%
  kable_styling("striped", full_width = F)

Table 8. MAE by regression
Regression	Mean_MAE	SD_MAE
Random k-fold CV: Just Risk Factors	1365.50	1697.67
Random k-fold CV: Spatial Process	810.04	1109.45
Spatial LOGO-CV: Just Risk Factors	5051.35	9142.35
Spatial LOGO-CV: Spatial Process	2001.91	4186.14

9. A table of raw errors by race context for a random k-fold vs. spatial cross validation regression.

To be generalizable, the models must produce errors on the similar level in both majority white and majority non-white neighborhoods. However, shown in Table 9, there is huge difference between the MAE of model with spatial process and without spatial process in non-white and white neighborhood. The absolute value of MAE without spatial process of Majority_White neighborhoods is lower than non-white neighborhoods. But the absolute value of MAE with spatial process of Majority_White neighborhoods is higher than the non-white neighborhoods. This shows high spatial bias in shared bikes usage.

Besides, we get negative MAE in Majority_White neighborhoods and positive MAE in Majority_Non_White neighborhoods. This means each model on average, over-predicts in Majority_Non_White neighborhoods and under-predicts in Majority_White neighborhoods. It looks like this algorithm does not generalize well with respect to race.

## 9.1 Fetch census data
tracts20.race <- 
  get_acs(geography = "tract", variables = c("B01003_001E","B02001_002E"), 
           year=2020, state="IL", county="Cook", geometry=T, output="wide") %>%
  st_transform('ESRI:102271') %>%
  rename(TotalPop = B01003_001E, 
         Whites = B02001_002E) %>%
  mutate(percentWhite = Whites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
  .[chicagoCoummunity,]

ggplot() + 
  geom_sf(data = na.omit(tracts20.race), aes(fill = raceContext),color = 'white') +
  scale_fill_manual(values = c("#810f7c", "#e0ecf4"), name="Race Context") +
  labs(title = "Race Context",
       subtitle = "chicago, IL\n",
       caption = 'Figure 9') +
  mapTheme() + 
  theme(legend.position="bottom")

reg.summary %>% 
  filter(str_detect(Regression, "LOGO")) %>%
    st_centroid() %>%
    st_join(tracts20.race) %>%
    na.omit() %>%
      st_drop_geometry() %>%
      group_by(Regression, raceContext) %>%
      summarize(mean.Error = mean(Error, na.rm = T)) %>%
      spread(raceContext, mean.Error) %>%
      kable(caption = "Table 9. Mean Error by Neighborhood Racial Context") %>%
        kable_styling("striped", full_width = F)

Table 9. Mean Error by Neighborhood Racial Context
Regression	Majority_Non_White	Majority_White
Spatial LOGO-CV: Just Risk Factors	1294.770	-1115.539
Spatial LOGO-CV: Spatial Process	176.069	-323.101

(IV) Model Alternative and Comparison

10. The map comparing kernel density to risk predictions

In the last section, this analysis is aimed at figuring out if the predictive policing model is able to better predict shared bike usages compared with traditional model - Kernel Density Model.

By making kernel density maps with different search radius scale (1000 Ft, 1500 Ft, and 2000 Ft), this analysis select the one with 1000 Ft search radius since it can better display the density of shared bike usage (see Figure 10.1).

### 10.1 Make Kernel Density Map
bike_ppp <- as.ppp(st_coordinates(BikeUsages), W = st_bbox(chicago_final_net))
bike_KD.1000 <- spatstat.core::density.ppp(bike_ppp, 1000)
bike_KD.1500 <- spatstat.core::density.ppp(bike_ppp, 1500)
bike_KD.2000 <- spatstat.core::density.ppp(bike_ppp, 2000)
bike_KD.df <- rbind(
  mutate(data.frame(rasterToPoints(mask(raster(bike_KD.1000), as(chicagoCoummunity, 'Spatial')))), Legend = "1000 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(bike_KD.1500), as(chicagoCoummunity, 'Spatial')))), Legend = "1500 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(bike_KD.2000), as(chicagoCoummunity, 'Spatial')))), Legend = "2000 Ft.")) 

bike_KD.df$Legend <- factor(bike_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))

ggplot(data = bike_KD.df, aes(x = x, y = y)) +
  geom_raster(aes(fill=layer)) + 
  facet_wrap(~Legend) +
  coord_sf(crs=st_crs(chicago_final_net)) + 
  scale_fill_viridis_c(option = "plasma",
                       name = "Density") +
  labs(title = "Kernel Density with 3 Different Search Radius Scales",
       subtitle = "Biking Cases, 2022, Chicago, IL\n",
       caption = 'Figure 10.1') +
  mapTheme() + theme(legend.position="bottom")

Next, a comparison map is generated of the risk categories for both model types with a sample of shared bike usage in 2022 points overlaid. A strongly fit model should show that the highest risk category is uniquely targeted to places with a high density of shared bike usage points. Indicated from Figure 10.2, due to the construction of new shared bike station, both the traditional model and the predictive policing model can show the shared biking pattern in 2022 more or less. However, none of these two models is accurate and direct spatially.

bike_KDE_sf <- as.data.frame(bike_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(chicago_final_net)) %>%
  aggregate(., chicago_final_net, mean) %>%
  mutate(label = "Kernel Density",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(bikings22) %>% mutate(bikingCount = bikings22$count), ., sum) %>%
      mutate(bikingCount = replace_na(bikingCount, 0))) %>%
  dplyr::select(label, Risk_Category, bikingCount)

##### 10.4 Prediction by Risk Prediction Model
bike_risk_sf <-
  reg.ss.spatialCV %>%
  mutate(label = "Risk Predictions",
         Risk_Category = ntile(Prediction, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(bikings22) %>% mutate(bikingCount = bikings22$count), ., sum) %>%
      mutate(bikingCount = replace_na(bikingCount, 0))) %>%
  dplyr::select(label, Risk_Category, bikingCount)

rbind(bike_KDE_sf, bike_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
  geom_sf(aes(fill = Risk_Category), colour = NA) +
  geom_sf(data = sample_n(bikings22, 200), size = 0.9, colour = "white") +
  facet_wrap(~label, ) +
  scale_fill_viridis_d(option = "plasma",
                       name = 'Risk Category') + 
  labs(title="Comparison of Kernel Density and Risk Predictions",
       subtitle="Bottom layer is 2022 predicted biking counts.\nDot is observed 2022 biking counts.\n",
       caption = 'Figure 10.2') +
  mapTheme() + theme(legend.position="bottom")

Therefore, I do a further step to map the count of shared bike usage in Chicago, 2022.From the figure, we can see that the spatial accumulation of shared bike use is more obvious in 2022, but we can also see the emergence of some new usage hotspots driven by the new site construction.

 grid.arrange(ncol=2,
               ggplot() + 
               geom_sf(data = chicagoBoundary, fill = "#f5f5f5", color = "#f5f5f5") +
               geom_sf(data = bikings22, aes(colour = q5(bikings22$count)), size = 1, show.legend = "point") +
               scale_colour_brewer(palette ='RdPu')+
               geom_sf(data = bikings22, aes(colour = 'white',  size=(bikings22$count)), alpha=0.05,  show.legend = "point") +  
               scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
               labs(title= "Observed Shared Bike Useages Points",
                    subtitle = 'Chicago, IL, 2022\n',
                    caption = 'Figure 10.3.1') +
               mapTheme() +
               theme(legend.position = "none"),
            
               ggplot() + 
               geom_sf(data = chicagoBoundary, fill = "#f5f5f5", color = "#f5f5f5") +
               stat_density2d(data = data.frame(st_coordinates(bikings22)), 
                              aes(X, Y, fill = ..level.., alpha = ..level..),
                              size = 0.01, bins = 40, geom = 'polygon') +
               scale_fill_viridis(option = "C", alpha = 0.7) +
               scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
               labs(title = "Density of Observed Shared Bike Useages Points",
                    subtitle = 'Chicago, IL, 2022\n',
                    caption = 'Figure 10.3.2') +
               mapTheme() + 
               theme(legend.position = "none") +
               mapTheme())

11. The bar plot making this comparison.

Indicated from Figure 11, it is seemingly the advantages of predictive policing model is not consistent across all risk categories. For all risk categories lower than 70%, the traditional Kernel Density model is actually more accurate. For the highest category, the predictive policing risk model is more accurate. This finding might imply that the predictive policing model built in this analysis is better at locating hot spots.

rbind(bike_KDE_sf, bike_risk_sf) %>%
  st_set_geometry(NULL) %>% 
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countBikings = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Rate_of_test_set_bikes = countBikings / sum(countBikings)) %>%
  ggplot(aes(Risk_Category,Rate_of_test_set_bikes)) +
  geom_bar(aes(fill=label), position="dodge", stat="identity") +
  scale_fill_manual(values = c("darkblue", "#e0ecf4"), name="Model")+
  labs(title = "Risk Prediction vs. Kernel Density, 2022 Biking Cases",
       subtitle = 'Chicago, IL',
       caption = 'Figure 11',
       x = 'Risk Category',
       y = 'Rate of Test Set Bikings') +
  plotTheme()

IV. Conclusion

Generally speaking, this model achieves the goal of identifying shared bike useage hotspots in Chicago since it predicts hotspots and it identifies areas with no observed shared bike usage cases at present that have latent shared biking possibility. But it is hard to tell that the overall model works better than the traditional - Kernel Density model, for the figure 11 show that these two kinds of prediction work well in different risk category.

Also, one of the limitations of this model is that it fails to avoid selection bias - it systematically over-predicts shared bike usages in majority non-white neighborhoods. This would cover the inequity in shared bike usage between different races.

Another limitation is clustered in the choice of dependent variable - shared bike usages. In fact, share bike usage is greatly influenced by the supply of shared bikes(showed in Figure 10.2), but we can compare the predictive policing model and the kernel density model’s predictions with the actual observations to detect the construction of new bicycle stations and the effectiveness of the increased supply.