Bike Share Prediction

I. Introduction

Bikeshare and e-scooters are types of micro-mobility, a category of modes of transportation that includes very light, low-occupancy vehicles such as electric scooters (e-scooters), electric skateboards, shared bicycles, and electric pedal-assisted bicycles (e-bikes). As of July 2022, 61 docked bike-share systems open to the general public operated 8,457 docking stations in the U.S. On average, there are 139 docking stations operated by a system, with the largest system (Citi Bike serving New York City, NY; Jersey City, NJ; and Hoboken, NJ) operating over 1,500 stations and 4 systems having 10 or fewer stations.

However, the number of shared bikes is limited, and in order to ensure that customers have access to bikes in a timely manner, citi bike in New York City is currently rebalancing the number of bikes available through a variety of rebalancing methods.

Therefore, how can we measure the rebalancing priority and determine the unmet demands? In this analysis, we will use the Citi bike usage data for April 2021 to try to build a predictive model for shared bike usage to help guide rebalancing. Previous efforts have been put to develop a tracker system to notice which station is full or empty that cannot predict the demand of bikes used but only indicate the lag result. Consequently, this analysis is aimed at exploring the user pattern and building model to indicate rebalancing priority.

The task of rebalancing is so complex that there is an endless list of factors that influence bike share user patterns, including time of day, day of the week, location, amenity, weather, elevation, and so on. This analysis will include those critical features in the model but also add time lag features (from 1 hour to the whole day) to accurately count serial relations.

library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(FNN)
library(gganimate)
library(ggplot2)
library(dplyr)
plotTheme <- function(base_size = 12, title_size = 16) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size, colour = "black"), 
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.minor = element_line("grey90", size = 0.01),
    panel.grid.major = element_line("grey90", size = 0.01),
    panel.border = element_rect(colour = "white" , fill=NA, size=0),
    strip.background = element_rect(color = "white",fill='white'),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 12)
  )
}

mapTheme <- function(base_size = 12, title_size = 16) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    strip.text.x = element_text(size = 12),
    strip.background = element_rect(fill = "white"))
}

# Cross-validate function from chapter 5 (left in chapter)
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  for (i in cvID_list) {
    
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, all_of(indVariables),
                    all_of(dependentVariable))
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, all_of(indVariables),
                    all_of(dependentVariable))
    
    form_parts <- paste0(dependentVariable, " ~ ", paste0(indVariables, collapse = "+"))
    form <- as.formula(form_parts)
    regression <- glm(form, family = "poisson",
                      data = fold.train %>%
                        dplyr::select(-geometry, -id))
    
    thisPrediction <-
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}


palette7 <- c('#084594','#c6dbef','#fec44f','#d95f0e','#9ecae1','#6baed6','#4292c6')
palette5 <- c('#d73027','#f46d43','#fdae61','#fee090','#e0f3f8')
palette4 <- c("#981FAC","#FF006A","#FE4C35","#FE9900")
palette2 <- c('#4575b4', '#fecc5c')

II. Data Loading and Feature Engineering

2.1 Import Census Data

This analysis uses data from the 2020 ACS 5-year estimates. The following demographic variables are selected from ACS 2020 for census tracts in NYC. including: Total population, Median household income , Median age, White population percentage, Travel time, Number of commuters, Means of transportation, Total public transportation

Based on the data, we create to identify white population proportion, to identify average commute time, and to identify the proportion of taking public transportation. Percent_White, Mean_Commute_Time, Percent_Taking_Public_Trans

# Install Census API Key
tidycensus::census_api_key("e79f3706b6d61249968c6ce88794f6f556e5bf3d", overwrite = TRUE)
NYCcensus <-
  get_acs(geography = "tract",
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year=2020, state="New York", geometry=TRUE, output = "wide", county = "New York County") %>% 
  st_transform(26918) %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc =  B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  dplyr::select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

Tracts <- 
  NYCcensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf

# 2021 Bike Usage
BikeUsages <- 
  st_read('~/Github/scooter/21_4nyccctrip') %>% 
  na.omit() %>%
  st_transform(26918) %>%
  st_intersection(.,Tracts) %>%
  rename(ridership = ride_id, Destination.Tract=GEOID)%>%
  filter(ridership<20) 

dat_census <- BikeUsages %>%
  mutate(day_hour= ymd_h(paste(interval60,hour,sep = ' ')))%>%
  filter(month(day_hour)==4)%>%
  as.data.frame() %>%
  select(-geometry)

2.3 Import Weather Data

Import weather data from NYC National Airport using function riem_measures. We mutate Temperature, Precipitation, Wind_Speed on an hourly basis and plot trends respectively from Week 13 to Week 18 in 2020.

weather.Panel <- 
  riem_measures(station = "NYC", date_start = "2021-04-01", date_end = "2021-05-01") %>%
  select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
  mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label=TRUE)) %>%
  group_by(interval60) %>%
  summarize(Temperature = max(tmpf),
            Precipitation = sum(p01i),
            Wind_Speed = max(sknt)) %>%
  mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

2.4 Create Traffic Related Feature

In the early days of bike-sharing use, there was a very strong concern about the safety of use, so the crash data during this time period was used in this section to perform the relevant distribution analysis. However, due to the large difference in order of magnitude between the crash data distribution and the bicycle data set, this metric was not used in the subsequent regression model.

Traffic_crash <-
  st_read('https://data.cityofnewyork.us/resource/h9gi-nx95.geojson') %>% 
  dplyr::select(geometry,crash_date, crash_time)%>%
  dplyr::mutate(ctime = paste(crash_date , crash_time)) %>%
  mutate(interval60 = ymd_hms(ctime)) %>%
  filter(year(interval60) != 2022)%>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label=TRUE),
         Crash_Counter = 1)%>%
  filter(!st_is_empty(.)) %>%
  st_transform(26918) %>%
  st_intersection(.,Tracts)

## Reading layer `h9gi-nx95' from data source 
##   `https://data.cityofnewyork.us/resource/h9gi-nx95.geojson' 
##   using driver `GeoJSON'
## replacing null geometries with empty geometries
## Simple feature collection with 1000 features and 32 fields (with 76 geometries empty)
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -74.20098 ymin: 0 xmax: 0 ymax: 40.90517
## Geodetic CRS:  WGS 84

subway_entrance <-
  st_read("https://data.cityofnewyork.us/resource/he7q-3hwy.geojson") %>% 
  st_transform(26918) %>%
  st_intersection(.,Tracts)

## Reading layer `he7q-3hwy' from data source 
##   `https://data.cityofnewyork.us/resource/he7q-3hwy.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 1000 features and 4 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -74.25114 ymin: 40.51339 xmax: -73.78318 ymax: 40.89014
## Geodetic CRS:  WGS 84

grid.arrange(ncol=3,
ggplot() + 
    geom_sf(data = Tracts, fill = "#f5f5f5", color = "#f5f5f5")+
    stat_density2d(data = data.frame(st_coordinates(Traffic_crash)), 
                   aes(X, Y, fill = after_stat(level), alpha = ..level..),
                        size = 0.1, bins = 40,color ='#fc8d59' ) +
               labs(title = "Density of Observed Crashes",
                    subtitle = 'New York City,  2021\n',
                    caption = 'Figure 2.4.1') +
               mapTheme() + 
               theme(legend.position = "none") +
               mapTheme(),
ggplot() + 
    geom_sf(data = Tracts, fill = "#f5f5f5", color = "#f5f5f5")+
    stat_density2d(data = data.frame(st_coordinates(BikeUsages)), 
                   aes(X, Y, fill = after_stat(level), alpha = ..level..),
                        size = 0.1, bins = 40, color = '#fdcc8a') +
               labs(title = "Density of Shared Bike Points",
                    subtitle = 'New York City, 2021\n',
                    caption = 'Figure 2.4.2') +
               mapTheme() + 
               theme(legend.position = "none") +
               mapTheme(),
ggplot() + 
               geom_sf(data = Tracts, fill = "#f5f5f5", color = "#f5f5f5") +
               geom_sf(data = subway_entrance, size = 1, show.legend = "point",color ='#e34a33') +  
               labs(title= "Subway Station Points",
                    subtitle = 'New York City, 2021\n',
                    caption = 'Figure 2.4.3') +
               mapTheme() +
               theme(legend.position = "none") +
               mapTheme())

2.5 Create Full Space-Time Panel

Here, we create the full panel by summarizing counts by docks for each time interval, keep census info and lat/lon information along for joining later to other data. We remove data for time intervals that are missing values.

study.panel <- 
  expand.grid(day_hour=unique(dat_census$day_hour), 
              end_station_id = unique(dat_census$end_station_id)) %>%
  left_join(., dat_census %>%
              dplyr::select(end_station_id, end_station_name, Destination.Tract, end_lng, end_lat)%>%
              distinct() %>%
              group_by(end_station_id) %>%
              slice(1))


ride.panel <- 
  dat_census %>%
  dplyr::select(-interval60) %>%
  right_join(study.panel) %>% 
  group_by(day_hour, end_station_id, end_station_name, Destination.Tract, end_lng, end_lat) %>%
  summarize(Trip_Count = sum(ridership, na.rm=T)) %>%
  rename(interval60 = day_hour) %>%
  left_join(weather.Panel) %>%
  left_join(Traffic_crash  %>%
  dplyr::select(interval60, Crash_Counter) %>%
              as.data.frame() %>%
              dplyr::select(-geometry)) %>%
  ungroup() %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE),
         Crash = case_when(is.na(Crash_Counter) ~ 0)) %>%
  select(-Crash_Counter) %>%
  filter(is.na(Destination.Tract) == FALSE, is.na(Crash) == FALSE)

ride.panel <- 
  left_join(ride.panel, NYCcensus %>%
              as.data.frame() %>%
              dplyr::select(-geometry), by = c("Destination.Tract" = "GEOID"))

2.6 Create Time Lags

Creating time lag variables will add additional nuance about the demand during a given time period. This analysis create lagHour, lag2Hours, lag3Hours, lag4Hours, lag12Hours, lag1day and to show trip counts hours before and during that day.

ride.panel <- 
  ride.panel %>% 
  arrange(end_station_id, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60) == 94,1,0)) %>%
   mutate(day = yday(interval60))

2.7 Split Training Set and Test Set

ride.Train <- filter(ride.panel, week>= 16)
ride.Test <- filter(ride.panel, week< 16)

III. Exploratory Analysis

3.1 Time Serial Autocorrelation

3.1.1 Bikeshare count by week

We begin by examining the time and frequency components of our data.

First, we look at the overall time pattern - there is clearly a daily periodicity and there are lull periods on weekends. Then, notice that for the test and training set. there is a little bit increase for the trip count in the testing part probably due to the temperature increase, which leads to the necessity to treat temperature as a predictor in our model building.

3.1.2 Bike share trips on weekend vs weekday

Figure 3.1.2.1 shows the plots of bikeshare trip counts by days of the week. We can see that Monday to Friday generally follows the same trend with a small peak at around 8: 00am-9:00am, a distinct peak at around 17:00pm-19:00pm, and then decrease until midnight. While for Saturday and Sunday, bike share trip mostly occur from 10:00am to 17:00pm. Figure 3.1.2.2 indicates the patterns of bike share between weekdays and weekend more clearly. And we can presume that the number of people using shared bikes for leisure trips increase largely on Friday and Saturday. This is because a growth of trip counts on Fridays among weekdays , and an increase of trip counts on Saturdays among weekends. And their use of these two days’ peak is not as obvious.

grid.arrange(ggplot(dat_census %>%
        mutate(dofw = case_when( dotw == 0 ~ "Monday",
                                  dotw == 1 ~ "Tuesday",
                                  dotw == 2 ~ "Wednesday",
                                  dotw == 3 ~ "Thursday",
                                  dotw == 4 ~ "Friday",
                                  dotw == 5 ~ "Saturday",
                                  dotw == 6 ~ "Sunday"))%>%
         group_by(hour,dofw)%>%
         tally(wt = ridership))+
  geom_line(aes(x = hour, y = n, color = dofw),linewidth = 0.7)+
  scale_color_manual(values = palette7,
                          name = 'Weekday') +
  labs(title="Bike share trips in New York City",
       subtitle =  'by day of the week, Apirl, 2021',
       x="Hour", 
       y="Trip Counts")+
     plotTheme(),


ggplot(dat_census%>% 
         mutate(weekend = ifelse(dotw %in% c(6, 5), "Weekend", "Weekday"))%>%
         group_by(hour,weekend)%>%
         tally(wt = ridership))+
  geom_line(aes(x = hour, y = n, color = weekend),linewidth = 0.7)+
  scale_color_manual(values = palette2,
                          name = 'Weekday') +
  labs(title="Bike share trips in New York City",
       subtitle = 'weekend vs weekday, Apirl, 2021',
       x="Hour", 
       y="Trip Counts",
       caption = "Figure 3.1.2 ")+
     plotTheme(), ncol=2)

3.1.3 Hourly Trips per Station

Figure 3.1.3 shows the average hourly bikeshare trips per station in different time periods. Consistent with the findings above, more trips occurs in the Rush hour. Specifically speaking, bike sharings morning rush time (7:00am-10:00am), night rush hour (15:00am-18:00am) and Mid-day (10:00am-15:00pm) are more active than overnight (18:00am-24:00am), when zero or one trip occurred at stations.

dat_census %>%
        mutate(time_of_day = case_when(hour < 7 | hour > 18 ~ "Overnight",
                                 hour >= 7 & hour < 10 ~ "AM Rush",
                                 hour >= 10 & hour < 15 ~ "Mid-Day",
                                 hour >= 15 & hour <= 18 ~ "PM Rush"))%>%
         group_by(interval60, end_station_name, time_of_day) %>%
         tally(wt = ridership)%>%
  group_by(end_station_name, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1, fill ='#fc9272' )+
  labs(title="Mean Number of Hourly Trips Per Station",
  subtitle="New York City, Apirl, 2021",
       x="Number of trips", 
       y="Frequency",
       caption = "Figure 3.1.3 ")+
  facet_wrap(~time_of_day)+
  plotTheme()

3.1.4 Time Lag Exploration

Overall, there was some correlation between time lags and trips. The strongest correlation is for trip counts with a 1-hour lag, R=0.93, followed by trip counts with a 2-hour lag and trip counts with a 3-hour lag, R=0.78, and R=0.60. For trip counts with more than 4 hours and trip counts with a 12-hour lag, the correlation coefficients weaken and the correlation is not significant in a practical sense. Therefore, lags of 4 hours, and 12 hours were excluded from the modeling.

plotData.lag <-
 as.data.frame(ride.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours","lag12Hours",
                                                "lag1day")))


correlation.lag <- as.data.frame(ride.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours","lag12Hours",
                                                "lag1day")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, Trip_Count),2))



ggplot(plotData.lag, aes(Value, Trip_Count)) +
  geom_text(data = correlation.lag, aes(label = paste("R =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "#4575b4") +
  facet_wrap(~Variable) +
  labs(title = "Bikeshare trip counts as a function of time lags",
       subtitle = 'Week 15 in 2021, New York City\n',
       caption = 'Figure 3.1.4') +
  plotTheme()

3.2 Spatial Autocorrelation

The image shows that most of the stations in New York City have trip counts between 0-20 per hour, but there are still a good number of stations with trip counts above 100 and even some stations with trip counts over 300 per hour.

ggplot(dat_census  %>%
         group_by(hour, end_station_name) %>%
         tally(wt = ridership))+
  geom_histogram(aes(n), binwidth = 5, fill ='#fc9272')+
  labs(title="Bike share trips per hr by station",
        subtitle = "New York City, Apirl, 2021",
       x="Trip Counts", 
       y="Number of Stations",
       caption = "Figure 3.2.1 ")+
  plotTheme()

As shown above, weekday’s hourly trip counts are significantly more than weekend’s hourly trip counts, and weekend’s am rush trip counts on weekday are significantly higher than hourly trip counts on weekend, and am rush trip counts on weekend are significantly lower, and by comparing with the geographic location of subway stations, we can predict that share bike use and subway transit may be correlated.

grid.arrange(ggplot()+
  geom_sf(data = Tracts %>%
          st_transform(crs=4326), fill = "#f5f5f5", color = "#f5f5f5")+
  geom_point(data = dat_census %>% 
            mutate(
                weekend = ifelse(dotw %in% c(6, 5), "Weekend", "Weekday"),
                time_of_day =case_when(hour < 7 | hour> 18 ~ "Overnight",
                                 hour >= 7 & hour < 10 ~ "AM Rush",
                                 hour >= 10 & hour < 15 ~ "Mid-Day",
                                 hour >= 15 & hour <= 18 ~ "PM Rush"))%>%
              group_by(end_station_id, end_lat, end_lng, weekend, time_of_day) %>%
              tally(wt = ridership),
            aes(x=end_lng, y = end_lat, color = n), 
            fill = "transparent", alpha = 0.7, size = 2)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "F")+
  ylim(min(dat_census$end_lat), max(dat_census$end_lat))+
  xlim(min(dat_census$end_lng), max(dat_census$end_lng))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station",
       caption = "Figure 3.2.2 ")+
  mapTheme() +
  theme(legend.position = 'bottom'),
  ggplot() + 
               geom_sf(data = Tracts, fill = "#f5f5f5", color = "#f5f5f5") +
               geom_sf(data = subway_entrance, size = 2, show.legend = "point",color ='#e34a33') +  
               labs(title= "Subway Station Points",
                    subtitle = 'New York City, 2021\n',
                    caption = 'Figure 3.2.3') +
               mapTheme() +
               theme(legend.position = "none") +
               mapTheme(),
ncol = 2)

3.3 Space/Time Correlation: Animation on Week15

To show the combined spatial and temporal patterns, I created an animated map of the number of bike-sharing trips per hour at each site for week 15. We can see that there are some sites that have less than one shared bike trip for most of the time. However, there are some sites where the number of trip count has been consistently high.

3.4 Weather Correlation

As indicated above, this analysis includes three weather-related variables: precipitation, temperature and wind speed. This section presents how the three weather features effect bikeshare trip respectively.

3.4.1 Precipitation

During the study period, precipitation in New York by hour is mainly conducted in storm bursts. Figure 3.4.1.2 shows that although the mean trip count is minor, more precipitation significantly reduces the occurrence of bikeshare trips.

library(dplyr)
library(ggplot2)
rbind(ride.Train%>%mutate(Legend = "Training"),ride.Test%>%mutate(Legend = "Testing"))%>%
  ggplot(aes(interval60,Precipitation, colour = Legend)) + geom_line(linewidth = 1) + 
  scale_colour_manual(values = palette2) +
  geom_vline(data = mondays, aes(xintercept = monday), linetype = "dotted", color = 'black') +
  labs(title="Precipitation Across Time\n", 
       caption = 'Figure 3.4.1.1',
       subtitle="Dotted Lines for Each Monday\n",
       x="Hour", y="Perecipitation") + 
  plotTheme()

library(lubridate)
ride.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Precipitation = first(Precipitation)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Precipitation, Trip_Count)) + 
    geom_point(color = '#fecc5c') + geom_smooth(method = "lm", se= FALSE, color = '#4575b4') +
    facet_wrap(~week, ncol=3) + 
    labs(title="Trip Count as A Function of Precipitation by Week\n", 
       caption = 'Figure 3.4.1.2',
         x="Precipitation", y="Mean Trip Count") +
    plotTheme()

3.4.2 Temperature

During the study period, temperature showed distinct fluctuations across hours and days.But in general the temperature is showing an upward trend. The relation between temperature and bike share trips seems to be un-intuitive since sometimes they exhibit positive pattern while sometimes not, which indicating that to set time period features together with this feature is critical to model building.

  rbind(
  mutate(ride.Train, Legend = "Training"), 
  mutate(ride.Test, Legend = "Testing")) %>% 
 
  ggplot(aes(interval60,Temperature, colour = Legend)) + geom_line(linewidth = 1) + 
  geom_vline(data = mondays, aes(xintercept = monday), linetype = "dotted", color = 'black')+
  scale_colour_manual(values = palette2) +
    labs(title="Temperature Across Time\n", 
       caption = 'Figure 3.4.2.1',
       subtitle="Dotted Lines for Each Monday\n",
       x="Hour", y="Temperature") + plotTheme()

ride.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Temperature = first(Temperature)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Temperature, Trip_Count)) + 
    geom_point(color = '#fecc5c') + geom_smooth(method = "lm", se= FALSE,color = '#4575b4') +
    facet_wrap(~week, ncol=3) + 
    labs(title="Trip Count as A Function of Temperature by Week\n", 
       caption = 'Figure 3.4.2.2',
         x="Temperature", y="Mean Trip Count") +
    plotTheme()

3.4.3 Wind Speed

During the study period, wind speed fluctuated across hours per day, but showed similar pattern across days. As shown in Figure 3.4.3.2, in most cases, the wind speed seemingly does not affect the number of bikeshare trip a lot so we exclude it from the model.

  rbind(
  mutate(ride.Train, Legend = "Training"), 
  mutate(ride.Test, Legend = "Testing")) %>% 
 
  ggplot(aes(interval60,Wind_Speed, colour = Legend)) + geom_line(linewidth = 0.7) + 
  scale_colour_manual(values = palette2) +
  geom_vline(data = mondays, aes(xintercept = monday), linetype = "dotted", color = 'black')+
    labs(title="Wind Speed Across Time\n", 
       caption = 'Figure 3.4.3.1',
       subtitle="Dotted Lines for Each Monday\n",
       x="Hour", y="Wind Speed") + plotTheme()

ride.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Wind_Speed = first(Wind_Speed)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Wind_Speed, Trip_Count)) + 
    geom_point(color = '#fecc5c') + geom_smooth(method = "lm", se= FALSE,color = '#4575b4') +
    facet_wrap(~week, ncol=3) + 
    labs(title="Trip Count as A Function of Wind Speed by Week\n", 
       caption = 'Figure 3.4.3.2',
         x="Wind Speed", y="Mean Trip Count") +
    plotTheme()

IV. Model Building and Prediction

This analysis is training models on 3 weeks of data, weeks 13-15, and testing on the preceding 3 weeks, weeks 16-18.

This section firstly creates an initial model reg0 to help determine to rule out the features that are obviously less unrelated to the bikeshare counts. Upon the model, features Percent_White, Wind_Speed are excluded from model building due to irrelevance.

Then, this sections designs four different linear regressions are estimated on ride.Train, each with different fixed effects:

reg1 focuses on just time, including hour and weekday fixed effects, and other critical features
reg2 focuses on just space effects with the station and and other critical fixed effects
reg3 includes both time and space fixed effects.
reg4 adds the time lag features: laghour,lag2hours , lag3hours and lag1day Prediction output is saved in nested-format dataset named week_predictions.

reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  end_station_name + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  end_station_name + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  end_station_name +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag1day, 
     data=ride.Train)

ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

V. Model Evaluation and Validation

5.1 MAE Examination by Model

This section calculates the mean absolute error (MAE) for each model at week 13, week 14, and week 15. In general, model D performs the best among all models.This illustrates the high correlation between bike use and time. But we still need more checks to compare the efficiency of the models.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
  geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
  scale_fill_manual(values = palette5) +
  labs(title = "Mean Absolute Errors by model specification and week",
       caption = "Figure 5.1 ") +
  plotTheme()

5.2 Error Examination -Time

This section includes error checks for the four models in time, space and combination of space and time! Figure 5.2 shows plots of the observed and predicted data and examines how well they match at different times. We can generally conclude that all models tend to underestimate bike-sharing trips during peak periods and overestimate bike-sharing trips during low peak periods. Model D captures more fluctuations in counts compared to the other models.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           end_station_id = map(data, pull, end_station_id)) %>%
    dplyr::select(interval60, end_station_id, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -end_station_id) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
    geom_line(size = 1) + 
    scale_color_manual(values = palette2) +
    facet_wrap(~Regression, ncol=1) +
    labs(title = "Predicted/Observed bike share time series", subtitle = "New York City; A test set of 3 weeks",  x = "Hour", y= "Station Trips",
         caption = "Figure 5.2 ") +
    plotTheme()

5.3 Error Examination - Spatial

In the following part, we will exam the spatial pattern of errors of model 4. Figure 5.3.1 plots the average error by station. The mean error is close to 0-1 in most areas, with higher MAE in areas with higher trip count.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           end_station_id = map(data, pull, end_station_id), 
           end_lat = map(data, pull, end_lat), 
           end_lng = map(data, pull, end_lng)) %>%
    select(interval60, end_station_id, end_lng, end_lat, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags") %>%
  group_by(end_station_id, end_lng,end_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = Tracts%>%st_transform(crs=4326), fill = "#f5f5f5", color = "#f5f5f5")+
  geom_point(aes(x = end_lng, y = end_lat, color = MAE), 
             fill = "transparent", alpha = 0.7, size = 1.5)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "F")+
  ylim(min(dat_census$end_lat), max(dat_census$end_lat))+
  xlim(min(dat_census$end_lng), max(dat_census$end_lng))+
  labs(title="Mean Abs Error, Test Set, Model D",
       caption = "Figure 5.3.1 ")+
  mapTheme()

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           end_station_id = map(data, pull, end_station_id), 
           end_lat = map(data, pull, end_lat), 
           end_lng = map(data, pull, end_lng),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, end_station_id, end_lng, end_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("周日", "周六"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction), alpha = 0.1,color = '#fecc5c')+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "#4575b4")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips",
       caption = "Figure 5.3.2 ")+
  plotTheme()

With Figures 5.3.2 through 5.3.3, the errors for the test set are examined by combining each model in space and time. In general, the MAEs vary relatively independently in time or space, and the MAEs at some of the sites are always higher and lower than others throughout the day, which provides a strong INSIGHT for us to correct the models.

week_predictions %>% 
     mutate(interval60 = map(data, pull, interval60),
           end_station_id = map(data, pull, end_station_id), 
           end_lat = map(data, pull, end_lat), 
           end_lng = map(data, pull, end_lng),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, end_station_id, end_lng, end_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("周日", "周六"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  group_by(end_station_id, weekend, time_of_day, end_lng, end_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = Tracts%>%st_transform(crs=4326), fill = "#f5f5f5", color = "#f5f5f5")+
  geom_point(aes(x = end_lng, y = end_lat, color = MAE), 
             fill = "transparent", size = 2, alpha = 0.7)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "F")+
  ylim(min(dat_census$end_lat), max(dat_census$end_lat))+
  xlim(min(dat_census$end_lng), max(dat_census$end_lng))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set",
       caption = "Figure 5.3.3 ")+
  mapTheme()

Through Figure 5.3.4, we can see that error has a high correlation with all socio-economic variables, and the model has a high prediction error where median household income is higher, where public transit is more developed, and where the white population is higher.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           end_station_id = map(data, pull, end_station_id), 
           end_lat = map(data, pull, end_lat), 
           end_longitude = map(data, pull, end_lng),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Med_Inc = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    select(interval60, end_station_id, end_longitude, 
           end_lat, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("周日", "周六"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(end_station_id, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-end_station_id, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  #geom_sf(data = chicagoCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = value, y = MAE), alpha = 0.4,color = '#fecc5c')+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE, color = '#4575b4')+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)",
       caption = "Figure 5.3.4 ")+
  plotTheme()

5.4 Cross Validation

This section conducts LOGO cross-validation to test goodness of fit on the whole dataset. To save the operation cost, we create folds on week column to test and we also only run cross validation on the reg4.(That is because the dataset is too big and it ate up my computer’s memorry..) If the model generalized well, the distribution of errors would cluster tightly together.

Based on the outputs, shown in Figure 5.4.1, the model do not predict consistently, indicating model do not generalize well.

## This hold out fold is 13 
## This hold out fold is 14 
## This hold out fold is 15 
## This hold out fold is 16 
## This hold out fold is 17 
## This hold out fold is 18

Compared the accuracy between different folders, we can see there is a variance in the MAE. This might due to the different sample size between folders because we did LOGO-cv on the week column. However, we can see the range of values is between 1.6-2.1, which is generally lower than the modelC, which proves the improvement for the accuracy of timelag model.

cvID	MAE	geometry
13	1.629284	MULTIPOLYGON (((583788.1 45…
14	1.940696	MULTIPOLYGON (((583788.1 45…
15	1.950095	MULTIPOLYGON (((583788.1 45…
16	2.053664	MULTIPOLYGON (((583788.1 45…
17	2.136516	MULTIPOLYGON (((583788.1 45…
18	2.094047	MULTIPOLYGON (((583788.1 45…

Table 5.4 Cross-validation Test on Model 4

VI.Conclusion

Through this analysis, we can conclude that by examining the sequence pattern of shared bikes, the occurrence peak on weekends is different from the occurrence peak on weekdays; by examining the spatial pattern of shared bikes, the number of trips is more in the part of shared bike points near metro stations. To take into account the complexity of the characteristics affecting shared bikes, models containing sequence characteristics, spatial characteristics, weather characteristics, and user demographic characteristics were created for this analysis. In all four models created above, the errors are limited, indicating their accuracy to some extent. And, by comparing model 4 with the other models, we can see that the time lag feature improves the model to a great extent.

Despite the imperfections, Model 1 or Model 4 can be used to solve the rebalancing problem because they both capture the effects of time changes on travel. Models 2 and 3, although more data are added to the training, the predictions are not sensitive to the effect of time variation on trips due to the instability of climate-related factors on trip count. From the results, more scheduling resources should be allocated to stations after 10 a.m. and balanced and rebalanced between weekdays and weekends. Spatially, more resources should be dispatched to high-trip count stations to ensure the availability of bicycles.

One of the limitations of this analysis is that although we can roughly speculate on the predictors associated with trip count but still need more systematic research and improvement in the analysis and improvement of the model results. In the future, more features such as traffic characteristics, scheduling resources and bicycle conditions need to be considered to help initiate better rebalancing plans.

Bike Share Prediction

Xinge Zhang

2022-11-02

I. Introduction

II. Data Loading and Feature Engineering

2.1 Import Census Data

2.3 Import Weather Data

2.5 Create Full Space-Time Panel

2.6 Create Time Lags

2.7 Split Training Set and Test Set

III. Exploratory Analysis

3.1 Time Serial Autocorrelation

3.1.1 Bikeshare count by week

3.1.3 Hourly Trips per Station

3.1.4 Time Lag Exploration

3.2 Spatial Autocorrelation

3.3 Space/Time Correlation: Animation on Week15

3.4 Weather Correlation

3.4.1 Precipitation

3.4.2 Temperature

3.4.3 Wind Speed

IV. Model Building and Prediction

V. Model Evaluation and Validation

5.1 MAE Examination by Model

5.2 Error Examination -Time

5.3 Error Examination - Spatial

5.4 Cross Validation

VI.Conclusion

Bike Share Prediction

Xinge Zhang

2022-11-02

I. Introduction

II. Data Loading and Feature Engineering

2.1 Import Census Data

2.2 Import Bike Share Data

2.3 Import Weather Data

2.4 Create Traffic Related Feature

2.5 Create Full Space-Time Panel

2.6 Create Time Lags

2.7 Split Training Set and Test Set

III. Exploratory Analysis

3.1 Time Serial Autocorrelation

3.1.1 Bikeshare count by week

3.1.2 Bike share trips on weekend vs weekday

3.1.3 Hourly Trips per Station

3.1.4 Time Lag Exploration

3.2 Spatial Autocorrelation

3.3 Space/Time Correlation: Animation on Week15

3.4 Weather Correlation

3.4.1 Precipitation

3.4.2 Temperature

3.4.3 Wind Speed

IV. Model Building and Prediction

V. Model Evaluation and Validation

5.1 MAE Examination by Model

5.2 Error Examination -Time

5.3 Error Examination - Spatial

5.4 Cross Validation

VI.Conclusion