• 1 Introduction
    • 1.1 Background
    • 1.2 Use Case
    • 1.3 Definition
  • 2 Exploratary Data Analysis
    • 2.1 Time Distance
      • 2.1.1 Takeaways
    • 2.2 Headway and Bunching
      • 2.2.1 Do buses show similar patterns of bunching regardless of the day?
      • 2.2.2 Maps of mean headway standard deviation by stops
      • 2.2.3 Conclusion
  • 3 Model: Feature Engineering and Exploration
    • 3.1 Model Outcome
    • 3.2 External Stop-Level Variables
      • 3.2.1 Traffic Signals
      • 3.2.2 Ridership
      • 3.2.3 Population, Population Density, and Commuters
      • 3.2.4 Commercial Licenses
      • 3.2.5 Land Use
      • 3.2.6 Trash Day
      • 3.2.7 Geography
      • 3.2.8 Spatial Lag of Stop-Level Variables
    • 3.3 Temporal Variables
    • 3.4 Temporal Timepoint-Level Lags
    • 3.5 Panel Creation: Merging Temporal and Spatial Data
      • 3.5.1 Panel-level variables: Weather
    • 3.6 Other feature associations with the likelihood of bunching
      • 3.6.1 Numeric panel and numeric stop-level lag variables
      • 3.6.2 Categorical variables
    • 3.7 Variable and Model Selection
    • 3.8 Variable Importance
    • 3.9 Result
    • 3.10 Adjusting Threshold
  • 4 Generalizability
  • 5 Conclusion

Return to MUSA 801 Projects page

Visit our project app

This project was done as part of the University of Pennsylvania’s Master of Urban Spatial Analytics (MUSA) Spring 2023 Practicum. We’d like to thank several people who supported and advised us throughout the duration of this project:

  • Andrew Simpson at the Office of Transportation, Infrastructure and Sustainability, for helping us understand the severity of bus bunching in Philadelphia and its impacts on residents, answering all our questions, and providing us with the necessary data sets to build our predictive models

  • Steve Labedz at SEPTA, for all his expertise and advice about the operational aspects of bus bunching

  • Our instructors, Matt Harris and Michael Fichman, for organizing this project, steering us in the right direction, and for their valuable feedback throughout the semester

Our githubs:

Jie’s github
Mia’s github
Yihong’s github

1 Introduction

1.1 Background

As the negative consequences of automobile dependence and climate change are becoming increasingly apparent, it is crucial that city governments prioritize the provision of safe, efficient, equitable, and reliable transit to their residents. In Philadelphia specifically, over half of all transit trips are taken on buses. Additionally, among all income levels, low income residents rely on buses the most. In an ideal scenario, residents are confident that a given bus route will arrive at their bus stop at regular intervals, there is sufficient space on the bus for them to sit or stand comfortably, and they are able to plan their trip without much stress.

Ideal situation (source: https://setosa.io/bus/):


Unfortunately, the reality is quite different. SEPTA buses do not often follow their schedules. They arrive either earlier or later than expected because they are unequally impacted by congestion, disruptions such as illegal parking, and long dwell times due to heavy ridership and long boarding times. This leads to the issue of bus bunching, which is when two consecutive buses arrive at bus stops very soon after one another. This is almost always accompanied by bus gapping, which is when there are long gaps between the arrival of consecutive buses at any given bus stop. As a result, people are delayed getting to their destinations and may also be forced to ride very crowded buses.

Bunched situation (source: https://setosa.io/bus/):


Therefore, it is extremely imperative that bus bunching be tackled by transit agencies and city governments working together. As stated by Jarrett Walker - ‘frequency is freedom’ i.e. people are more likely to choose buses as their preferred mode of transit only if they are guaranteed regular and frequent service.

1.2 Use Case

This project aims to provide reliable and frequent transit service to residents during their commute by predicting and responding to bus-bunching along key transit corridors.

The outcome of our predictions from this analysis is the initiation of bunching i.e. where does bunching begin. Once two buses start to bunch, they will likely continue to bunch at the subsequent stops. In order to prevent bunching, it is most effective to understand where bunching starts to happen because interventions can then be targeted at this point.

1.3 Definition

This project defines some key concepts as follows:

  • Headway: The interval of time between two successive buses moving in the same direction on the same route.

  • Bunching: If the observed headway is half or less than the expected headway.

  • Center City: Area in Philadelphia that is bounded by South St in the South, Spring Garden in the North, 24th St in the West, and Delware Ave/Penn’s Landing in the East.

2 Exploratary Data Analysis

We explored the patterns of 13 bus routes - Route 21 (East-Westbound crossing Center City) and Route 47 (North-Southbound crossing Center City) are shown below as examples. We found similar patterns on other routes as well. Due to the uniqueness each route, we did not aggregate the them together.

#setwd("C:/Users/huyh/Documents/Penn/Spring 2023/MUSA Practicum/otis-corridor/explore-yihong")
timepoints <- read_parquet("../data/runtime/runtimeSeveralRoutesOct.gzip")

# Set to the defualt timezone
timepoints <- force_tz(timepoints, tzone = "UTC")

# include only these routes
route_include <- c(unique(timepoints$routeId))

# Interested date
date <- "2022-10-03"
date2 <- "2022-10-04"

# group data in north and south bound

N.Sbound <- c('17','33','47','48','52','46','6')
E.Wbound <- c('21','56','60','79','G','18')

# group by trip ID and block ID, making southbound distance negative (for plotting later), calculate delayed time between scheduled and actual arrival time in minutes

timepoints <- timepoints |> 
              filter(routeId == "47" | routeId == "21") |>
              group_by(blockId, tripId, routeId) |> 
              mutate(cumDistance_relative = if_else(directionId == 0, cumDistance * -1, cumDistance))

# Add dummy variable that passes center city
## Mutate a dummy variable. If within Center City, then center_city == 1, otherwise 0

timepoints <- timepoints |>
  mutate(center_city = case_when(routeId == "47" & directionId == "0" & toStopSequence %in% 63:76                                  ~ "Yes",  #Sequence within center City Route 47 Southbound
                                 routeId == "47" & directionId == "1" & toStopSequence %in% 26:39                                      ~ "Yes",  #Sequence within Center City Route 47 Northbound
                                 routeId == "21" & directionId == "0" & toStopSequence %in% 48:71                                      ~ "Yes",  #Sequence within Center City Route 21 Eastbound
                                 routeId == "21" & directionId == "1" & toStopSequence %in% 1:29                                       ~ "Yes",  #Sequence within Center City Route 21 Westbound
                                 TRUE ~ "No"))

timepoints <- timepoints |> 
  mutate(direction_text = case_when(routeId == "47" & directionId == "0" ~ "Southbound",
                                    routeId == "47" & directionId == "1" ~ "Northbound",
                                    routeId == "21" & directionId == "0" ~ "Eastbound",
                                    routeId == "21" & directionId == "1" ~ "Westbound"))

timepoints$direction_text <- factor(timepoints$direction_text,levels = c("Northbound","Southbound","Westbound","Eastbound"))

# Create intervals for AMpeak, midday, PMpeak, Evening, and overnight

int.overnight <- interval(ymd_hms(paste(date,"00:00:00")), ymd_hms(paste(date,"06:59:59")))
int.AMpeak <- interval(ymd_hms(paste(date,"07:00:00")), ymd_hms(paste(date,"09:59:59")))
int.midday <- interval(ymd_hms(paste(date,"10:00:00")), ymd_hms(paste(date,"15:59:59")))
int.PMpeak <- interval(ymd_hms(paste(date,"16:00:00")), ymd_hms(paste(date,"18:59:59")))
int.evening <- interval(ymd_hms(paste(date,"19:00:00")), ymd_hms(paste(date2,"00:59:59")))

# Create a list of hourly interval 
Z <- format(as.POSIXct(timepoints$scheduledTripStartTime), "%Y-%m-%d %H")
# Add hourly bin to the main dataset
timepoints$hourly_interval <- Z 

2.1 Time Distance

The following stringlines or time-distance charts display time on the x-axis and cumulative distance from the terminal on y-axis. Each line represents an individual bus moving through the city. The areas where the lines are more flat indicate where buses slow down. Similarly, the closer two lines (buses) are together, greater the likelihood that they are experiencing bunching.

The graphs below show examples of Route 47 (North/Southbound Route) and Route 21 (East/West Route) that travel through Center City during peak hours on Oct 3, 2022. Each node on the strings represent a stop on the route. Stops within Center City are colored red. The spatial pattern of bunching can be clearly seen here.

# Make function for creating several ggplotlys on different intervals

make_stringlines <- function(routeID, interval, interval2, date) {
g <- timepoints |> filter(routeId == routeID) |> 
     filter(scheduledTripStartTime %within% interval) |>
     ggplot(aes(y = cumDistance_relative, x = observedToStopArrivalTime, group = tripId, 
             text = paste("Observed Arrial Time: ", observedToStopArrivalTime,
                          "<br>Stop Name:", toStopName,
                          "<br>Cumlative Distance:", cumDistance))) +
     geom_line(size = 0.5, color = blue3) + geom_point(shape=21, aes(fill = center_city, 
                                                    color = center_city), size = 0.75,
                                                    alpha = 0.8) +
     scale_fill_manual(values = c(red2, blue1), breaks = c("Yes", "No")) +
     scale_color_manual(values = c(red1, blue0), breaks = c("Yes", "No")) +
     #geom_hline(yintercept = c(9000,12758.225)) +
     #scale_y_continuous(breaks = timepoints$cumDistance, labels = timepoints$stop_label)
     scale_x_datetime(date_breaks = "30 min", labels = date_format("%H:%M")) +
     facet_grid(direction_text ~ hourly_interval, scales = "free") +
     xlab("Time") +
     ylab("Distance") +
     labs(fill = "Center City", color = "") +
     theme_light() +
     theme(panel.spacing.y = unit(1, "lines")
     )
     #theme_ipsum(axis_text_size = 9, axis_title_size = 11, axis_title_face = "bold", axis_title_just      = "ct",
     #strip_text_size = 11, ticks = TRUE)

     ggplotly(g, tooltip = "text", height = 600, width = 750) |>
     layout(title = list(text = paste0(str_glue('<b> Route {routeID} Stringline by Trip Start Hour, {interval2} </b>',
                                                '<br><span style="font-size: 15px;">',
                                                'Observed {date}',
                                                '</span>'))),
           margin = list(t = 120),
           legend = list(x = 1.05)
           )
}
make_stringlines("47", int.AMpeak,"AM peak", "Oct.3, 2022")
05000100001500007:0007:3008:0008:3009:00-10000-5000008:3009:0009:3010:0009:3010:0010:3011:00
Center CityNoYes Route 47 Stringline by Trip Start Hour, AM peak Observed Oct.3, 2022TimeDistance2022-10-03 072022-10-03 082022-10-03 09NorthboundSouthbound

AM Northbound: Trips on Route 47 Northbound during AM peak hours showed variation in speed as they exited center city. At certain time intervals (between 8:00 - 8:30 am; 9:00 - 9:30 am; 10:00 - 10:30 am) bunching built up as one bus slowed down and another bus either kept the same speed or sped up.

AM Southbound: Trips on Route 47 Southbound also showed speed variation in Center City. In addition, buses seemed to slow down for a while after passing particular stops. Please hover your cursor on the plot where the strings show a constant “glitch” horizontally. These bottlenecks were at 5th St & Wyoming Ave, 5th St & Bristol St. 

make_stringlines("47", int.PMpeak,"PM peak", "Oct.3, 2022")
05000100001500016:0016:3017:0017:30-10000-5000017:3018:0018:3019:0019:30
Center CityNoYes Route 47 Stringline by Trip Start Hour, PM peak Observed Oct.3, 2022TimeDistance2022-10-03 162022-10-03 172022-10-03 18NorthboundSouthbound

PM Northbound: Trips on Route 47 Northbound during PM peak hours showed the same pattern as AM Southbound in terms of speed variation in Center City and the location of bottlenecks.

PM Sorthbound: Trips on Route 47 Southbound during PM peak hours showed the same pattern as AM Southbound in terms of speed variation in Center City and the location of bottlenecks.

make_stringlines("21", int.AMpeak,"AM peak", "Oct.3, 2022")
030006000900007:0007:3008:0008:30-10000-7500-5000-2500008:0008:3009:0009:3009:0009:3010:0010:30
Center CityNoYes Route 21 Stringline by Trip Start Hour, AM peak Observed Oct.3, 2022TimeDistance2022-10-03 072022-10-03 082022-10-03 09WestboundEastbound
make_stringlines("21", int.PMpeak,"PM peak", "Oct.3, 2022")
030006000900016:0016:3017:0017:30-10000-7500-5000-2500017:3018:0018:3018:3019:0019:30
Center CityNoYes Route 21 Stringline by Trip Start Hour, PM peak Observed Oct.3, 2022TimeDistance2022-10-03 162022-10-03 172022-10-03 18WestboundEastbound

The effect of Center City is more obvious along Route 21 during AM peak hours. As buses entered or exited Center City, especially in the busiest region (between 11th and 20th St), bunching is severe.

2.1.1 Takeaways

Bunching was induced by a change in speed, and this change in speed occurred the most when entering or exiting Center City. Moreover, a few bottlenecks were identified along the route where buses always slowed down after passing those points.

This finding informs us that spatial factors contribute to the initiation of bunching.

2.2 Headway and Bunching

The standard deviation of headway is a common metric used to understand bunching. If buses pperate according to schedule, they should have consistent headways with little variation or, in other words, low standard deviation. However, when there is bunching or gapping taking place at a bus stop, the headways become inconsistent. This variation results in a higher standard deviation.

The standard deviation of headway for routes 21 and 47 are visualised below with the help of heatmaps. The x-axis displays the stop sequence of the specific route and the y-axis displays the observed hour of the day from 0 - 23. The color of the each square indicates the standard deviation of headway at that stop. The brighter the color, the higher the variance at that stop.

The headway in this case is equal to the difference between two consecutive buses’ observed arrival time at a stop. The standard deviation of headway is aggregated by observed hour, taking the average headway of 10 days at each hour.

To compare the relationship of headway variance with bunching, we also calculated the amount of bunching (as a percentage) taking place at each stop for routes 21 and 47. This is visualized below with heatmaps as well. As mentioned before, bunching is defined as less than half of the scheduled headway per stop.

We cleaned the data so it only contains data points where the headway is less than 60 minutes. A headway of greater than 60 minutes is likely to be the result of bus detouring or a data collection error.

The heatmaps below show headway variance and percent bunched along Route 47 and Route 21.

# #loading speed data and processed hourly bunching data
# block_trip <- timepoints |> ungroup() |> distinct(routeId, blockId, tripId)
# stop_dist <- timepoints  |> ungroup() |> filter(directionId == 0) |> distinct(routeId, toStopId, cumDistance)
stop_seq <- timepoints  |> ungroup() |> distinct(routeId, direction_text, toStopId, toStopName,toStopSequence)

#create a list of hourly interval
timepoints$scheduledHour <- hour(timepoints$scheduledToStopArrivalTime)
timepoints$observedHour <- hour(timepoints$observedToStopArrivalTime)

headway_route_date <- timepoints |>
  arrange(routeId, serviceDate, direction_text, toStopId, toStopName,observedToStopArrivalTime) |> 
  group_by(routeId, serviceDate, direction_text, toStopId, toStopName) |> 
  mutate(temp = as.numeric(as.POSIXct(observedToStopArrivalTime)),
         headwayObserved = (temp - lag(temp))/60) |> 
  dplyr::select(routeId, tripId, serviceDate, direction_text, toStopId, toStopName, observedHour,
         scheduledToStopArrivalTime, observedToStopArrivalTime,
         headwayObserved) |> 
  filter(!is.na(headwayObserved)) |>
  arrange(routeId, serviceDate, direction_text, toStopId, toStopName,scheduledToStopArrivalTime) |> # Scheduled Headway
  group_by(routeId, serviceDate, direction_text, toStopId, toStopName) |> 
  mutate(temp_exp = as.numeric(as.POSIXct(scheduledToStopArrivalTime)),
         headwayScheduled = (temp_exp - lag(temp_exp))/60) |> 
  dplyr::select(routeId, tripId, serviceDate, direction_text, toStopId, toStopName, observedHour,
         scheduledToStopArrivalTime, observedToStopArrivalTime,
         headwayScheduled, headwayObserved) |> 
  filter(!is.na(headwayScheduled)) |>
  filter(headwayScheduled < 60,
         headwayObserved < 60) |> # Filter out scheduled headway more than 60 minutes
  mutate(bunched = ifelse(headwayObserved < headwayScheduled/2, 1, 0)) # Bunching defined by less than scheduled headway

#headway SD transform
headwayVarBun_route_date <- headway_route_date |> 
  # filter(serviceDate == "2022-10-03 20:00:00") |> 
  group_by(routeId, observedHour, direction_text, toStopId) |> 
  summarise(varHeadwayObserved = round(sd(headwayObserved), digits = 3),
            bunching_pct = round(sum(bunched)/n()*100,digits = 2)) |>
  filter(!is.na(varHeadwayObserved))

headwayVar_route_date <- headway_route_date |> 
  # filter(serviceDate == "2022-10-03 20:00:00") |> 
  group_by(routeId, observedHour, direction_text, toStopId, toStopName) |> 
  summarise(varHeadwayObserved = round(sd(headwayObserved), digits = 3),
            bunching_pct = round(sum(bunched)/n()*100,digits = 2)) |>
  filter(!is.na(varHeadwayObserved))
# create headway var ggploly function 
DrawHeadVarPlotly <- function(routeID){
g <-  
  headwayVarBun_route_date |>
  left_join(stop_seq, by = c("routeId", "direction_text", "toStopId")) |> 
  filter(routeId == routeID) |> 
  ggplot(aes(toStopSequence, observedHour, fill = varHeadwayObserved, 
             text = paste("Observed Hour: ", observedHour,
                          "<br>Stop Name:", toStopName,
                          "<br>Headway Variance:", varHeadwayObserved,
                          "<br>Percent Bunched:", bunching_pct))) + 
    geom_tile(color = "white", size = 0.1) +
    scale_y_continuous(trans = "reverse", 
                       breaks = unique(headwayVarBun_route_date$observedHour)) +
    scale_fill_viridis_c(name = "Headway\nstandard deviation",
                         limits = c(0, 20),oob = scales::squish) +
    facet_grid(rows = "direction_text") + 
    xlab("Stop Sequence") +
    ylab("Observed Hour") +
    theme_light() +
    theme(legend.position = "bottom",
          legend.title.align = 0.5)
  
ggplotly(g,tooltip = "text", height = 680, width = 1200) |> 
layout(title = list(text = paste0(str_glue('<b> Headway variance analysis by stop and hour, Route {routeID} </b>',
                                           '<br><span style="font-size: 15px;">',
                                           'Aggregated by observed hour during 2022-10-02 - 2022-10-13',
                                           '</span>'))),
           legend = list(x = 1.05),
            margin = list(t = 100)
          )
           
}

DrawHeadBunchPlotly <- function(routeID){
g <-  
  headwayVarBun_route_date |>
  left_join(stop_seq, by = c("routeId", "direction_text", "toStopId")) |> 
  filter(routeId == routeID) |>
  ggplot(aes(toStopSequence, observedHour, fill = bunching_pct, 
             text = paste("Observed Hour: ", observedHour,
                          "<br>Stop Name:", toStopName,
                          "<br>Headway Variance:", varHeadwayObserved,
                          "<br>Percent Bunched:", bunching_pct))) + 
    geom_tile(color = "white", size = 0.1) +
    scale_y_continuous(trans = "reverse", 
                       breaks = unique(headwayVarBun_route_date$observedHour)) +
    scale_fill_viridis_c(name = "Percent Bunched",
                         limits = c(0, 50),oob = scales::squish) +
    facet_grid(rows = "direction_text") + 
    xlab("Stop Sequence") +
    ylab("Observed Hour") +
    theme_light() +
    theme(legend.position = "bottom",
          legend.title.align = 0.5)
  
ggplotly(g,tooltip = "text", height = 680, width = 1200) |> 
layout(title = list(text = paste0(str_glue('<b> Bunching analysis by stop and hour, Route {routeID} </b>',
                                           '<br><span style="font-size: 15px;">',
                                           'Aggregated by observed hour during 2022-10-02 - 2022-10-13',
                                           '</span>'))),
           legend = list(x = 1.05),
            margin = list(t = 100)
          )
           
}
DrawHeadVarPlotly("47")
01234567891011121314151617181920212223025507510001234567891011121314151617181920212223
05101520Headwaystandard deviation Headway variance analysis by stop and hour, Route 47 Aggregated by observed hour during 2022-10-02 - 2022-10-13Stop SequenceObserved HourNorthboundSouthbound
DrawHeadBunchPlotly("47")
01234567891011121314151617181920212223025507510001234567891011121314151617181920212223
01020304050Percent Bunched Bunching analysis by stop and hour, Route 47 Aggregated by observed hour during 2022-10-02 - 2022-10-13Stop SequenceObserved HourNorthboundSouthbound
DrawHeadVarPlotly("21")
01234567891011121314151617181920212223020406001234567891011121314151617181920212223
05101520Headwaystandard deviation Headway variance analysis by stop and hour, Route 21 Aggregated by observed hour during 2022-10-02 - 2022-10-13Stop SequenceObserved HourWestboundEastbound
DrawHeadBunchPlotly("21")
01234567891011121314151617181920212223020406001234567891011121314151617181920212223
01020304050Percent Bunched Bunching analysis by stop and hour, Route 21 Aggregated by observed hour during 2022-10-02 - 2022-10-13Stop SequenceObserved HourWestboundEastbound

In general, route 47 moving Northbound has the highest variance within the Center City region and the same route moving Southbound has inconsistent variance when entering in South Philly.

The headway of Route 21 is more consistent compared to route 47. The highest variance is seen near the bridge crossing Schuylkill River (21st to 24th St).

These heatmaps demonstrate that spatial factors might be more strongly correlated to bunching than temporal factors. Regardless of the observed hour, the heatmaps show that certain stops of a route in each direction always have high variance, implying high levels of bunching and gapping. These stops can be considered bottlenecks in the route. (Note that the consistent colored chunk shown in Northbound route 47 is probably due to data error).

For example, the plot for route 47 moving Northbound shows a section of bright color starting from 7th St & Bainbridge St, entering into Center City, to 7th St & Pine. A similar section is seen from 7th & Spring Garden, exiting Center City, to 7th St & Brown St. These bottlenecks correspond to the findings in the time-distance section as well.

Along route 21, the bottlenecks are easier to identify - Walnut St & 23rd in the Westbound direction, and Chestnut & 21st in the Eastbound direction.

Bottlenecks are therefore the locations where bunching usually starts to build up. The heatmaps provide a direct comparison between headway variance and amount of bunching taking place, and we can see that bunching starts to occur right after the stops that report high headway variance. High headway variance indicates that buses come at very different speeds when arriving at a given bus stop.

In short, the headway of a bus is strongly associated with bunching and the initiation of bunching.

2.2.1 Do buses show similar patterns of bunching regardless of the day?

We considered whether the high variance depicted in the graphs above is observed only on certain days or everyday. If the pattern is noticed everyday, we can reasonably conclude that the particular stop is indeed a bottleneck.

The following heatmaps display stop sequence on the x-axis, and day of the week on the y-axis. The color of the squares depicts the standard deviation of headway. The brighter the color, the higher the variance.

# Add a column with the day of the service date
headway_route_date$day <- day(headway_route_date$serviceDate)

# Filter out only 8 am and calculate headway variance
headwayVar_route_date_8am <- headway_route_date |> 
  filter(observedHour == 8) |>
  # filter(serviceDate == "2022-10-03 20:00:00") |> 
  group_by(routeId, observedHour, day,direction_text, toStopId, toStopName) |> 
  summarise(varHeadwayObserved = round(sd(headwayObserved), digits = 3))

# create a function to draw ggplotly by day

DrawHeadVarPlotlyDay <- function(RouteID){ 
g <-  
 headwayVar_route_date_8am |> 
 left_join(stop_seq, by = c("routeId", "direction_text", "toStopId")) |> 
 filter(routeId == RouteID) |> 
 ggplot(aes(x = toStopSequence, y = day, fill = varHeadwayObserved,
           text = paste("Observed Day: ", day,
                        "<br>Stop Name:", toStopName.y,
                        "<br>Headway Variance:", varHeadwayObserved))) +
 geom_tile(color = "white", size = 0.1) +
 scale_y_continuous(trans = "reverse",
                    breaks = unique(headwayVar_route_date_8am$day)) +
 scale_fill_viridis_c(name = "Headway\nstandard deviation",
                      limits = c(0, 30),oob = scales::squish) +
 facet_grid(rows = "direction_text") +
 xlab("Stop Sequence") + 
 ylab("Day 2 - 13") +
 theme_light() +
 labs(title = str_glue(""),
      subtitle = "8 AM") +
 theme(legend.position = "bottom",
       legend.title.align = 0.5,
       title = element_text(size = 14, face = "bold"))
  
ggplotly(g,tooltip = "text", height = 450, width = 1100) |> 
layout(title = list(text = paste0(str_glue('<b> Route {RouteID} Headway variance analysis Oct 2 - 13, 2022 </b>',
                                           '<br><span style="font-size: 15px;">',
                                           'Observed Hour: 8 AM',
                                           '</span>'))),
       legend = list(x = 1.05),
       margin = list(t = 100))
}

DrawHeadVarPlotlyDay("47")
23456910111213025507510023456910111213
0102030Headwaystandard deviation Route 47 Headway variance analysis Oct 2 - 13, 2022 Observed Hour: 8 AM Stop Sequence Day 2 - 13 NorthboundSouthbound
DrawHeadVarPlotlyDay("21")
23456910111213020406023456910111213
0102030Headwaystandard deviation Route 21 Headway variance analysis Oct 2 - 13, 2022 Observed Hour: 8 AM Stop Sequence Day 2 - 13 WestboundEastbound
# ggsave("../presentation-viz/Presentation1/slide8-var-by-day-21.png", height = 12, width = 15, dpi = 300)

The heatmaps above confirm the existence of bottlenecks. Regardless of the day, there is a high variance in headway at certain stops at 8 AM. For example, 7th St & Shark and 7th St & South St for route 47 moving Northbound; Thompson St & Franklin St and 8th St & Fitzwater for route 47 moving Southbound; Walnut St & 24th St along route 21moving Westbound; Chestnut St & 43rd St for Route 21 moving Eastbound.

2.2.2 Maps of mean headway standard deviation by stops

Where is bunching happening? The headway variance of each stop is aggregated as an average at each stop over a period of two weeks (Oct 2 - Oct 13, 2022).

headwayVar_route_date_stops <-
  headwayVar_route_date |>
  group_by(routeId, toStopId,toStopName) |>
  summarize(avgObservedHeadwayVar = mean(varHeadwayObserved))

# Read in stops 
stops <- st_read("../data/other/stops.geojson") |> 
  dplyr::select(LineAbbr, DirectionN, StopId,StopName, geometry) |> 
  filter(LineAbbr %in% route_include) |> distinct()

headwayVar_route_date_stops <-
  left_join(headwayVar_route_date_stops, stops, by = c("toStopName" = "StopName")) |> distinct(toStopName, .keep_all = TRUE) |> st_as_sf()


# load philly road network map grabbed by osm
philly_road <- read_sf("../data/philly_road.geojson")



# create function for map viz of headway variance
makeStopHeadVarMap <- function(routeID){
g <- ggplot()+
  geom_sf(data = philly_road, color = "grey50") +
  geom_sf(data = filter(headwayVar_route_date_stops, routeId == routeID), 
          aes(color = avgObservedHeadwayVar,
              text = paste("Stop Name:", toStopName,
                        "<br>Average Headway Variance:", avgObservedHeadwayVar)),
          size = 0.8) +
 # coord_sf(xlim = c(-75.20, -75.12), ylim = c(39.91, 40.048), expand = FALSE) +
  scale_color_viridis_c(name = "Mean headway\nstandard deviation",
                         limits = c(0, 15),oob = scales::squish) +
  labs(title = str_glue(""),
       subtitle = "") +
  theme_light() +
  theme(axis.text = element_blank(),
        panel.background = element_rect(fill = "white"),
        panel.grid = element_line(color = "transparent"))

ggplotly(g,tooltip = "text", height = 650, width = 800) |> 
layout(title = list(text = paste0(str_glue('<b> Headway variance analysis by stop, Route {routeID} </b>',
                                           '<br><span style="font-size: 15px;">',
                                           'Mean Headway SD Oct 2 - Oct 13',
                                           '</span>'))),
       legend = list(x = 1.05),
       margin = list(t = 100))
}
makeStopHeadVarMap("47")
051015Mean headwaystandard deviation Headway variance analysis by stop, Route 47 Mean Headway SD Oct 2 - Oct 13
makeStopHeadVarMap("21")
051015Mean headwaystandard deviation Headway variance analysis by stop, Route 21 Mean Headway SD Oct 2 - Oct 13

These maps show the effect of moving through Center City again. The headways of route 47 buses vary the most within and around the Center City area. Similarly, the headways of route 21 buses vary the most when entering and leaving Center City, with some variation in the West Philly residential area as well.

2.2.3 Conclusion

The above analysis of the standard deviation of headway standard deviation highlights the existing bottlenecks along two routes. These bottleneck stops seem to be near the Center City border, with a few also located in the residential area (the densest region) in West and South Philly.

This analysis shows the strong impact of temporal and spatial factors on the rate of bunching and initiation of bunching.

In our models, we will explore spatial variables as fixed-effects/stop-level variables, since they capture the characteristics of certain stops and their surrounding areas. We will also explore temporal variables, including the hour of operation, the day of each arrival, and the weather. Lastly, we will examine temporal and spatial lag variables, assuming bus bunching is highly related to the speed, headway, and lateness of the previously dispatched buses.

3 Model: Feature Engineering and Exploration

3.1 Model Outcome

The unit of analysis for our models is each bus arrival at a particular stop, for a specific route moving in a particular direction, and at a given time between the weekdays of 10/03 - 10/15.

As seen in the earlier graphs, bunching tends to persist i.e. when buses start to bunch at any given stop, they continue to bunch at following stops. Due to this, the dataset contains a much higher share of bunching instances than non-bunching instances. If we were to predict bunching vs. no bunching with this dataset, any models we create would be very accurate, but not necessarily very useful.

To tackle this, we decided to predict where bunching begins i.e. the initiation of bunching. To do this, we calculated the difference between the expected headway of every two consecutive buses arriving at a bus stop and the observed headway of the same. We also created another variable that indicated whether the bus stop for each arrival is the starting point of bunching or not. For our final model, we subsetted the original dataset to contain only the bus arrivals that are non-bunched and bus arrivals that are the starting point of bunching. Ultimately, we want the model to successfully identify whither the arrival would initiate bunching or not. This approach lends itself to a more useful model because it will identify where exactly interventions need to be targeted.

3.2 External Stop-Level Variables

Besides engineering existing variables in the original dataset, we also wrangled other external variables that might show be correlated to bunching.

3.2.1 Traffic Signals

Intersections with traffic signals compared to intersections without traffic signals could slow down traffic due to vehicles stopping when hitting a red light. Slower traffic could lead to more bunching.

We created a 100 feet buffer around bus stops. We used these buffers to create a dummy variable indicating whether a traffic signal was within 100 feet of a bus stop or not.

#stops_dir <- timepoints |> dplyr::distinct(routeId, directionId, toStopId, toStopName)

# Load stop: geometries and directions
stop_dir <- read.csv("../data/other/stops_dir.csv", colClasses=c("LineAbbr"="character")) |> 
  dplyr::select(-StopName, -Sequence)

stops <- st_read("../data/other/stops.geojson") |> 
  filter(LineAbbr %in% route_include) |> 
  distinct(LineAbbr, DirectionN, StopId, .keep_all = T) |> 
  left_join(stop_dir, by = c("LineAbbr", "DirectionN", "StopId")) |> 
  rename(routeId = LineAbbr, toStopSequence = Sequence, 
         directionName = DirectionN, toStopId = StopId, toStopName = StopName) |> 
  dplyr::select(routeId, toStopSequence, directionId, directionName, toStopId, toStopName) |> 
  st_transform(2272)


# Load traffic signal data
traffic_sig <- st_read("../data/other/traffic_signals.geojson") |> st_transform(2272)

# Add 100 feet buffer around stops to determine the presence of signals
stops <- stops |>
  mutate(isSignaled = lengths(st_intersects(stops, st_buffer(traffic_sig, 100))) > 0)

# mapview::mapview(stops, z = "isSignaled")

3.2.2 Ridership

The dwell time of bus at a bus stop is impacted by the number of riders that get on and off the bus. Higher the number of riders, the longer a bus needs to wait to accommodate their boarding and deboarding.

Ridership data was downloaded from SEPTA’s 2019 daily average ridership. Currently, The ridership data from 2022 is not publicly available. However, we believe the trends in 2019 (pre-pandemic), should be similar to the ones in 2022 (i.e. stops that had the highest ridership then will likely have the highest ridership today).

# load 2019 average daily ridership per stop
rides_2019 <- read_csv("../data/daily-ridership/Fall_2019_Stops_By_Route.csv") |> 
  filter(Route %in% route_include) |> 
  mutate(riders = Weekday_Bo + Weekday_Le) |> 
  dplyr::select(Stop_ID, Route, Direction, riders) |> 
  distinct()

# merge with stops dataset
stops <- stops |> 
  left_join(rides_2019, 
            by = c("routeId" = "Route", "toStopId" = "Stop_ID", "directionName"="Direction"))

3.2.3 Population, Population Density, and Commuters

Greater the population, population density, and commuters, greater the number of public transit users and other types of commuters. These factors could therefore slow traffic down and lead to irregular headways.

The population data comes from 2021 five-year American Census Survey. We wrangled population by block in Philadelphia, Montgomery, and Delaware. The number of commuters is only available on a census tract level. Each bus stop has population, population density, and the number of commuters of the block or census tract it is in.

# Population and populaiton density by blocks 
PhillyBlockPop <- get_acs(geography = "block group",
                         year = 2021,
                         state = "Pennsylvania",
                         county = c("Philadelphia", "Montgomery", "Delaware"),
                         variables = c("pop" = 'B01001_001E'
                                       ),
                         output = "wide",
                         geometry = T,
                         survey = "acs5") |>
                          st_transform(crs = 2272) 

PhillyBlockPop$area <- as.numeric(st_area(PhillyBlockPop$geometry)) / 1000000
PhillyBlockPop$popDen <- PhillyBlockPop$pop / PhillyBlockPop$area

PhillyBlockPop <- PhillyBlockPop |> dplyr::select(pop, popDen, geometry,GEOID)


# Remerge with stops geometry
stops <-
  st_join(stops, PhillyBlockPop, join = st_within) |> dplyr::select(-GEOID)

# Commuter density by blocks

PhillyTractCommuter <- get_acs(geography = "tract",
                         year = 2021,
                         state = "Pennsylvania",
                         county = c("Philadelphia", "Montgomery", "Delaware"),
                         variables = c("commuter" = 'B08006_001E'
                                       ),
                         output = "wide",
                         geometry = T,
                         survey = "acs5") |>
                         st_transform(crs = 2272) |>
                         dplyr::select(commuter, geometry,GEOID)

stops <- st_join(stops, PhillyTractCommuter, join = st_within)

# census_tract <- c(unique(stops$GEOID))

3.2.4 Commercial Licenses

Transportation and foot traffic tends to concentrate in areas with high levels of commercial activity, thus influencing traffic movement, potentially causing delays and leading to bunching of buses. There is also frequent loading and unloading of goods in commercial areas. In a city with narrow roads like Philadelphia, the loading trucks tend to block the traffic. Here we wrangle the number of commercial licenses around each bus stop within 200 feet. We made sure that all licences considered were active before Oct. 15 2022.

Commercial license data is from the City of Philadelphia, and we chose the ones that require loading. These include food services, junk yard, and towing.

3.2.5 Land Use

Land zoned for residential, commercial, and institutional uses are typically origin and destinations of a transit trip. Here we used the land use dataset at a parcel level from the City of Philadelphia. We joined it to the census blockgroup data. The use with the highest number of parcels associated with it was determined as the primary use for that entire blockgroup.” For example, if a blockgroup has 7 commercial parcels and 2 residential parcels, we would classify this blockgroup as “commercial”.

This data is then rejoined to SEPTA bus stops. A stop would then be marked as “commercial”, “residential”, or “institution”, based on the use of the blockgroup it is in.

3.2.6 Trash Day

Trash day data is from the City of Philadelphia. Each sanitary district has a different trash pick-up day. We joined bus stops to these sanitary districts. Then we joined this dataset to each arrival in the original data set. Finally, we created a boolean variable indicating whether each arrival at a bus stop was on a trash day for that area or not.

3.2.7 Geography

As we mentioned before, the location of stops - i.e. whether the stop is in Center City or not – has a huge impact on bunching. Here we indicate whether a bus stop is within the Center City boundary (as defined in the introduction) or not.

center_city <- st_read("../data/center_city.kml") %>% 
  mutate(isCenterCity = TRUE) %>%
  dplyr::select(isCenterCity, geometry) %>% 
  st_transform(crs = 2272)

stops <- st_join(stops, center_city, join = st_within)
stops$isCenterCity[is.na(stops$isCenterCity)] <- FALSE

3.2.8 Spatial Lag of Stop-Level Variables

Bunching tends to persist. One reason for this is the presence of consistent road blocks or a consistently high number of boardings that adds to the dwell time of a bus. These spatial lags created below account for these effects that are spread across space - specifically for traffic signals and the number of riders 10 and 15 stops before the stop of the current arrival.

stops_spatial_lag <- stops |>
  distinct(routeId, directionId, toStopId, .keep_all = T)|>
  arrange(routeId, directionId, toStopSequence, toStopId) |> 
  group_by(routeId, directionId) |>
  mutate(sumSignal_15 = rollsum(isSignaled, k = 15, fill = NA, align = "right"),
         sumSignal_10 = rollsum(isSignaled, k = 10, fill = NA, align = "right"),
         sumSignal_20 = rollsum(isSignaled, k = 20, fill = NA, align = "right"),
         sumRiders_15 = rollsum(riders, k = 15, fill = NA, align = "right"),
         sumRiders_10 = rollsum(riders, k = 10, fill = NA, align = "right"),
         sumRiders_20 = rollsum(riders, k = 20, fill = NA, align = "right"),
         sumComm_10 = rollsum(comm_count, k = 10, fill = NA),
         sumComm_15 = rollsum(comm_count, k = 15, fill = NA),
         sumComm_20 = rollsum(comm_count, k = 20, fill = NA)) |> 
  mutate(pctSignal_15 = (sumSignal_15 / 15 * 100),
         pctSignal_10 = (sumSignal_10 / 10 * 100),
         pctSignal_20 = sumSignal_20/ 20 * 100) %>% # calculate percentage %>%
  dplyr::select(-sumSignal_10,-sumSignal_15,-sumSignal_20)

saveRDS(stops_spatial_lag, "stops_spatial_lag.RData")

3.3 Temporal Variables

The rate of bunching, as we mentioned above, is highly dependent on the time of day. The graphs below show the the amount of bunching that is observed for all arrivals between 10/03/2022 - 10/31/2022.

We processed the iriginal runtime data from Swiftly in the python script here: Jupyter Notebook script for processing raw data

# load processed lag file
# Note when reading parquet, the default time zone is UTC. Everything is four hours 
lagTime <- read_parquet("../data/runtime/runtime-10-3-to-31-with-lagHeadway.gzip")

lagTime <-
  lagTime %>% 
  filter(!headway > 3600) #Delete rows with more than 60 minutes headway

lagTime_20 <- lagTime |> 
  dplyr::select("routeId", "toStopId", "directionId","toStopName", "serviceDate",
                "headwayLag21", "headwayLag20Diff21", "prevBus_headwayLag20",
                "prevBus_headwayLag19Diff20", "lateLag21", "lateLag20Diff21", 
                "prevBus_lateLag20", "prevBus_lateLag19Diff20","speedLag21",
                "speedLag20Diff21", "prevBus_speedLag20",  "prevBus_speedLag19Diff20",
                "bunched", "scheduledFromStopDepartureTime", "initBunching","headway","period") 
# ggplot of the percent bunched by hour and day

g <- lagTime %>%
   mutate(d_name = toupper(weekdays(scheduledFromStopDepartureTime, abbreviate = T))) %>%
   group_by(period, d_name, bunched) %>%
   summarise(B_count = n()) %>%
   filter(!d_name == "SUN") %>%
   pivot_wider(names_from = bunched, values_from = B_count, names_prefix = "b_") %>%
   mutate(pct_bunched = b_TRUE/(b_TRUE+b_FALSE) * 100,
          d_name = factor(d_name, levels = c("MON", "TUE", "WED", "THU", "FRI"))) %>%
   ggplot(aes(x=period, y = pct_bunched)) +
     geom_col(fill = "#5c6db3") +
     labs(title = "Percent bunched by hour and day",
          subtitle = "Observed during weeksdays 10/01/2022 - 10/31/2022") +
     xlab("Hour") +
     ylab("Percent Bunched") +
     facet_wrap(~ d_name) +
     theme_ipsum_musa()

g

 ggsave("../mark-down/viz/pct_bunched.png", width = 8, height = 6)

Percent Bunched by hour and day

We see a clear pattern here - most bunching occurs during the morning rush hour (6AM-9AM) and evening rush hour (3PM - 6PM). The day of the week play a role here as well: Monday and Friday showed slightly less amounts of bunching during peak hours. This makes sense given that post-COVID, a lot of employees work remotely.

g2 <- lagTime %>%
  mutate(d_name = toupper(weekdays(scheduledFromStopDepartureTime, abbreviate = T))) %>%
  group_by(period, d_name, initBunching) %>%
  summarise(B_count = n()) %>%
  filter(!d_name == "SUN") %>%
  pivot_wider(names_from = initBunching, values_from = B_count, names_prefix = "b_") %>%
  mutate(pct_bunched = b_TRUE/(b_TRUE+b_FALSE) * 100,
         d_name = factor(d_name, levels = c("MON", "TUE", "WED", "THU", "FRI"))) %>%
  ggplot(aes(x=period, y = pct_bunched)) +
    geom_col(fill = "#5c6db3") +
    labs(title = "Percent of bunching initiation by hour and day",
         subtitle = "Observed 10/03/2022 - 10/15/2022") +
    xlab("Hour") +
    ylab("Percent Bunched") +
    facet_wrap(~ d_name) +
    theme_ipsum_musa()
 
ggsave("../mark-down/viz/pct_initBunching.png", width = 8, height = 6)

Percent Bunched by hour and day

Looking at only the initiation of bunching, we see that similar patterns emerge. Bunching is often initiated at peak hours in the mornings and evenings.

3.4 Temporal Timepoint-Level Lags

Because bunchings are often caused by variation in speed, lateness of previously dispatched buses, and headway of buses, temproal lag variables are important indicators to predict future bunchings. Here, we will explore lag variables relating to twenty bus stops ahead.

We chose to predict initiation of bunching up to twenty stops ahead to allow longer reaction time for the bus operator to act in order to prevent bunching.

Temperial-lag-explained

The diagram above is an example of temporal lag variables for one arrival instance of bus 321. Suppose we want to predict whether bus 321 will begin to bunch at Stop C. We can trace the trip of this bus twenty-one stops before to Stop A. Ideally, buses are being dispatched every 15 minutes and so headway should also be 15 minutes. This means that bus 321 arrives 15 minutes after bus 320 passed by, at 1:17 pm. However, for whatever reason, bus 321 arrives a little later at 1:20 pm, creating a headway of 18 minutes (3 minutes > expected headway). This difference is recorded under the variable “headwayLag21”.

Then, bus 320 arrives at Stop B at 1:07 pm,taking 5 minutes to travel from Stop A. Bus 321 travels a little faster than Bus 320 to arrive at Stop B at 1:23 pm, taking only 3 minutes to travel from Stop A. This results in a headway of 16 minutes.

The difference between these two headways measured at Stop A and Stop B is -2 minutes (16 - 18). This difference is recorded under the variable “headwayLag20Diff21”.

In general, if we are interested in a particular bus arrival at a specific stop, any of the lag variables will consider the arrival of the same bus at earlier stops.

Lateness is a measure of how early or late buses arrive in comparison to the scheduled arrival time.

Speed is how the fast or slow buses are in comparison to if the speed were uniform (route length / (scheduled trip arrival time at last stop - scheduled trip start time) * distance between two stops).

For our analysis, the unit of headway is seconds, the unit of lateness is second, and the unit of speed is miles per hour.

We create lag variables with the suffix “prevBus_”, after repeating the same process but with bus 320.

The graph below shows the rate of bunching initiation in correlation to the mean of the lag variables.

# Make the data into a long table and create a plot
bunch_lag_gather <- lagTime_20 |> 
  dplyr::select(-"routeId", -"toStopId", -"directionId", -"toStopName", -"period",
                -"serviceDate", -"scheduledFromStopDepartureTime", -"headway") %>%
  filter(initBunching == TRUE | bunched == FALSE ) %>%
  dplyr::select(-bunched) %>% 
  gather(key = "key", value = "input", -initBunching)

# Make ggplot of the lag variable
bunch_lag_gather |>
mutate(facet = factor(key, levels = c("headwayLag21", 
                                      "headwayLag20Diff21","prevBus_headwayLag20",
                                      "prevBus_headwayLag19Diff20", "lateLag21",
                                      "lateLag20Diff21", "prevBus_lateLag20",
                                      "prevBus_lateLag19Diff20","speedLag21",
                                      "speedLag20Diff21", "prevBus_speedLag20",
                                      "prevBus_speedLag19Diff20"))) |>
ggplot(aes(initBunching, input, fill=initBunching)) +
    geom_bar(position = "dodge", stat = "summary", fun = "mean", color = "transparent") +
    scale_fill_manual(values = c(red0,blue0)) +
    facet_wrap(~facet, scales = "free") +
    labs(x="", y="Mean",
         title = "Feature associations with the likelihood to initiate bunching",
         subtitle = "Based on the mean of lagged headway, lateness, and speed",
         fill = "Bunched") +
    theme_ipsum_musa()

ggsave("../mark-down/viz/temp_lag_assc.png", height = 7, width = 12)
# bunch_lag_gather |>
# ggplot() +
#     geom_histogram(aes(input)) +
#     facet_wrap(~key, scales = "free") +
#     theme_light() + theme(legend.position = "none")

rm(bunch_lag_gather)

Association with temporal lag variables

The graph shows:

  • Initiation of bunching is highly related to headways of bus trips – i.e. the shorter the headways 21 - 20 stops ago, the higher the chance the bus will start to bunch 20 stops later. As headways lessen between two buses, greater the likelihood of bunching.

  • Initiation of bunching is highly related to lateness of bus trips – i.e. higher likelihood of the initiation of bunching with higher irregularities in lateness - if one bus trip arrives early and previous one arrives late.

  • Initiation of bunching is highly related to difference in speed – i.e. higher likelihood of initiation of bunching with greater irregularities in speeds, if one bus trip is fast and the previous bus trip is slow.

3.5 Panel Creation: Merging Temporal and Spatial Data

We merged all the variables above to create a panel dataset for modeling. The column “bunched” will be the dependent variable.

# Change data type for merge
lagTime$routeId <- as.character(lagTime$routeId)
lagTime$toStopName <- as.character(lagTime$toStopName)
lagTime$directionId <- as.numeric(as.character(lagTime$directionId))
lagTime$toStopId <- as.numeric(as.character(lagTime$toStopId))

stops_spatial_lag <- readRDS("./stops_spatial_lag.RData")

# Create panel by merging stop-level and timepoint level variables
panel <-
  lagTime |> left_join(stops_spatial_lag, by = c("routeId", "toStopId", "directionId"))

panel$serviceDayName <- toupper(weekdays(panel$scheduledFromStopDepartureTime, abbreviate = T))

panel <- panel |> mutate(isTrashDay = ifelse(serviceDayName == COLLDAY, "TRUE","FALSE"))

3.5.1 Panel-level variables: Weather

Lastly, we added weather on an hourly basis to the panel dataset. We hypothesised that weather might have some effects on the speed of traffic if precipitation is heavy or if visibility is low. The weather data is collected through airport records in Philadelphia.

3.6 Other feature associations with the likelihood of bunching

The graphs below show the feature associations with bunching by numerical and categorical types of variables.

3.6.1 Numeric panel and numeric stop-level lag variables

# Select all numerical variables 
bunch_other <- panel |> 
  dplyr::select(initBunching, tempF, precipIn, vsby, pop, popDen, commuter, riders, comm_count,
                pctSignal_10, pctSignal_15, pctSignal_20, sumRiders_10, sumRiders_15, sumRiders_20, sumComm_10, sumComm_15, sumComm_20)

# From wide to long
bunch_other_gather <- bunch_other |> gather(key = "key", value = "input", -initBunching)

# ggplot Feature associations with initiation of bunching
bunch_other_gather |>
mutate(facet = factor(key, 
                      levels = c("initBunching", "tempF", "precipIn", "vsby",
                                "pop", "popDen", "commuter", "riders",
                                "comm_count", "pctSignal_10", "pctSignal_15", "pctSignal_20",
                                "sumRiders_10", "sumRiders_15", "sumRiders_20", "sumComm_10", "sumComm_15",
                                "sumComm_20"))) |>
ggplot(aes(initBunching, input, fill=initBunching)) + 
    geom_bar(position = "dodge", stat = "summary", fun = mean, na.rm = T, width = 0.9,
             color = "transparent") + 
    scale_fill_manual(values = c(red0,blue0)) +
    facet_wrap(~facet, scales = "free") +
    labs(y="Mean", 
         title = "Feature associations with the likelihood to initiate bunching",
         subtitle = "Numeric panel and numeric stop-level lag variables",
         fill = "Initiation of Bunching",
         x=""
        ) +
    theme_ipsum_musa()

ggsave("../mark-down/viz/stop_lag_assoc.png", height = 8, width = 10)

numeric stop-level and weather variables association

With the numeric variables, we decided to summarize the data using their averages. The strongest association is between the number of riders and the likelihood to of bunching initiation. The more riders that get on or off at a bus stop, the longer the bus has to wait at that bus stop.

3.6.2 Categorical variables

#ggsave("../presentation-viz/out_img/feature_assoc_numeric_bunched.png", height = 4, width = 3, scale = 2)

bunch_cat_gather <- 
  panel |>
  dplyr::select(initBunching, precip_q3, vsby_q3, tempF_q3, pop_q3, popDen_q3, commuter_q3, riders_q3, isSignaled, landUse, comm_q3, isTrashDay, isCenterCity) |>
  rename(precipitation_level = precip_q3, 
         visability_level = vsby_q3,
         temperature_level = tempF_q3,
         population_level = pop_q3,
         popDensity_level = popDen_q3,
         commuter_level = commuter_q3,
         riders_level = riders_q3,
         commercial_level = comm_q3) %>%
  gather(key = "key",value = "input", -initBunching) |>
  group_by(initBunching,key,input) |>
  summarize(count = n()) 


bunch_cat_gather |>
   mutate(facet = factor(key,levels=c("precipitation_level","visability_level","temperature_level","population_level","popDensity_level","commuter_level","riders_level","commercial_level","landUse","isTrashDay","isSignaled","isCenterCity"))) |>
  ggplot(aes(input, count, fill = initBunching, na.rm = T)) +   
    geom_bar(position = "fill", stat="identity",na.rm = T, color = "transparent") +
    scale_fill_manual(values = c("transparent",blue0)) +
    scale_x_discrete(na.translate = FALSE) +
    coord_cartesian(ylim = c(0,.014)) +
    facet_wrap(~facet, scales="free_x", nrow = 3) +
    labs(x="", y="Count",
         title = "Feature associations with the likelihood to intiate bunching",
         subtitle = "Categorical features in quantiles by percentage, intiated bunching incidents only",
         fill = "Initiation of Bunching") +
    theme_ipsum_musa() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave("viz/catagorical_assoc.png",height = 8, width = 8)

# The code below shows the association between variables and bunched instead of initiation of bunching:
# bunch_cat_lag_gather <- 
#   lag |>
#   dplyr::select(bunched, traffic_lag,traffic_lag3, traffic_lag6, traffic_lag12) |>
#   gather(key = "key",value = "input", -bunched) |>
#   group_by(bunched,key,input) |>
#   summarize(count = n())
# 
# bunch_cat_lag_gather |> 
#   ggplot(aes(input, count, fill = bunched, na.rm = T)) +   
#     geom_bar(position = "fill", stat="identity",na.rm = T) +
#     scale_fill_manual(values = c(red0,blue0)) +
#     scale_x_discrete(na.translate = FALSE) +
#     facet_wrap(~key, scales="free_x", nrow = 3) +
#     labs(x="", y="Count",
#          title = "Spatial catagory lag feature associations with the likelihood of bunching by percentage",
#          subtitle = "Traffic Signals",
#          fill = "Bunched") +
#     theme_light() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

# bunch_cat_gather |> na.omit() |> filter(bunched == TRUE) |> 
#   ggplot(aes(input, count, fill = "bunched",width = 0.5)) +   
#     geom_bar(position = "dodge", stat="identity") +
#     facet_wrap(~key, scales="free", nrow = 3) +
#     scale_fill_manual(values = "#00BFC4") +
#     labs(x="Bunched", y="Count",
#          title = "Feature associations with the likelihood of bunching",
#          subtitle = "Category features in quantiles, showing counts for bunched buses",
#          fill = NULL) + 
#     theme_light() + theme(axis.text.x = element_text(angle = 45, hjust = 1),
#                             legend.position = "none")

# bunch_other_gather |>
# ggplot() + 
#     geom_histogram(aes(input)) + 
#     facet_wrap(~key, scales = "free") +
#     theme_light() + theme(legend.position = "none") 

catogrical association

We formatted numerical variables into three quantiles across categorical variables to test their association with the initiation of bunching. The graph above shows the variation of initiation of bunching across categories.

The number of commercial licenses within 200 feet of a stop shows a strong positive association with the initiation of bunching. Higher the commercial activity around a stop (loading, unloading, outdoor dining etc),the more likely that bunching will occur at that stop. the number of riders has a a similar association.

Unsurprisingly, the stops within Center City are more likely to experience the initiation of bunching compared to stops outside of Center City.

Population density, on the other hand, is negatively associated with the initation of bunching. The higher the density, lower the initiation of bunching. This is expected because the highest densities occur in Center City, which is a small fraction of area compared to the rest of the city, resulting in a fairly small amount of true incidents for the initiation of bunching.

3.7 Variable and Model Selection

We tested three models–linear regression, random forest, and xgboost tree–with four variable combinations: kitchen sink (with everything), kitchen sink without weather variables, temporal and spatial lags only, and temporal and spatial lags with significant variables identified in the previous section.

ROC is the metric to rank the model performances.

The result shows that xgboost tree provides the best prediction on the dataset without weather variables considered.

# Set up panel data
panels <- panel
# z <- panel
rm(panel)

#panels <- read_csv("../data/final_panel_allstops_ahead.csv") 


#panels$index <-  1:nrow(panels)

cols.num <- c("precip_q3", "vsby_q3", "tempF_q3", "pop_q3", "popDen_q3", "commuter_q3", "riders_q3", "comm_q3", "directionId")

panels[cols.num] <- sapply(panels[cols.num], as.character)

panels[sapply(panels, is.factor)] <- lapply(panels[sapply(panels, is.factor)], 
                                            as.character)
panels[sapply(panels, is.logical)] <- lapply(panels[sapply(panels, is.logical)], 
                                             as.factor)

#Filter out incident of arrival that only start bunching or non bunched. 

panels_filtered <- panels %>%
  filter(initBunching == TRUE | bunched == FALSE) 

#These variables do not vary on lag numbers
panels_fixed_var <- panels_filtered %>% 
  dplyr::select(routeId, toStopId, directionId, tripId, instanceId,toStopSequence.x, period,
                initBunching, tempF, precipIn, vsby, pop, popDen, commuter,
                riders, comm_count, precip_q3, vsby_q3, tempF_q3, pop_q3, popDen_q3, 
                commuter_q3, riders_q3, isSignaled, landUse, comm_q3, isTrashDay,
                isCenterCity, serviceDayName, day) %>%
  rename(precipitation_level = precip_q3, vsby_level = vsby_q3, 
         temperature_level = tempF_q3, population_level = pop_q3,
         popDensity_level = popDen_q3, commuter_level = commuter_q3,
         riders_level = riders_q3,commercial_level = comm_q3,
         toStopSequence = toStopSequence.x)

#These do vary with stops, here we are selecting ones that relate to prediction 20 stops ahead
panels_20 <- panels_filtered %>%
  dplyr::select("headwayLag21", 
                "headwayLag20Diff21","prevBus_headwayLag20",
                "prevBus_headwayLag19Diff20", "lateLag21",
                "lateLag20Diff21", "prevBus_lateLag20",
                "prevBus_lateLag19Diff20","speedLag21",
                "speedLag20Diff21", "prevBus_speedLag20",
                "prevBus_speedLag19Diff20",ends_with("_20")) 

# Saprate into different dataframes with different variables

# Kitchen Sink
panels_ks <-
  cbind(panels_20, panels_fixed_var) 

# No Weather variable
panels_no_weather <-
  panels_ks %>%
  dplyr::select(-starts_with("temp"), -starts_with("precip"), -starts_with("vsby"))

#Only with Temporal Lag and Spatial Lag
panels_lagTS <-
  panels_ks %>%
  dplyr::select(toStopId, directionId, routeId, toStopSequence, tripId, instanceId, serviceDayName, period, 
    contains("lag", ignore.case = T), starts_with("sum"), starts_with("pct"), day, initBunching)

# Lag with other signifianct variables from exploration
panels_signi <-
  panels_ks %>% 
  dplyr::select(initBunching,toStopId, directionId, routeId, toStopSequence,tripId, instanceId, serviceDayName, period, 
    contains("lag", ignore.case = T),starts_with("sum"), starts_with("pct"), riders, commercial_level, popDensity_level, day)

We first split the panel data into train and test sets according to the date of arrival instances. The train set has the first three weeks of arrival instances. The trained model trained will predict initiation of bunching on the test set that which contains the last week of arrival instances.

All this data is pre-processed through the “recipe” function in “tidymodel” package. We imputed the mean for any missing data in spatial lag variables. This is to minimize the number of rows that are deleted in the model (since models can only be trained on datasets without NAs).

# Specify train day and test day 
train_day <- c(1:21)
test_day <- c(22:31)

# Create function to get train set and pre-preprocess it to be trained on
create_training_set <- function(dataset){
rec <- 
  recipe(initBunching ~ ., data = dataset %>% filter(day %in% train_day)) %>%
    update_role(toStopId, tripId, instanceId, new_role = "id") %>%
    step_impute_mean(starts_with("sum"),starts_with("pct")) %>%
    step_unknown(all_nominal_predictors()) %>%
    step_other(threshold = 0.05) %>%
    step_dummy(all_nominal(), -all_outcomes(), one_hot = T) %>%
    step_zv(all_predictors()) %>%
    step_normalize(all_predictors()) %>%
    step_naomit(everything()) %>%
    prep() %>%
    juice() 
}

# Create function to get test set and pre-process it to be predicted on 
create_test_set <- function(dataset){
rec <- 
  recipe(initBunching ~ ., data = dataset %>% filter(day %in% test_day)) %>%
    update_role(toStopId, tripId, instanceId, new_role = "id") %>%
    step_impute_mean(starts_with("sum"),starts_with("pct")) %>%
    step_unknown(all_nominal_predictors()) %>%
    step_other(threshold = 0.05) %>%
    step_dummy(all_nominal(), -all_outcomes(), one_hot = T) %>%
    step_zv(all_predictors()) %>%
    step_normalize(all_predictors()) %>%
    step_naomit(everything()) %>%
    prep() %>%
    juice()

# Split data into train and test sets for four variable combinations
panel_train_ks <- create_training_set(panels_ks) %>% dplyr::select(-precipitation_level_X1, -vsby_level_X2)
panel_train_lagTS <- create_training_set(panels_lagTS)
panel_train_no_weather <- create_training_set(panels_no_weather)
panel_train_sign <- create_training_set(panels_signi)

panel_test_ks <- create_test_set(panels_ks)
panel_test_lagTS <- create_test_set(panels_lagTS)
panel_test_no_weather <- create_test_set(panels_no_weather)
panel_test_sign <- create_test_set(panels_signi)
# Create workflow for each model to be run on the train sets

glm_result <- function(df, df_test, df_name, md_name){
  
  glm_initial <- 
    logistic_reg() %>%
    set_engine("glmnet") %>%
    set_args(penalty = 0) %>%
    set_args(mixture = 0) %>%
    set_mode("classification") %>%
    fit(initBunching~., data = df %>% dplyr::select(-tripId, -toStopId, -instanceId, -day)) 
  
  glm_result <- 
    glm_initial %>%
    predict(df, type = "prob") %>%
    bind_cols(df) %>%
    roc_auc(truth = initBunching, estimate = .pred_FALSE)

  
  df_test <- 
    df_test %>% dplyr::select(-tripId, toStopId, -instanceId, -day)
  
  glm_result2 <- 
    glm_initial %>%
    predict(df_test, type = "prob") %>%
    bind_cols(df_test) %>%
    roc_auc(truth = initBunching, estimate = .pred_FALSE)

  glm_result <-
    glm_result %>% 
    mutate(Dataset = df_name,
           Model = md_name,
           TestOrTrain = "train")
  
  glm_result2 <-
    glm_result2 %>% 
    mutate(Dataset = df_name,
           Model = md_name,
           TestOrTrain = "test")
  
  rbind(glm_result, glm_result2)
}

rf_result <- function(df, df_test, df_name, md_name){
  
  rf_initial <- 
    rand_forest() %>%
    set_args(trees = 50) %>% 
    set_args(min_n = 100) %>%
    set_engine("ranger") %>% 
    set_mode("classification") %>%
    fit(initBunching~., data = df %>% dplyr::select(-tripId, -toStopId, -instanceId, -day)) 
  
  rf_result <- 
    rf_initial %>%
    predict(df, type = "prob") %>%
    bind_cols(df) %>%
    roc_auc(truth = initBunching, estimate = .pred_FALSE)
  
  df_test <- 
    df_test %>% dplyr::select(-tripId, toStopId, -instanceId, -day)
  
  rf_result2 <- 
    rf_initial %>%
    predict(df_test, type = "prob") %>%
    bind_cols(df_test) %>%
    roc_auc(truth = initBunching, estimate = .pred_FALSE)


  rf_result <-
    rf_result %>% 
    mutate(Dataset = df_name,
           Model = md_name,
           TestOrTrain = "train")
  
  rf_result2 <-
    rf_result2 %>% 
    mutate(Dataset = df_name,
           Model = md_name,
           TestOrTrain = "test")
  
  rbind(rf_result, rf_result2)
}

xgb_result <- function(df, df_test, df_name, md_name){
  
  xgb_initial <- 
    boost_tree() %>%
    set_args(min_n = 100) %>%
    set_args(trees = 50) %>% 
    set_engine("xgboost") %>% 
    set_mode("classification") %>%
    fit(initBunching~., data = df %>% dplyr::select(-tripId, -toStopId, -instanceId,-day)) 
  
  xgb_result <- 
    xgb_initial %>%
    predict(df, type = "prob") %>%
    bind_cols(df) %>%
    roc_auc(truth = initBunching, estimate = .pred_FALSE)

  df_test <- 
    df_test %>% dplyr::select(-tripId, toStopId, -instanceId, -day)
  
  xgb_result2 <- 
    xgb_initial %>%
    predict(df_test, type = "prob") %>%
    bind_cols(df_test) %>%
    roc_auc(truth = initBunching, estimate = .pred_FALSE)
  
  xgb_result <-
    xgb_result %>% 
    mutate(Dataset = df_name,
           Model = md_name,
           TestOrTrain = "train")
  
  xgb_result2 <-
    xgb_result2 %>% 
    mutate(Dataset = df_name,
           Model = md_name,
           TestOrTrain = "test")
  
  rbind(xgb_result, xgb_result2)
}
set.seed(400)

# Run function to get the initial results 
ks_glm_initial <- glm_result(panel_train_ks,panel_test_ks, "Kitchen Sink", "GLM")
no_weather_glm_initial <- glm_result(panel_train_no_weather,panel_test_no_weather, "Without Weather", "GLM")
lagTS_glm_initial <- glm_result(panel_train_lagTS, panel_test_lagTS, "Temporal and Spatial Lags", "GLM")
sign_glm_initial <- glm_result(panel_train_sign, panel_test_sign, "With Other Significant Variables", "GLM")

# rbind(ks_glm_initial,no_weather_glm_initial,lagTS_glm_initial,sign_glm_initial) %>% 
#write.csv("../data/other/glm_initial_results.csv")

ks_rf_initial <- rf_result(panel_train_ks, panel_test_ks,"Kitchen Sink", "Random Forest")
no_weather_rf_initial <- rf_result(panel_train_no_weather,panel_test_no_weather,"Without Weather", "Random Forest")
lagTS_rf_initial <- rf_result(panel_train_lagTS, panel_test_lagTS, "Temporal and Spatial Lags", "Random Forest")
sign_rf_initial <- rf_result(panel_train_sign, panel_test_sign, "With Other Significant Variables", "Random Forest")

rbind(ks_rf_initial, lagTS_rf_initial, sign_rf_initial, no_weather_rf_initial) %>% write.csv("../data/other/rf_initial_results.csv")

ks_xgb_initial <- xgb_result(panel_train_ks, panel_test_ks, "Kitchen Sink", "XGBoost")
no_weather_xgb_initial <- xgb_result(panel_train_no_weather,panel_test_no_weather, "Without Weather", "XGBoost")
lagTS_xgb_initial <- xgb_result(panel_train_lagTS, panel_test_lagTS, "Temporal and Spatial Lags", "XGBoost")
sign_xgb_initial <- xgb_result(panel_train_sign, panel_test_sign, "With Other Significant Variables", "XGBoost")

# Combine the resutls into a table

initial_results <- rbind(ks_glm_initial,no_weather_glm_initial,lagTS_glm_initial,sign_glm_initial, ks_rf_initial,no_weather_rf_initial,lagTS_rf_initial, sign_rf_initial,ks_xgb_initial, no_weather_xgb_initial,lagTS_xgb_initial,sign_xgb_initial) %>% 
      arrange(desc(.estimate)) %>% filter(TestOrTrain == "test") %>%
      dplyr::select(-TestOrTrain)

ROC-AUC (Receiver operating characteristic - area under the curve) is the metric to rank binary result models. ROC is a good indicator of model performance when it comes to imbalanced datasets. It considers the relation between True Positive rate (predicted true and actually true) and False Positive rate (predicted true, actually false) across various thresholds. Generally, the higher the roc-auc or the larger the area under the curve, the better the model.

# write.csv(initial_results, "../data/initial_results.csv")

initial_results <- read.csv("../data/initial_results.csv")

initial_results %>% dplyr::select(-X) %>% rename (Rank = X.1)%>%  kable() %>% kable_styling("striped", full_width = F) 
Rank .metric .estimator .estimate Dataset Model
1 roc_auc binary 0.8666502 Without Weather XGBoost
2 roc_auc binary 0.8659709 Kitchen Sink XGBoost
3 roc_auc binary 0.8652692 With Other Significant Variables XGBoost
4 roc_auc binary 0.8638091 Temporal and Spatial Lags XGBoost
5 roc_auc binary 0.8368549 Without Weather Random Forest
6 roc_auc binary 0.8361816 Kitchen Sink Random Forest
7 roc_auc binary 0.8358384 Temporal and Spatial Lags Random Forest
8 roc_auc binary 0.8353477 With Other Significant Variables Random Forest
9 roc_auc binary 0.8156384 With Other Significant Variables GLM
10 roc_auc binary 0.8154405 Without Weather GLM
11 roc_auc binary 0.8152760 Kitchen Sink GLM
12 roc_auc binary 0.8151066 Temporal and Spatial Lags GLM

In the result above, xgboost tree on the dataset without weather performed the best. This is our initial variable and model selection.

# model_xgb_rec <-
#   recipe(initBunching ~ ., data = panel_final_train) %>%
#   update_role(index, new_role = "id") %>%
#   step_unknown(all_nominal_predictors()) %>%
#   step_other(threshold = 0.05) %>%
#   step_dummy(all_nominal(), -all_outcomes(), one_hot = T) %>%
#   step_zv(all_predictors()) %>%
#   step_normalize(all_predictors()) %>%
#   step_naomit(all_predictors()) 

# run the basic non-tuned xgboost stree model 

vip_model <- boost_tree(trees = 300) %>%
  set_engine("xgboost", importance = "impurity") %>%
  set_mode("classification") %>%
  fit(initBunching ~., data = panel_train_no_weather %>%
        dplyr::select(-instanceId, -toStopId, - tripId, -day))

#saveRDS(vip_model, file = "../data/vip_model.RData")

3.8 Variable Importance

We first ran a basic xgboost non-tuned model to test variable importance.

#vi(vip_model) 

# Uuse function "vip" to exract variable importance

#vip_plot <- vip(vip_model,n = 35, aesthetics = list(color = "blue4", fill = "blue3", alpha = 0.5))

#saveRDS(vip_plot, file="../data/vip_plot_updated2.RData")

vip_plot <- readRDS("../data/vip_plot_updated2.RData")

vip_plot

From the diagram, we see that most important variables are temporal lag variables, demographic variables – average daily riders per stop and population, and the hour of arrival instances. Categorical variables are not significant in the model.

To improve model efficiency, we decided to cut out unimportant variables. We kept the variables that ranked above 55% percentile. The table shows the final variable selection in the model.

We created the final train and test sets based on the variable selection.

vip_model <- readRDS("../data/vip_model.RData")

importance <- vi(vip_model) 

# Only retain variables that score importance above 55 percentile

q <- quantile(importance$Importance, probs = .55) %>% round(digits = 5)

final_selection <- filter(importance, Importance > q)

final_variable <- unique(final_selection$Variable)

final_variable
##  [1] "headwayLag21"               "prevBus_headwayLag20"      
##  [3] "headwayLag20Diff21"         "prevBus_lateLag20"         
##  [5] "sumRiders_20"               "lateLag21"                 
##  [7] "riders"                     "toStopSequence"            
##  [9] "prevBus_speedLag20"         "prevBus_speedLag19Diff20"  
## [11] "speedLag21"                 "speedLag20Diff21"          
## [13] "prevBus_lateLag19Diff20"    "prevBus_headwayLag19Diff20"
## [15] "pop"                        "popDen"                    
## [17] "lateLag20Diff21"            "commuter"                  
## [19] "sumComm_20"                 "pctSignal_20"              
## [21] "period"                     "routeId_X18"               
## [23] "routeId_X60"                "comm_count"                
## [25] "routeId_X17"                "routeId_X47"
panel_final_train <- panel_train_no_weather %>%
  dplyr::select(initBunching,toStopId, tripId, instanceId, all_of(final_variable))
 

panel_final_test <- panel_test_no_weather %>%
 dplyr::select(initBunching,toStopId, tripId, instanceId, all_of(final_variable))

# Create final recipe for model fit
xgb_final_rec <-
  recipe(initBunching ~ ., data = panel_final_train) %>%
  update_role(toStopId, tripId, instanceId, new_role = "id") %>%
  step_impute_mean(starts_with("sum"),starts_with("pct")) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_other(threshold = 0.05) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = T) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_naomit(everything()) 
# plan for tunning
XGB_plan <- boost_tree() %>%
  set_args(mtry  = tune()) %>%
  set_args(min_n = tune()) %>%
  set_args(trees = 100) %>% 
  set_engine("xgboost", importance = "impurity") %>% 
  set_mode("classification")

# tunning grid. Less than 30 mtry since only 30 columns avaliable
xgb_grid <- grid_random(mtry() %>% range_set(c(10,30)), 
                        min_n(),
                        size = 5)

# put into a final workflow
xgb_wf <-
  workflow() %>% 
  add_recipe(xgb_final_rec) %>% 
  add_model(XGB_plan)

# 3 folds for cross validation
folds <- vfold_cv(panel_final_train, v = 3)

control <- control_resamples(save_pred = TRUE, verbose = TRUE)
metrics <- metric_set(accuracy, roc_auc)

xgb_tuned <- xgb_wf %>%
  tune::tune_grid(.,
                  resamples = folds,
                  grid      = xgb_grid,
                  control   = control,
                  metrics   = metrics)

# saveRDS(xgb_tuned, "../data/xgb_tuned.RData")

The model is trained and tuned. Column “mtry” refers to the number of randomly selected columns that are used for a decision tree and “min_n” refers to the tree depth or the maximum number of splits a decision tree can have in the xgboost model.

#best <- show_best(xgb_tuned, metric = "roc_auc", n = 5)

#saveRDS(best, "../data/best_results.RData")
best <- readRDS("../data/best_results.RData")

best %>% kable() %>% kable_styling("striped", full_width = F) 
mtry min_n .metric .estimator mean n std_err .config
24 22 roc_auc binary 0.8720514 3 0.0005854 Preprocessor1_Model2
29 30 roc_auc binary 0.8719749 3 0.0005168 Preprocessor1_Model5
20 36 roc_auc binary 0.8719696 3 0.0004672 Preprocessor1_Model1
21 13 roc_auc binary 0.8712491 3 0.0007566 Preprocessor1_Model3
24 2 roc_auc binary 0.8698647 3 0.0003884 Preprocessor1_Model4

The model performed the best when “mtry” is set to 24 and “min_n” set to 22. This achieved a ROC-AUC of 0.87.

We created a model with the above parameters and ran it on the test set.

xgb_tuned <- readRDS("../data/xgb_tuned.RData")

# Select best parameter
xgb_best_params  <- select_best(xgb_tuned, metric = "roc_auc")

# Add best parameter to the workflow
xgb_best_wf    <- finalize_workflow(xgb_wf, xgb_best_params)

# Create the best fit
xgb_best_fit <- xgb_best_wf %>% fit(panel_final_train) 

#saveRDS(xgb_best_fit,"fit_on_filtered_panel.RData")
# predict on the test set 
xgb_pred <- xgb_best_fit %>% predict(panel_final_test, type = "prob") %>%
  bind_cols(panel_final_test)

#saveRDS(xgb_pred, "../data/last_prediction.RData")

3.9 Result

The final prediction result of the test set is promising! We achieved an roc-auc of 0.87. The graph below shows the roc curve of the predictions.

xgb_pred <- readRDS("last_prediction.RData")

# draw roc curve 
xgb_pred %>% roc(initBunching, .pred_TRUE) %>%
  ggroc(color = "orange", alpha = 0.8, size = 1) +
  ylab("True Positive") +
  xlab("False Negative") +
  labs(subtitle = "Prediction 20 stops ahead, AUC = 0.87") +
  coord_equal(ratio = 1) +
  theme_ipsum_musa()

xgb_pred <- xgb_pred %>%
  mutate(.pred = if_else(.pred_TRUE > 0.5, "TRUE","FALSE"))

3.10 Adjusting Threshold

We will make the following assumption about interventions that tackle the initiation of bus bunching:

  1. If a bus is predicted to bunch a certain number of stops ahead, it will cost $0 to intervene because someone at the SEPTA dispatch center has to simply call the operator on the radio and direct them to adjust the speed of the bus. This is based on an interview with a SEPTA personnel

  2. A cost of $4.2 will be borne by each rider if a bus bunches but was not predicted to bunch. No intervention is adopted in this case. Riders bear the consequence of the bunching and/or gapping . This cost calculation is based on Federal Highway Administration’s benefit cost analysis guidance on $ value of a commuter’s time. Every hour of travelling costs $17, which is about $0.28 per minute. For example, a rider arrives at a bus stop at the scheduled arrival time of a bus, but has to wait for the next bus because the previous one traveled faster than expected. The rider thus has to wait for an another 15 minutes for the next bus, assuming it comes on time. This costs tje rider 0.28 * 15 = $4.2 to wait for these extra minutes.

  3. Each rider will save $4.2 if a bus is predicted to initiate bunching and actually does bunch. In this case. the bus driver is notified and actions are taken to prevent bunching.

There is a risk of human induced bunching. This can happen if the bus is predicted to initiate bunching, but actually does not. The intervention is still adopted, meaning that the bus driver might be ordered to stop the bus at a stop for a period of time even when it does not need to. However, we believe that in a real life scenario, if one bus slows down, the prediction of the following trips would also be adjusted accordingly. As the buses slow down together, the prediction would update to non-initiation bunching, and the speed would be recovered.

Note: This is only a temporary general framework, the framework should be adjusted based on the real scenario

# Cost and benefit analysis (CBA) when threshold is at 0.5

cost_benefit_table <-
   xgb_pred %>%
      count(.pred, initBunching) %>%
      summarize(True_Negative = sum(n[.pred=="FALSE" & initBunching==FALSE]),
                True_Positive = sum(n[.pred=="TRUE" & initBunching==TRUE]),
                False_Negative = sum(n[.pred=="FALSE" & initBunching==TRUE]),
                False_Positive = sum(n[.pred=="TRUE" & initBunching==FALSE])) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               case_when(Variable == "True_Negative"  ~ Count * 0,
                         Variable == "True_Positive"  ~ Count * 4.2,
                         Variable == "False_Negative" ~ -4.2 * Count,
                         Variable == "False_Positive" ~ 0 * Count)) %>%
    bind_cols(data.frame(Description = c(
              "We predicted no bunching initiation and business as usual",
              "We predicted bunching initiation and operation solution adopted",
              "We predicted no bunching initiated and the bus indeed initiate bunching",
              "We predicted bunching initiation and the bus did not intiate bunching"))) 

cost_benefit_table  %>% kable () %>% kable_styling("striped", full_width = F)
Variable Count Revenue Description
True_Negative 426425 0.0 We predicted no bunching initiation and business as usual
True_Positive 75 315.0 We predicted bunching initiation and operation solution adopted
False_Negative 5718 -24015.6 We predicted no bunching initiated and the bus indeed initiate bunching
False_Positive 57 0.0 We predicted bunching initiation and the bus did not intiate bunching

The goal here is to maximize true negatives and minimize false negatives. However, we can sacrifice a little bit on true negatives since it does not provide any benefits.

# function to see sensibility and specificity across threshold

xgb_pred$initBunching <- as.logical.factor(xgb_pred$initBunching)

calculate_metrics <- function(data, threshold) {
  prediction <- data$.pred_TRUE > threshold
  true_positive <- sum(prediction & data$initBunching)
  false_positive <- sum(prediction & !data$initBunching)
  true_negative <- sum(!prediction & !data$initBunching)
  false_negative <- sum(!prediction & data$initBunching)
  
  sensitivity <- true_positive / (true_positive + false_negative)
  specificity <- true_negative / (true_negative + false_positive)
  
  return(data.frame(
    threshold = threshold, sensitivity = sensitivity,
    specificity = specificity
  ))
}

thresholds <- seq(0, 0.3, by = 0.001)
  
metrics <- lapply(thresholds, function(t) calculate_metrics(xgb_pred, t))
metrics <- do.call(rbind, metrics)

metrics %>%
  gather(key = "metric", value = "value", -threshold) %>%
  ggplot(aes(x = threshold, y = value, color = metric)) +
  geom_line() +
  labs(
    title = "Sensitivity and Specificity by Threshold",
    subtitle = "Prediction 20 stops ahead",
    x = "Threshold",
    y = "Metric",
    color = "") + 
  theme_ipsum_musa()

Sensitivity measures the model’s ability to predict true positives, and specificity measures the model’s ability to predict false negatives. The point where these two lines intersect is the optimal threshold – maximizing true predictions. This optimal threshold is around 0.01.

We adjusted our model based on the optimal threshold.

# get prediction result from optimal threshold. 
xgb_pred_0.01 <-
  xgb_pred %>%
  mutate(.pred = if_else(.pred_TRUE > 0.01, "TRUE","FALSE"))
# cost and benefit after adjusting the threshold

cost_benefit_table2 <-
   xgb_pred_0.01 %>%
      count(.pred, initBunching) %>%
      summarize(True_Negative = sum(n[.pred=="FALSE" & initBunching==FALSE]),
                True_Positive = sum(n[.pred=="TRUE" & initBunching==TRUE]),
                False_Negative = sum(n[.pred=="FALSE" & initBunching==TRUE]),
                False_Positive = sum(n[.pred=="TRUE" & initBunching==FALSE])) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               case_when(Variable == "True_Negative"  ~ Count * 0,
                         Variable == "True_Positive"  ~ Count * 4.2,
                         Variable == "False_Negative" ~ -4.2 * Count,
                         Variable == "False_Positive" ~ 0 * Count)) %>%
    bind_cols(data.frame(Description = c(
              "We predicted no bunching initiation and business as usual",
              "We predicted bunching initiation and operation solution adopted",
              "We predicted no bunching initiated and the bus indeed initiates bunching",
              "We predicted bunching initiation and the bus does not intiate bunching"))) 

cost_benefit_table2 %>% kable() %>% kable_styling("striped", full_width = F)
Variable Count Revenue Description
True_Negative 311567 0.0 We predicted no bunching initiation and business as usual
True_Positive 4901 20584.2 We predicted bunching initiation and operation solution adopted
False_Negative 892 -3746.4 We predicted no bunching initiated and the bus indeed initiates bunching
False_Positive 114915 0.0 We predicted bunching initiation and the bus does not intiate bunching

After adjusting our threshold, the model can save ($34,307 - $6,244) / 432,335 total arrival counts = $0.04* per rider per arrival during the last week of October taking the routes used in the dataset, which translates to $16,838 saved per rider across all arrivals.

What about the non-filtered dataset? The test dataset above is filtered by initBunching = TRUE or bunched = FALSE. In real time, we do not have the luxury to do this filtering so let’s mimic a real senario and see what the result looks like.

*Based on the Federal Highway Administration’s guidance on the dollar value of a commuter’s time

# prepare test set that have not been filtered

panels_full <- panels %>% dplyr::select(routeId, toStopId, directionId, tripId, instanceId,toStopSequence.x, period,
                initBunching, pop, popDen, commuter,
                riders, comm_count, pop_q3, popDen_q3, 
                commuter_q3, riders_q3, isSignaled, landUse, comm_q3, isTrashDay,
                isCenterCity, serviceDayName, day, "headwayLag21", 
                "headwayLag20Diff21","prevBus_headwayLag20",
                "prevBus_headwayLag19Diff20", "lateLag21",
                "lateLag20Diff21", "prevBus_lateLag20",
                "prevBus_lateLag19Diff20","speedLag21",
                "speedLag20Diff21", "prevBus_speedLag20",
                "prevBus_speedLag19Diff20",ends_with("_20")) %>% filter(day %in% test_day) %>%
  rename(toStopSequence = toStopSequence.x)

test_full <- recipe(initBunching ~ ., data = panels_full) %>%
    update_role(toStopId, tripId, instanceId, new_role = "id") %>%
    step_impute_mean(starts_with("sum"),starts_with("pct")) %>%
    step_unknown(all_nominal_predictors()) %>%
    step_other(threshold = 0.05) %>%
    step_dummy(all_nominal(), -all_outcomes(), one_hot = T) %>%
    step_zv(all_predictors()) %>%
    step_normalize(all_predictors()) %>%
    step_naomit(everything()) %>%
    prep() %>%
    juice() %>% 
    dplyr::select(initBunching,toStopId, tripId, instanceId, all_of(final_variable))

# predict!

xgb_full_pred <- xgb_best_fit %>% predict(test_full, type = "prob") %>%
  bind_cols(test_full)

#saveRDS(xgb_full_pred, "../data/pred_full_test_panel.RData")
xgb_full_pred <- readRDS("../data/pred_full_test_panel.RData")

# draw ROC curve
xgb_full_pred %>% roc(initBunching, .pred_TRUE) %>%
  ggroc(color = "orange", alpha = 0.8, size = 1) +
  ylab("True Positive") +
  xlab("False Positive") +
  labs(subtitle = "Prediction 20 stops ahead (non-filtered), AUC = 0.83") +
  coord_equal(ratio = 1) +
  theme_ipsum_musa()

# xgb_full_pred$initBunching <- as.logical.factor(xgb_full_pred$initBunching)

calculate_metrics <- function(data, threshold) {
  prediction <- data$.pred_TRUE > threshold
  true_positive <- sum(prediction & data$initBunching)
  false_positive <- sum(prediction & !data$initBunching)
  true_negative <- sum(!prediction & !data$initBunching)
  false_negative <- sum(!prediction & data$initBunching)
  
  sensitivity <- true_positive / (true_positive + false_negative)
  specificity <- true_negative / (true_negative + false_positive)
  
  return(data.frame(
    threshold = threshold, sensitivity = sensitivity,
    specificity = specificity
  ))
}

thresholds <- seq(0, 0.3, by = 0.001)
  
metrics <- lapply(thresholds, function(t) calculate_metrics(xgb_full_pred, t))
metrics <- do.call(rbind, metrics)

metrics %>%
  gather(key = "metric", value = "value", -threshold) %>%
  ggplot(aes(x = threshold, y = value, color = metric)) +
  geom_line() +
  labs(
    title = "Sensitivity and Specificity by Threshold",
    subtitle = "Prediction 20 stops ahead on non-filtered test set",
    x = "Threshold",
    y = "Metric",
    color = "") + 
  theme_ipsum_musa()

# CBA on unfiltered test set

xgb_full_pred <-
  xgb_full_pred %>%
  mutate(.pred = if_else(.pred_TRUE > 0.01, "TRUE","FALSE"))

cost_benefit_table3 <-
   xgb_full_pred %>%
      count(.pred, initBunching) %>%
      summarize(True_Negative = sum(n[.pred=="FALSE" & initBunching==FALSE]),
                True_Positive = sum(n[.pred=="TRUE" & initBunching==TRUE]),
                False_Negative = sum(n[.pred=="FALSE" & initBunching==TRUE]),
                False_Positive = sum(n[.pred=="TRUE" & initBunching==FALSE])) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               case_when(Variable == "True_Negative"  ~ Count * 0,
                         Variable == "True_Positive"  ~ Count * 4.2,
                         Variable == "False_Negative" ~ -4.2 * Count,
                         Variable == "False_Positive" ~ 0 * Count)) %>%
    bind_cols(data.frame(Description = c(
              "We predicted no bunching initiation and business as usual",
              "We predicted bunching initiation and operation solution adopted",
              "We predicted no bunching initiated and the bus indeed initiates bunching",
              "We predicted bunching initiation and the bus does not intiate bunching"))) 

cost_benefit_table3 %>% kable() %>% kable_styling("striped", full_width = F)
Variable Count Revenue Description
True_Negative 345492 0.0 We predicted no bunching initiation and business as usual
True_Positive 4474 18790.8 We predicted bunching initiation and operation solution adopted
False_Negative 1319 -5539.8 We predicted no bunching initiated and the bus indeed initiates bunching
False_Positive 124218 0.0 We predicted bunching initiation and the bus does not intiate bunching

This result is great! With an roc-auc of 0.83, we can still save $13,251 per rider in the last week of October!

4 Generalizability

How does our model perform on other routes besides the one in the train and test sets? We chose routes 29, 42, 58, 64, 70, and L to test the generalizability of our model.

Note: we trained a new model that does not include route specific variables.

# Load data for other routes
other_routes <- read_parquet("../data/runtime/runtime-other-routes-10-3-to-31-with-lagHeadway.gzip")

# Delete route "R" due to inconsistency of directionId and directionName
route_included <- c("29", "42", "64", "65", "L", "70", "58") 

# Read in stops actually included in the dataset
stop_dir <- other_routes |> filter(routeId %in% route_included) %>%
  dplyr::distinct(routeId, directionId, toStopId, toStopName) 

# Change type for merging
stop_dir$directionId <- as.numeric(stop_dir$directionId)
stop_dir$toStopId <- as.numeric(stop_dir$toStopId)

# Read in stops dataset and filter only the routes included, right join with stop_dir
stops <- st_read("../data/other/stops.geojson") |> 
  filter(LineAbbr %in% route_included) |> 
  distinct(LineAbbr, DirectionN, StopId, .keep_all = T) %>%
  mutate(directionId = case_when(DirectionN == "Eastbound" ~ 0,
                                 DirectionN == "Westbound" ~ 1,
                                 DirectionN == "Northbound" ~ 0,
                                 DirectionN == "Southbound" ~ 1)) %>%
  rename(routeId = LineAbbr, toStopSequence = Sequence, 
         directionName = DirectionN, toStopId = StopId, toStopName = StopName) |> 
  dplyr::select(routeId, toStopSequence, directionId, directionName, toStopId, toStopName) |> 
  st_transform(2272) %>% right_join(stop_dir)

# Stop-level data process and cleaning------------------------------------------

# Traffic signal avaliability
traffic_sig <- st_read("../data/other/traffic_signals.geojson") |> st_transform(2272)

stops <- stops |>
  mutate(isSignaled = lengths(st_intersects(stops, st_buffer(traffic_sig, 100))) > 0)

# Riders
rides_2019 <- read_csv("../data/daily-ridership/Fall_2019_Stops_By_Route.csv") |> 
  filter(Route %in% route_included) |> 
  mutate(riders = Weekday_Bo + Weekday_Le) |> 
  dplyr::select(Stop_ID, Route, Direction, riders) |> 
  distinct()

stops <- stops |> 
  left_join(rides_2019, 
            by = c("routeId" = "Route", "toStopId" = "Stop_ID", "directionName"="Direction"))

# Population by blocks

PhillyBlockPop <- get_acs(geography = "block group",
                          year = 2021,
                          state = "Pennsylvania",
                          county = c("Philadelphia", "Montgomery", "Delaware"),
                          variables = c("pop" = 'B01001_001E'
                          ),
                          output = "wide",
                          geometry = T,
                          survey = "acs5") |>
  st_transform(crs = 2272) 

PhillyBlockPop$area <- as.numeric(st_area(PhillyBlockPop$geometry)) / 1000000
PhillyBlockPop$popDen <- PhillyBlockPop$pop / PhillyBlockPop$area

PhillyBlockPop <- PhillyBlockPop |> dplyr::select(pop, popDen, geometry,GEOID)


## Remerge with stops geometry
stops <-
  st_join(stops, PhillyBlockPop, join = st_within) |> dplyr::select(-GEOID)

# Commuter density by blocks
PhillyTractCommuter <- get_acs(geography = "tract",
                               year = 2021,
                               state = "Pennsylvania",
                               county = c("Philadelphia", "Montgomery", "Delaware"),
                               variables = c("commuter" = 'B08006_001E'
                               ),
                               output = "wide",
                               geometry = T,
                               survey = "acs5") |>
  st_transform(crs = 2272) |>
  dplyr::select(commuter, geometry,GEOID)

stops <- st_join(stops, PhillyTractCommuter, join = st_within)

comm <- st_read("../data/comm.geojson") |> st_transform(crs = 2272)


comm_stops <- st_buffer(stops, 200) |> st_join(comm, join = st_intersects) |> 
  group_by(toStopId, routeId, directionId) |> summarize(comm_count = n()) 

stops <- stops |> left_join(st_drop_geometry(comm_stops)) |> 
  mutate(pop_q3 = q3(pop),
         popDen_q3 = q3(popDen),
         riders_q3 = q3(riders),
         commuter_q3 = q3(commuter),
         comm_q3 = q3(comm_count))

# Stop level lag calculation
stops_spatial_lag <- stops |>
  distinct(routeId, directionId, toStopId, .keep_all = T)|>
  arrange(routeId, directionId, toStopSequence, toStopId) |> 
  group_by(routeId, directionId) |>
  mutate(sumSignal_15 = rollsum(isSignaled, k = 15, fill = NA, align = "right"),
         sumSignal_10 = rollsum(isSignaled, k = 10, fill = NA, align = "right"),
         sumSignal_20 = rollsum(isSignaled, k = 20, fill = NA, align = "right"),
         sumRiders_15 = rollsum(riders, k = 15, fill = NA, align = "right"),
         sumRiders_10 = rollsum(riders, k = 10, fill = NA, align = "right"),
         sumRiders_20 = rollsum(riders, k = 20, fill = NA, align = "right"),
         sumComm_10 = rollsum(comm_count, k = 10, fill = NA),
         sumComm_15 = rollsum(comm_count, k = 15, fill = NA),
         sumComm_20 = rollsum(comm_count, k = 20, fill = NA)) |> 
  mutate(pctSignal_15 = (sumSignal_15 / 15 * 100),
         pctSignal_10 = (sumSignal_10 / 10 * 100),
         pctSignal_20 = sumSignal_20/ 20 * 100) %>% # calculate percentage %>%
  dplyr::select(-sumSignal_10,-sumSignal_15,-sumSignal_20)


# Change data type for merging
stops_spatial_lag$toStopId <- as.character(stops_spatial_lag$toStopId)
stops_spatial_lag$directionId <- as.character(stops_spatial_lag$directionId)

# write.csv(stops_spatial_lag, "../data/stops_spatial_lag_other_routes.csv")

# Join stop-level variable back to the routes-----------------------------------
other_routes <-
  other_routes |> filter(routeId %in% route_included) %>% left_join(stops_spatial_lag, by = c("routeId", "toStopId", "directionId"))

other_routes <- force_tz(other_routes, tzone = "UTC")
other_routes$period <- hour(other_routes$scheduledFromStopDepartureTime)

other_routes_2 <- other_routes %>%
  dplyr::select("toStopId", "tripId", "instanceId", "initBunching","headwayLag21", 
                "prevBus_headwayLag20", "headwayLag20Diff21", "prevBus_lateLag20",
                "sumRiders_20", "lateLag21", "riders", "toStopSequence.x", "prevBus_speedLag20",
                "prevBus_speedLag19Diff20","speedLag21", "speedLag20Diff21",
                "prevBus_lateLag19Diff20", "prevBus_headwayLag19Diff20", "pop", "popDen",
                "lateLag20Diff21","commuter", "sumComm_20", "pctSignal_20",
                "period","comm_count", "routeId") %>%
  rename(toStopSequence = toStopSequence.x) %>%
  filter(routeId %in% route_included) 

other_routes_2 <- other_routes_2 %>% dplyr::select(-routeId)
other_routes_2$toStopId <- as.numeric(other_routes_2$toStopId)

# Preprocess the data for prediction--------------------------------------------
other_routes_processed <- recipe(initBunching ~ ., data = other_routes_2) %>%
  update_role(toStopId, tripId, instanceId, new_role = "id") %>%
  step_impute_mean(starts_with("sum"),starts_with("pct")) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_other(threshold = 0.05) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = T) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())%>%
  step_naomit(everything()) %>% prep() %>% juice()

# Load the model----------------------------------------------------------------
fit <- readRDS("../mark-down/fit_for_current_bus.Rdata")

# Predict-----------------------------------------------------------------------
other_pred <- fit %>% predict(other_routes_processed, type = 'prob') %>% bind_cols(other_routes_processed)

#saveRDS(other_pred, "../data/other_routes_pred_results.RData")
# draw roc curve

other_pred <- readRDS("../data/other_routes_pred_results.RData") 

other_pred %>% roc(initBunching, .pred_TRUE) %>%
  ggroc(color = "orange", alpha = 0.8, size = 1) +
  ylab("True Positive") +
  xlab("False Positive") +
  labs(subtitle = "Prediction 20 stops ahead (other routes), AUC = 0.77") +
  coord_equal(ratio = 1) +
  theme_ipsum_musa()

#other_pred$initBunching <- as.logical.factor(other_pred$initBunching)

calculate_metrics <- function(data, threshold) {
  prediction <- data$.pred_TRUE > threshold
  true_positive <- sum(prediction & data$initBunching)
  false_positive <- sum(prediction & !data$initBunching)
  true_negative <- sum(!prediction & !data$initBunching)
  false_negative <- sum(!prediction & data$initBunching)
  
  sensitivity <- true_positive / (true_positive + false_negative)
  specificity <- true_negative / (true_negative + false_positive)
  
  return(data.frame(
    threshold = threshold, sensitivity = sensitivity,
    specificity = specificity
  ))
}

thresholds <- seq(0, 0.3, by = 0.001)
  
metrics <- lapply(thresholds, function(t) calculate_metrics(other_pred, t))
metrics <- do.call(rbind, metrics)

metrics %>%
  gather(key = "metric", value = "value", -threshold) %>%
  ggplot(aes(x = threshold, y = value, color = metric)) +
  geom_line() +
  labs(
    title = "Sensitivity and Specificity by Threshold (Other Routes)",
    subtitle = "Prediction 20 stops ahead on other routes",
    x = "Threshold",
    y = "Metric",
    color = "") + 
  theme_ipsum_musa()

# prediction result after adjusting the threshold

other_pred_0.01 <-
  other_pred %>%
  mutate(.pred = if_else(.pred_TRUE > 0.01, "TRUE","FALSE"))

# CBA on other routes

cost_benefit_table4 <-
   other_pred_0.01 %>%
      count(.pred, initBunching) %>%
      summarize(True_Negative = sum(n[.pred=="FALSE" & initBunching==FALSE]),
                True_Positive = sum(n[.pred=="TRUE" & initBunching==TRUE]),
                False_Negative = sum(n[.pred=="FALSE" & initBunching==TRUE]),
                False_Positive = sum(n[.pred=="TRUE" & initBunching==FALSE])) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               case_when(Variable == "True_Negative"  ~ Count * 0,
                         Variable == "True_Positive"  ~ Count * 4.2,
                         Variable == "False_Negative" ~ -4.2 * Count,
                         Variable == "False_Positive" ~ 0 * Count)) %>%
    bind_cols(data.frame(Description = c(
              "We predicted no bunching initiation and business as usual",
              "We predicted bunching initiation and operation solution adopted",
              "We predicted no bunching initiated and the bus indeed initiates bunching",
              "We predicted bunching initiation and the bus does not initiate bunching"))) 

cost_benefit_table4 %>% kable() %>% kable_styling("striped", full_width = F)
Variable Count Revenue Description
True_Negative 522913 0.0 We predicted no bunching initiation and business as usual
True_Positive 5793 24330.6 We predicted bunching initiation and operation solution adopted
False_Negative 1743 -7320.6 We predicted no bunching initiated and the bus indeed initiates bunching
False_Positive 260590 0.0 We predicted bunching initiation and the bus does not initiate bunching

It is expected that roc-auc in real life settings will be lower because there is a lot of randomness to bunching. However, the result is still useful. The roc-auc is high at 0.77 and we can save $17,010 per rider on routes 29, 42, 58, 64, 70, and L during the entire month of October.

5 Conclusion

This report describes the methodology for predicting bunching selected SEPTA bus routes in Philadelphia, PA, by considering a sample in October. The report details the selection of model, the variables used, and the associated costs and benefit analysis of employing the model in a real life scenario. The most significant variables that determine the initiation of bunching are temporal lag variables (headway, speed, lateness), the number of riders, and the hour of operation. Overall, the final model achieved an roc-auc of 0.87 during testing, and saved about $28,000 per rider. The model is also generalizable. When testing on other routes besides the ones in the train set, the roc-auc achieved is 0.77, and saved a similar amount per rider.