A car-free bridge is still considered a ridiculous idea in many parts of our country. Portlanders beg to differ. Portland's newest bridge, the Tilikum Crossing, opened in 2015, and is highly multimodal, allowing travel for pedestrians, bikes, electric scooters, trains, streetcars, and buses (but the modality of travel by personal car is missing). Bike lanes were not an afterthought, but rather an integral part of the bridge design. One therefore expects to see a good amount of bike traffic on Tilikum.
In this activity, we examine the data collected by the bicycle counters on the Tilikum. Portland is divided into east side and west side by the north-flowing Willamette river and the Tilikum connects the two sides with both eastbound and westbound lanes. Here is a photo of the bike counter (the black display, located in between the streetcar and the bike lane) on the bridge.
Portlanders use the numbers displayed live on this little device to boast about Portland's bike scene in comparison to other cities. The data from the device can also be used in more complex ways. The goal of this lesson is to share the excitement of extracting knowledge or information from data - it is more fun than a Sherlock Holmes tale. In this activity, you get to be Mr. Holmes while you wrangle with the data and feel the thrill of uncovering the following facts that even many of the locals don't know about. $(a)$ Most of those who bike to work on Tilikum live on the east side. $(b)$ Recreational bikers on Tilikum prefer afternoon rides. $(c)$ There are fewer bikers on the bridge after social distancing and they appear to use the bridge during afternoons.
Comparison with Seattle's Fremont bridge bike counter data reveals more, as we shall see: $(a)$ there are fewer bikers on Portland's Tilikum than on Seattle's Fremont bridge in general. $(b)$ During peak hours, bikers are distributed more evenly on Seattle's Fremont bridge travel lanes than on Tilikum. $(c)$ The bike usage on both bridges have shifted to a recreational pattern after social distancing.
The BikePed Portal provides some of the data collected from the counter for the public, but currently only subsampled data can be downloaded from there. Here we shall instead use the full raw data set collected by the counters, which is not yet publicly downloadable. I gratefully acknowledge Dr. Tammy Lee and TREC for making this data accessible. This activity is motivated by the material in Working with Time Series section of [JV-H].
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn; seaborn.set()
As you have seen in previous activities, the first step in dealing with real data is data wrangling to make the data fit our tools. The case of this data is no different. (If you haven't yet heard of Hadley Wickham's famous paper Tidy Data, J. Stat. Software, I recommend you take a look. It begins with the sentence, "A huge amount of effort is spent cleaning data to get it ready for analysis $\ldots$")
# metadata file (small file)
tm = pd.read_csv('../../data_external/tilikum_metadata.csv')
# data file (large file)
td = pd.read_csv('../../data_external/tilikum_20200501.csv')
td.head()
td.tail()
Looking through first few (of the over 300,000) data entries above, and then examining the meta data file contents in tm
, we conclude that volume
gives the bike counts. The volume is for 15-minute intervals as seen from measure_period
. A quick check indicates that every data entry has a starting and ending time that conforms to a 15-minute measurement.
dif = pd.to_datetime(td['end_time']) - pd.to_datetime(td['start_time'])
(dif == dif[0]).all()
Therefore, let us rename start_time
to just time
and drop the redundant data in end_time
and measure_period
(as well as the id
) columns.
td = td.rename(columns={'start_time':'time'}).drop(columns=['end_time', 'measure_period', 'id'])
The meta data also tells us to expect three detectors and three values of flow_detector_id
. Here are a few entries from the meta data:
tm.T.loc[['detector_description', 'flow_detector_id', 'detector_make', 'detector_name', 'facility_description'], :]
Although there are three values of flow_detector_id
listed above, one of these values never seems to appear in the data file. You can check that it does not as follows:
(td.flow_detector_id==1904).sum()
Therefore, going through the meta data again, we conclude that eastbound and westbound bikes pass through the flow detectors with id-numbers 1903 and 1905, respectively.
The next step is to reshape the data into the form of a time series. The start_time
seems like a good candidate for indexing a time series. But it's a red herring. A closer look will tell you that the times are repeated in the data set. This is because there are distinct data entries for the eastbound and westbound volumes with the same time stamp. So we will make two data sets (since our data is not gigabytes long, memory will not be an issue), a tE
for eastbound volume and tW
for westbound volume.
tE = td.loc[td['flow_detector_id']==1903, ['time', 'volume']]
tE.index = pd.DatetimeIndex(pd.to_datetime(tE['time'])).tz_convert('US/Pacific')
tE = tE.drop(columns=['time']).rename(columns={'volume':'Eastbound'})
tW = td.loc[td['flow_detector_id']==1905, ['time', 'volume']]
tW.index = pd.DatetimeIndex(pd.to_datetime(tW['time'])).tz_convert('US/Pacific')
tW = tW.drop(columns=['time']).rename(columns={'volume':'Westbound'})
Note that we have now indexed eastbound and westbound data by time stamps, and renamed volume
to Eastbound
and Westbound
respectively in each case.
We are now ready for a first look at the full time series. Let us consider the eastbound data first.
tE.plot();
Clearly, we have problems with this data. A spike of 7000 bikers passing through in 15 minutes, even for a bike-crazed city like Portland, just does not seem right. Zooming in, we find the situation even more disturbing, with a lot of zero readings before the spike:
tE['2018-11-25':'2019-06-01'].plot();
There are reports from TriMet of construction in 2018 and city traffic advisories in 2019 that might all affect bike counter operation, but since the data set seems to have no means to indicate these outages, we are forced to come up with some strategy ourselves for discarding the false-looking entries from the data.
First, exploiting pandas
' ability to work with missing values, we declare the entries for the dates in the above plot to be missing. Note that missing data is not the same as zero data. When the bike counter is not working, the data should ideally be marked as missing, not zero. Since our suspicion is that outages might have resulted in defective counts, we shall effectively remove all data entries for these dates from the data set, as follows:
tE['2018-11-25':'2019-06-01'] = np.nan
Next, we shall declare all entries with a volume of more than 1000 bikes per 15 minute to be a missing/defective value on both the westbound and eastbound data.
tE[tE > 1000] = np.nan
tW[tW > 1000] = np.nan
After the preparations above, we are now ready to visualize. Let us merge the east and west two data sets on the same time stamp axis.
t = pd.merge(tE, tW, on='time')
t.plot(alpha=0.7, style=['-',':']);
Examining the above graph, we still see spikes that look unreasonably high in the beginning of the data, but they may actually be real because at the official opening of the bridge there were 30,000 to 40,000 people and at least 13,000 bikes milling around. Similarly, the other spikes may be real data. One can try to explain them, e.g., by consulting https://bikeportland.org/events/, from which you might conclude that the spike on August 25, 2019 is due to a Green Loop event, and that large spike on June 29, 2019 might be due to all the people coming over for the World Naked Bike Ride; or was it some afterparty of Loud'n Lit event? I can't really tell. We'll just leave it at that, and blame the remaining spikes on the groovy bike scene of Portland.
The quarter-hour samples look too dense in the plot above. A better picture of the situation is obtained by extracting weekly counts of bikes in both directions from the data.
t.resample('W').sum().plot(style=['-',':'], title='Weekly bike counts on Tilikum');
The Tilikum is being used both by people who commute to work using a bicycle as well as recreational bicycle users. We can understand more about this division among bikers by dividing the data into weekend and weekday entries.
The only technical skills you need for this are numpy.where
and an understanding of pandas.Timestamp
objects. (Please ensure you have studied Working with Time Series section of [JV-H] before proceeding.) Combined with a use of pandas.groupby
, we can then extract the mean biker volumes for each 15-minute interval during the day.
The result is the distribution plotted below.
def weekplot(d, onlyweekend=False, title=None):
weekend = np.where(d.index.weekday < 5, 'Weekday', 'Weekend')
by_time = d.groupby([weekend, d.index.time]).mean()
if onlyweekend:
if title is None: title = 'Bikes per 15-min during weekends'
by_time.loc['Weekend'].plot(title=title)
else:
if title is None: title = 'Bikes per 15-min during weekdays'
by_time.loc['Weekday'].plot(title=title)
weekplot(t)
The hourly distribution is distinctly "bimodal". There is a group of westbound commuters (on their way to work) on the bridge in the morning, and a group (probably the same people) traveling eastbound after work. If you look closely, you will find that there are two slightly smaller bumps indicating that there are some (although many fewer) eastbound morning bikers and westbound evening bikers across the bridge. Yet, on the whole, the data leads us to the interesting conclusion that the overwhelming majority of the bike commuters on the Tilikum live on the east side and commute to the west for work and return daily.
Often the purpose of understanding data is to guide policy and action. What might one do with the pattern we have just discovered? The current numbers are small enough not to pose a bike traffic problem. But envision a future where the bike counts will grow. If it grows maintaining the same lop-sided utilization pattern, what are the city's options to encourage optimal bridge usage? Bike traffic flow control modifications? Generation of more jobs on the east side? More residential zoning near the west end of the bridge? These are complex issues where an urban planner's expertise is needed. Nonetheless, I hope to have convinced you of the importance of going from data (clicks on a counter) to knowledge (patterns of use).
Next, let us look at the non-commuter, recreational, use, assuming that they occur in the weekends. In sharp contrast to the weekday distribution, below we find that the weekend distribution has just one peak.
weekplot(t, onlyweekend=True)
Both the eastbound and westbound lanes seem to find a good amount of use in the weekend. There is, most definitely, a preference for recreational riding in the afternoon. I suppose that is not a major surprise in Portland as afternoons are most often when we are given a reprieve from the battleship gray of the cloud cover.
As you know, on March 18, 2020, in-person instructional activities at all universities in Oregon were suspended, and on March 23 our governor issued the "Stay Home, Save Lives" executive order. Since the Tilikum is near two major universities in Portland, we expect the weekday bike traffic to be impacted by these measures. Let us examine what the data tells us.
weekplot(t.loc[:'2020-03-17'], title='Before social distancing')
weekplot(t.loc['2020-03-17':], title='After social distancing')
Clearly, the strong bimodal distribution has weakened considerably after we all started isolating ourselves. This perhaps comes as no surprise, since both universities on the west side of Tilikum have switched to remote classes. It makes sense that there are fewer westbound commuters in the morning. What about the afternoon peak? One could imagine various explanations for this: people isolating themselves all morning, getting restless in the afternoon, especially with such unusually good weather we were having in April, and deciding to take their bikes out for some fresh air. Whatever be the case, we can summarize our conclusion from the data as follows: social distancing has changed the weekday bike use on Tilikum from a commuter to a recreational pattern.
Of course, we can also compare the overall statistics before and after social distancing, but the results are too blunt to point out differences like the above. From the statistics outputs below, we see that the average number of bikers per quarter-hour in each direction has decreased by about 1:
t.loc[:'2020-03-17'].mean() - t.loc['2020-03-17':].mean()
The data can also tell us the reduction in terms of number of bikers per week, although we should perhaps use it with some caution as not enough weeks have passed after social distancing started to form a robust sample.
t.loc[:'2020-03-17'].resample('W').sum().mean() - t.loc['2020-03-17':].resample('W').sum().mean()
The westbound lane certainly seems to have suffered more reduction in traffic after social distancing, whichever way we slice it.
Although Portland claims to be the first city in the US to adopt the open data program, Seattle's open data program is something to envy. Seattle's Fremont bridge bike counter data, even way back from 2012, is readily available for anyone to download, thanks to their open data program (at the URL below). Let's take a peek at their data.
import os
import shutil
import urllib
url = "https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD"
f = "../../data_external/Fremont_Bridge_Bicycle_Counter.csv"
if not os.path.isdir('../../data_external/'):
os.mkdir('../../data_external/')
if not os.path.exists(f):
with open(f, 'wb') as fo:
r = urllib.request.urlopen(url)
shutil.copyfileobj(r, fo)
sd = pd.read_csv(f)
sd.tail()
Let's do some quick clean up and renaming.
sd = sd.rename(columns={'Date' : 'time',
'Fremont Bridge East Sidewalk' : 'East',
'Fremont Bridge West Sidewalk' : 'West'})
sd.index = pd.to_datetime(sd.loc[:, 'time'])
sd = sd.drop(columns=['time', 'Fremont Bridge Total'])
sd.head()
Note that the Seattle data gives counts per hour, not counts per 15-minutes like the Tilikum data. To compare the general statistics, we should resample the Tilikum to get hourly counts.
th = t.resample('H').sum()
th.describe() # Portland's Tilikum
sd.describe() # Seattle's Fremont
The Tilikum data is spikier than Seattle's Fremont data (compare the max
values in the above outputs), but the average volumes (mean
) are clearly higher in Seattle. That the volume is higher in Seattle in even more clear if we plot weekly counts on both bridges on the same axis.
sw = sd.resample('W').sum()
tw = t.resample('W').sum()
fig, axs = plt.subplots(1, 2, figsize=(13, 3), sharey=True)
plt.subplots_adjust(wspace=0.05)
sw.plot(ax=axs[0], title='Fremont bridge (Seattle) bikes/week');
tw.plot(ax=axs[1], title='Tilikum bridge (Portland) bikes/week');
There is a striking difference in the distribution of the average number of bikes/hour during weekdays on the two bridges.
weekplot(sd, title='Fremont (Seattle) on weekdays (Bikes/hr)')
weekplot(th, title='Tilikum (Portland) on weekdays (Bikes/hr)')
The Fremont bridge has good bike traffic flow in both directions during the peak hours, unlike the Tilikum. We conclude that during peak hours, bikers are distributed more evenly on Seattle's Fremont bridge travel lanes than on Portland's Tilikum.
weekplot(sd['2020-03-17':], title='Fremont (Seattle): Weekdays after social distancing'); plt.ylabel('Bikes/hour');
weekplot(th['2020-03-17':], title='Tilikum (Portland): Weekdays after social distancing'); plt.ylabel('Bikes/hour');
Somewhat remarkably, despite all the above-seen differences, the weekday bike counts of both cities respond to social distancing in quite the same fashion: the bimodal weekday distribution of commuting to work has become a unimodal afternoon recreation pattern.
Author: Jay Gopalakrishnan
License: ©2020. CC-BY-SA
$\ll$Table of Contents