Methodology

All imports used in this analysis are listed below. See requirements.txt for the packages to install.

Setup

import pandas as pd
import numpy as np
import seaborn as sns
import ipywidgets as widgets
import geopandas
import folium
from uszipcode import Zipcode
from uszipcode import SearchEngine
from haversine import haversine
from matplotlib import pyplot as plt
from geopy.geocoders import Nominatim
# from statsmodels.stats.weightstats import ttest_ind
import scipy
from geopy.distance import geodesic

# ignores chaining warning in pandas
import warnings
from pandas.core.common import SettingWithCopyWarning
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

# All functions used in the analysis

def hs(row):
    """Finds the distance between the start station and the end station"""
    loc1 = (row['start_station_latitude'], row['start_station_longitude'])
    loc2 = (row['end_station_latitude'], row['end_station_longitude'])
    return haversine(loc1, loc2, unit='mi')

'''The get_zipcode() can be used to get the zipcode of one location given the latitude and longitude.'''
search = SearchEngine(simple_zipcode=True)

def get_zipcode(lat, lon):
    try:
        result = search.by_coordinates(lat = lat, lng = lon, returns = 1)
        return result[0].zipcode
    except IndexError:
        # no zipcode found near these coordinates
        return np.nan

'''The get_tripzips() can be used to get the start and end zipcode of rides. Due to our use case and the volume of trips, we only needed to match the start station zip code. However, this function may be useful in future analysis.'''
# def get_tripzips(latlonDF, start_lat, start_lon, end_lat, end_lon):
#     #get zipcode for each start station
#     latlonDF['start_zipcode'] = latlonDF.apply(lambda x: get_zipcode(x[start_lat],x[start_lon]), axis=1)
#     #get zipcode for each end station
#     latlonDF['end_zipcode'] = latlonDF.apply(lambda x: get_zipcode(x[end_lat],x[end_lon]), axis=1)
#     return latlonDF

def append_zips(latlonDF, lat, lon):
    #get zipcode for each station; lat and lon are the names of the coordinate columns
    #(argument order is unchanged: the first column name is used as the latitude)
    latlonDF['ZipCode'] = latlonDF.apply(lambda x: get_zipcode(x[lat], x[lon]), axis=1)
    return latlonDF

# Other import to remember global variables:
geolocator = Nominatim(user_agent="NYBikeShare Mapping")
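For readers unfamiliar with the haversine package, the great-circle distance that hs() relies on can be sketched in plain Python. This is an illustrative re-implementation, not the package's own code: the function name haversine_miles and the mean Earth radius of 3958.8 miles are assumptions made for this sketch.

```python
import math

EARTH_RADIUS_MI = 3958.8  # assumed mean Earth radius in miles

def haversine_miles(loc1, loc2):
    """Great-circle distance in miles between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(math.radians, loc1)
    lat2, lon2 = map(math.radians, loc2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula: a is the squared half-chord length between the points
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))
```

A trip that starts and ends at the same station has a distance of exactly 0, which is what the later round-trip calculation depends on.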

Big Query

A collection of our Big Query requests:

SELECT
    usertype,
    count(*) as RideCount,
    AVG(tripduration)/60 as Average_Ride_Time,
    'New York' AS city
FROM `bigquery-public-data.new_york_citibike.citibike_trips`
WHERE starttime > date('2017-01-01')
    and starttime < date('2018-01-01')
GROUP BY usertype

SELECT
    usertype,
    start_station_latitude,
    start_station_longitude,
    end_station_latitude,
    end_station_longitude,
    count(*) AS Count_of_Trips,
    'New York' as city
FROM `bigquery-public-data.new_york_citibike.citibike_trips`
WHERE stoptime > DATE('2017-01-01')
    and stoptime < DATE('2018-01-01')
GROUP BY
    usertype,
    start_station_latitude,
    start_station_longitude,
    end_station_latitude,
    end_station_longitude
ORDER BY Count_of_Trips desc

SELECT
    DATE_TRUNC(start_date, week) AS week,
    subscriber_type,
    COUNT(trip_id) as trip_count,
    avg(duration_sec) as average_duration,
    "San Francisco" as city
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
WHERE EXTRACT(year FROM start_date)=2017
GROUP BY week, subscriber_type

SELECT
    DATE_TRUNC(start_date, week) AS week,
    member_gender as gender,
    member_birth_year as birth_year,
    COUNT(trip_id) as trip_count,
    "San Francisco" as city
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
WHERE EXTRACT(year FROM start_date)=2017
GROUP BY week, member_gender, member_birth_year

SELECT
    duration_sec,
    start_station_id,
    start_station_latitude,
    start_station_longitude,
    end_station_id,
    end_station_latitude,
    end_station_longitude,
    trip_id
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
WHERE EXTRACT(YEAR FROM start_date)=2017

SELECT
    station_id,
    name,
    latitude,
    longitude,
    region_id
FROM `bigquery-public-data.new_york_citibike.citibike_stations`

SELECT
    station_id,
    name,
    lat,
    lon
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info`

# ~10 seconds to run
sfStations['zipcode'] = sfStations.apply(lambda x: get_zipcode(x['lat'], x['lon']), axis = 1)

sfStations.head()

# ~45 seconds to run
# ny lat lons are reversed in the stations table
nyStations['ZipCode'] = nyStations.apply(lambda x: get_zipcode(x['longitude'], x['latitude']), axis = 1)

nyStations.head()
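Swapped coordinate columns like the ones noted for the NY stations table are easy to miss, so a quick sanity check before geocoding can help. The sketch below is an illustration under stated assumptions: the helper name looks_swapped is hypothetical, and the bounding ranges are rough values that only apply to locations in the contiguous United States.

```python
def looks_swapped(lat, lon):
    """Return True if the pair only fits the contiguous US once lat and lon are exchanged.

    Hypothetical helper; the bounds below are rough assumptions, not official limits.
    """
    def in_contiguous_us(la, lo):
        return 24.0 <= la <= 50.0 and -125.0 <= lo <= -66.0

    return not in_contiguous_us(lat, lon) and in_contiguous_us(lon, lat)
```

Applied to a few rows of a stations table, a check like this would flag a pair such as (-73.99, 40.73) while leaving (40.73, -73.99) alone.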

Clean Comparison Dataframes

riders = pd.concat([sfRidersSQL,nyRidersSQL])

freqRiders = pd.concat([sfFreqRides, nyFreqRides]).sort_values('Count_of_Trips', ascending=False)
freqRiders['distance'] = freqRiders.apply(lambda row: hs(row), axis=1)

tripsByWeek = pd.concat([SFsubscribertype,NYsubscribertype])
tripsByWeek['week'] = pd.to_datetime(tripsByWeek['week'], utc=True)

tripDemographicsWeekly = pd.concat([tripDemographicsSF, tripDemographicsNY])
tripDemographicsWeekly['week'] = pd.to_datetime(tripDemographicsWeekly['week'], utc=True)
### clean gender column
tripDemographicsWeekly['gender'] = np.where(tripDemographicsWeekly['gender']=='unknown', 'other', tripDemographicsWeekly['gender'])
tripDemographicsWeekly['gender'] = tripDemographicsWeekly['gender'].str.lower()

returns = freqRiders[freqRiders['distance'] == 0]['distance'].count()/ len(freqRiders) * 100
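If each row of freqRiders is an aggregated route (as the Count_of_Trips column suggests), the percentage above is a share of routes, not of trips. A trip-weighted version could be sketched as follows, using plain dictionaries in place of the dataframe. The function name and the sample rows are made up for illustration.

```python
def round_trip_share(routes):
    """Percentage of trips that start and end at the same station.

    `routes` is a list of dicts with 'distance' (miles) and 'Count_of_Trips'.
    """
    total = sum(r['Count_of_Trips'] for r in routes)
    if total == 0:
        return 0.0
    # A distance of exactly 0 means the start and end stations coincide
    returns = sum(r['Count_of_Trips'] for r in routes if r['distance'] == 0)
    return returns * 100 / total

sample_routes = [  # invented numbers, for illustration only
    {'distance': 0.0, 'Count_of_Trips': 30},
    {'distance': 1.2, 'Count_of_Trips': 50},
    {'distance': 0.7, 'Count_of_Trips': 20},
]
print(round_trip_share(sample_routes))  # 30.0
```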

Analysis

New York Survey Data

nySurvey = pd.read_csv('/work/SanFranciscoBikeShare/Citywide_Mobility_Survey_-_Main_Survey_2017.csv')
citiBikeCols = [col for col in nySurvey.columns if 'citi' in col or 'CITI' in col]
nySurveyFilt = nySurvey[citiBikeCols]

Figure 1.1

Figure 1.2

Figure 1.3

The largest share of participants do not bike. It is possible that these participants rarely use public transportation under any circumstances. Individuals classified as WEIRD (western, educated, industrialized, rich, and democratic) might be less inclined to use the programs available in their neighborhood. Interestingly, the second-largest group stated that the program does not reach their neighborhood. The next step in this analysis connects the survey responses of "Not in my neighborhood" to station areas in the NY bike-share data via the stations table.

# Find all zipcodes from new york survey and compile those into a list then a df to merge.
noCitiBikeZips = list(nySurvey[nySurvey['qNOCITIBIKE2']=="1"].qzipcmb.unique())
noCitiBikeZips = pd.DataFrame(noCitiBikeZips, columns=['ZipCodes'])

# Find all the zipcodes from New York stations and store in the ZipCode col. Takes 1.4 minutes to run.
# nyStations = append_zips(nyStations, 'latitude', 'longitude')

nyStations.head()

Figure 1.4

While the initial finding that the bike share program was not in respondents' neighborhoods was promising, these participants often do have a few stations nearby. These stations might not be accessible because of heavy usage or a lack of available bikes. Because the stations table in the NY Citi Bike Big Query dataset doesn't include the number of bikes used per day at each station, our analysis with the current zip codes ends here.
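The comparison described above, checking whether zip codes reported as lacking Citi Bike actually contain stations, can be sketched with plain Python collections. The zip codes and station records below are invented for illustration, and the function name stations_per_zip is hypothetical. The analysis itself worked on the nySurvey and nyStations dataframes.

```python
from collections import Counter

def stations_per_zip(survey_zips, station_zips):
    """Count the stations found in each zip code flagged by survey respondents."""
    # Skip stations whose zip code lookup failed (represented here as None)
    counts = Counter(z for z in station_zips if z is not None)
    return {z: counts.get(z, 0) for z in sorted(set(survey_zips))}

survey_zips = ['10001', '11201', '10451']          # invented survey zip codes
station_zips = ['10001', '10001', '11201', None]   # invented station zip codes
print(stations_per_zip(survey_zips, station_zips))
# {'10001': 2, '10451': 0, '11201': 1}
```

A count above zero for a flagged zip code corresponds to the situation described here: respondents report no program in their neighborhood even though stations exist nearby.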

San Francisco Survey Data

sf_survey_data = pd.read_csv('LatestData/SF_survey.csv')

# how many respondents traveled by bike on a recent trip from time of survey?
print(len(sf_survey_data[sf_survey_data['BIKE'] > 0]))

Only 19 of the survey's 804 respondents indicated that they used a bike as a mode of transportation.

# how many respondents had a bike trip more than one day?
print(len(sf_survey_data[sf_survey_data['BIKE2'] > 0]))

Only 1 respondent rode a bike on more than one of the days the survey asked about.

Figure 2.1

Figure 2.2

Figure 2.3

Figure 2.4

Figure 2.5

There aren't any large differences between the purposes of bike trips and of trips overall, except that no bike trips were for traveling to school. Judging by the demographics of the respondents, the survey may not be representative of the overall population of San Francisco. Most respondents have an income between $100k and $200k, so it makes sense that more trips would be by other types of transportation.

Big Query Comparison Visualizations

Figure 3.1

Figure 3.2

There are significantly more subscribers than customers in both cities.

Figure 3.3

In both cities, customers take marginally longer rides than subscribers. Subscriber ride time remains fairly consistent regardless of the month, but customer ride time is more variable.

Figure 3.4

San Francisco bike share use seems to ramp up in the fall, while New York seems more consistent, with only a small increase in the fall. Both cities see a pretty dramatic drop in use in December.

Figure 3.5

Most trips for both New York and San Francisco are taken by riders between ages 20 and 40. In both datasets, there are some outliers that don't make sense, like riders over 100 years old.
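One simple way to handle such outliers is to drop records whose implied age falls outside a plausible range before plotting. The sketch below is an assumption, not the cleaning step used in this analysis: the 16 to 90 cutoffs, the 2017 reference year (taken from the date filters in the queries), and the function name are illustrative choices.

```python
def plausible_ages(birth_years, ride_year=2017, min_age=16, max_age=90):
    """Convert birth years to ages, keeping only ages within [min_age, max_age]."""
    ages = []
    for year in birth_years:
        if year is None:
            continue  # missing birth year
        age = ride_year - year
        if min_age <= age <= max_age:
            ages.append(age)
    return ages
```

For example, a birth year of 1885 implies an age of 132 in 2017 and would be excluded, while 1990 (age 27) would be kept.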

San Francisco Geospatial Visualizations

Figure 4.1.1

Figure 4.1.2

There are longer rides towards the outskirts of the city. In general, the central areas see rides mostly under 30 minutes.

Figure 4.1.3

Many of the popular routes seem to be near water or in the city center. There also seem to be 'nodes' near popular landmarks where riders either pick up or drop off bikes.

New York Geospatial Visualizations

Figure 4.2.1

Figure 4.2.2

As with San Francisco, New York seems to have mostly shorter rides (under 1 hour). The range of average ride length is much larger for New York. The longer rides are a bit further outside of the city.

Figure 4.2.3

Many of the most popular routes are again near water, as with San Francisco. New York also sees many rides through Central Park.

Significance Testing

We conducted several significance tests to assess the differences in riders between cities. The results of the significance tests tell us that:
(1) the average ride duration of female riders in SF is less than that of female riders in NY
(2) the average ride duration of male riders in SF is less than that of male riders in NY
(3) the average ride distance of female riders in SF is less than that of female riders in NY
(4) the average ride distance of male riders in SF is less than that of male riders in NY
(5) the proportion of female riders in SF is less than the proportion of female riders in NY
(6) there are no significant differences in the proportions of any generation between cities.

All significance testing results can be found in the notebook: https://github.com/kendalldyke14/SanFranciscoBikeShare/blob/main/Sig_tests_bike_share.ipynb. Due to memory issues in the DeepNote environment, many of the tests involving the NYC data could not be run in this DeepNote notebook and were run locally.
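For readers who want a sense of what a test like (5) involves, a two-proportion z-test can be sketched with the standard library alone. This is an illustration under assumptions, not the procedure from the linked notebook: the choice of test, the function name, and any counts passed to it are hypothetical.

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2, using the pooled proportion.

    x1, x2 are success counts (e.g. female riders); n1, n2 are group sizes.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A negative z with a small p-value would be consistent with the first group having the lower proportion. For a one-sided claim such as "less than", the one-sided p-value is half the two-sided value when z has the expected sign.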

Discussion

Our exploratory data analysis incorporated additional transportation data to understand rider sentiment. While the surveys initially seemed promising, we found that only 9 participants in the San Francisco survey and 259 participants in the New York survey were actively using the public bike share programs.

Looking forward, this team plans to expand this analysis to national and international programs. Though some additional cities can be incorporated from Big Query, many more bike share programs don't use this free service. Their data is instead available through a smaller API called pyBikes, which is supported by Bike-share research (bikeshare-research.org) and has top-level daily ride numbers for large cities. Though some of our analysis focuses on the difference between subscriber and customer behavior, we would still be able to compare programs thanks to some of our time series analysis techniques.
