Ford GoBike Data Exploration
This data set includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area in 2019.
There are 183,412 Ford GoBike trips in the cleaned dataset, with 16 columns (duration_sec, start_time, end_time, start_station_id, start_station_name, end_station_id, end_station_name, bike_id, user_type, member_birth_year, member_gender, bike_share_for_all_trip, duration_minutes, start_month, start_weekday, start_hour). Of the 16 columns, 9 are numerical (int or float), 2 are datetime, 4 are object type, and 1 is boolean.
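As a rough sketch of how the four derived columns could be produced from duration_sec and start_time, the snippet below uses pandas on a tiny made-up sample. The numeric month and weekday encodings (Monday = 0) are an assumption, chosen because they are consistent with the 9 numerical columns counted above; the original derivation is not shown, and it may have rounded the minutes.

```python
import pandas as pd

# Tiny made-up sample standing in for the raw trip data (illustrative values only).
trips = pd.DataFrame({
    "duration_sec": [600, 95, 3720],
    "start_time": pd.to_datetime([
        "2019-02-28 17:32:10",
        "2019-02-27 08:05:00",
        "2019-02-23 13:47:30",
    ]),
})

# Derive the helper columns named in the column list.
# Assumption: month and weekday are stored as integers (Monday = 0).
trips["duration_minutes"] = trips["duration_sec"] / 60
trips["start_month"] = trips["start_time"].dt.month
trips["start_weekday"] = trips["start_time"].dt.dayofweek
trips["start_hour"] = trips["start_time"].dt.hour

# Inspect the resulting column types.
print(trips.dtypes)
```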
What are the main feature(s) of interest in the dataset?
I am interested in the features that best predict average bike trip duration. These are age, gender, and user type.
What features in the dataset do you think will help support your investigation into your feature(s) of interest?
I expect trip duration to be determined from the values in the 'duration_minutes' column. The Subscriber and Customer values in 'user_type' will give the number of users of each type.
I'll start by looking at duration_minutes as the main variable of interest, to determine the average bike trip length in minutes.
Average bike trip duration
Duration has a long-tailed distribution. When plotted on a log scale, the duration distribution looks roughly bimodal, with one peak between 8 and 10 minutes. Most bike trips last between 8 and 15 minutes. Few trips lasted less than 3 minutes or more than 40 minutes.
The average bike trip is 12 minutes, with a standard deviation of 29.9 minutes. 25% of trips last 5 minutes or less, 50% last 8 minutes or less, and 75% last 13 minutes or less. The longest trip is 1,424 minutes and the shortest is one minute. With this distribution, I wanted to see how trips were distributed by ride length alone. According to baywheels.com, trips longer than 45 minutes incur an extra $3 for each additional 45 minutes for those with an annual pass. A single ride costs $3 and lasts only 30 minutes. This pricing probably factors into how long a user will ride and may explain why most trips in the data are below 45 minutes.
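To make the quartile reading and the log-scale binning concrete, here is a minimal sketch on made-up durations (not values from the dataset). The bin width of 0.1 in log10 units is an arbitrary assumption; the original notebook's bin settings are not stated.

```python
import numpy as np
import pandas as pd

# Illustrative durations in minutes (made up, long-tailed on purpose).
duration_minutes = pd.Series([1, 3, 5, 6, 8, 9, 11, 13, 20, 45, 1424])

# Quartiles: the 25th percentile is the value that 25% of trips do not exceed.
quartiles = duration_minutes.quantile([0.25, 0.50, 0.75])
print(quartiles)

# Bin edges evenly spaced in log10 units, suitable for a histogram
# drawn on a logarithmic x axis (assumed step: 0.1).
log_min = np.log10(duration_minutes.min())
log_max = np.log10(duration_minutes.max())
bin_edges = 10 ** np.arange(log_min, log_max + 0.1, 0.1)

# Count trips per log-spaced bin.
counts, _ = np.histogram(duration_minutes, bins=bin_edges)
print(counts)
```

Log-spaced bins keep the short trips from being squeezed into a single bar by the few very long ones.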
How many users for the year?
User Type Key:
Customer = 24-hour pass or 3-day pass user
Subscriber = Annual Member
The bar chart shows over 160,000 subscribers and about 20,000 customers. Considering price or value for money, it makes sense that the majority of users subscribe instead of buying just a single or 3-day pass. Subscribers get access to unlimited 45-minute rides, which works out much better than single or 3-day passes.
...or using a pie chart
Most users are subscribers (89.2% of total users), who use the program far more than customers, who represent only 10.8% of total users.
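The counts behind the bar chart and the percentages behind the pie chart can both be derived from the user_type column. A minimal sketch, using a made-up ten-row sample rather than the real data:

```python
import pandas as pd

# Made-up sample: 9 subscriber rows and 1 customer row (not the real split).
user_type = pd.Series(["Subscriber"] * 9 + ["Customer"] * 1, name="user_type")

# Absolute counts per user type (what a bar chart would show).
type_counts = user_type.value_counts()

# Percentage share per user type (what a pie chart would show).
type_shares = user_type.value_counts(normalize=True).mul(100).round(1)

print(type_counts)
print(type_shares)
```

Each row in the dataset is a trip, so these figures count trips made by each user type.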
Daily Ride Usage
The service is most used on Wednesday, with over 35,000 rides in 2019. Usage decreases significantly on weekends, with no activity on Sunday.
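One way to tabulate rides per weekday is sketched below on a handful of made-up timestamps. Reindexing against a full Monday-to-Sunday list is an assumption about how one might order the bars; it also makes a day with no rides appear as an explicit zero instead of dropping out of the result.

```python
import pandas as pd

# Made-up start times (illustrative only).
start_time = pd.Series(pd.to_datetime([
    "2019-02-25 09:00",  # Monday
    "2019-02-27 08:15",  # Wednesday
    "2019-02-27 17:40",  # Wednesday
    "2019-02-28 12:05",  # Thursday
    "2019-02-23 14:30",  # Saturday
]))

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Count rides per weekday name, then force calendar order and fill gaps with 0.
rides_per_weekday = (
    start_time.dt.day_name()
    .value_counts()
    .reindex(weekday_order, fill_value=0)
)
print(rides_per_weekday)
```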
What is the Gender Distribution?
Males use the bike service overwhelmingly more than females and other genders. Over 120,000 males used the service in 2019.
What is the Age Distribution?
Most users are between the ages of 25 and 35. There is a steady decline in usage from age 35 onward.
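The dataset stores member_birth_year rather than age, so age has to be derived. The sketch below assumes age is computed as 2019 minus the birth year, and the ten-year bin edges are hypothetical; neither choice is confirmed by the original analysis.

```python
import pandas as pd

DATA_YEAR = 2019  # assumption: age is measured relative to the data year

# Made-up birth years (illustrative only).
member_birth_year = pd.Series([1984.0, 1993.0, 1975.0, 1990.0, 1958.0])

# Age in years at the time of the ride.
age = DATA_YEAR - member_birth_year

# Hypothetical age bands; intervals are closed on the right, e.g. (25, 35].
age_bins = [15, 25, 35, 45, 55, 65, 100]
age_group = pd.cut(age, bins=age_bins)

# Number of riders in each band, kept in band order.
group_counts = age_group.value_counts(sort=False)
print(group_counts)
```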
Males outpace females and other genders in bike ride usage. Males represent over 120,000 users, with females representing about a third of that amount.
Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?
The trip duration variable took on a large range of values, so I applied a log transform. Under the log transformation, the data looked bimodal, with one peak between 8 and 10 minutes.
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
There were no unusual distributions. Each variable explored showed what you would expect.
To start off with, I want to look at the pairwise correlations present between features in the data.
Trip Duration and Age
There is a negative correlation: trip duration decreases as age increases.
Rides are concentrated among riders between ages 25 and 45, and the plot shows the inverse relationship between age and trip duration.
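The strength of the age and duration relationship can be quantified with a correlation coefficient. The sketch below uses made-up values and reports both Pearson and a rank-based (Spearman) coefficient; using the rank-based version is a suggestion here, since duration is long-tailed, and is not something the original analysis states it did.

```python
import pandas as pd

# Made-up ages and durations (illustrative only).
trips = pd.DataFrame({
    "age": [22, 27, 31, 38, 45, 60],
    "duration_minutes": [18, 14, 12, 11, 9, 7],
})

# Pearson correlation measures linear association.
pearson = trips["age"].corr(trips["duration_minutes"])

# Spearman correlation is Pearson applied to ranks, so it is less
# sensitive to a few extremely long trips.
spearman = trips["age"].rank().corr(trips["duration_minutes"].rank())

print(round(pearson, 3), round(spearman, 3))
```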
Trip Duration and Gender
Trimming the y axis (trip duration) to 50 minutes makes the box plot easier to read. The box plot shows that females and other genders have higher trip durations than males.
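The numbers a box plot draws (the quartiles per group) can be computed directly. The sketch below, on made-up data, also illustrates a design point: limiting the axis to 50 minutes leaves these statistics unchanged, whereas dropping rows above 50 minutes would shift them.

```python
import pandas as pd

# Made-up trips, including one very long male trip (illustrative only).
trips = pd.DataFrame({
    "member_gender": ["Male", "Male", "Male", "Female", "Female", "Other", "Other"],
    "duration_minutes": [6, 8, 300, 9, 12, 10, 14],
})

# Quartiles per gender on the full data: the box edges and the median line.
box_stats = (
    trips.groupby("member_gender")["duration_minutes"]
    .quantile([0.25, 0.50, 0.75])
    .unstack()
)
print(box_stats)

# For contrast: filtering out rows above 50 minutes changes the median.
filtered_median = (
    trips[trips["duration_minutes"] <= 50]
    .groupby("member_gender")["duration_minutes"]
    .median()
)
print(filtered_median)
```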
Trip Duration and User Type
Customers spend more time on a bike trip than subscribers.
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
Age and trip length were inversely correlated. Not surprisingly, trip duration decreased as age increased.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
Comparing user types over trip duration
When looking at the relationship between gender and trip duration, I was surprised to find that males had the lowest trip duration.
I also found that customers spend more time on a bike trip than subscribers. I would like to explore further the difference between customers and subscribers and how it translates to different trip times for each group.
Continuing with the same variables, I want to explore the relationship between age, gender, and trip duration.
Comparing gender types against trip duration, riders aged 20 to 40 take most of the rides. Females and males appear to have similar average ride durations.
Trip duration on a weekday for each gender type
Not surprisingly, trip duration starts trending up toward the weekend, from Thursday to Saturday. Males still have the shortest bike trips.
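A weekday-by-gender comparison like this one can be tabulated with a pivot table. The sketch below uses made-up rows and assumes start_weekday is stored as an integer with Monday = 0; the mean is used as the summary statistic, which is also an assumption.

```python
import pandas as pd

# Made-up trips for two weekdays (0 = Monday, 5 = Saturday; illustrative only).
trips = pd.DataFrame({
    "start_weekday": [0, 0, 0, 5, 5, 5, 5],
    "member_gender": ["Male", "Male", "Female", "Male", "Female", "Other", "Male"],
    "duration_minutes": [8, 10, 11, 12, 16, 18, 14],
})

# Rows: weekday, columns: gender, cells: mean trip duration in minutes.
# Combinations with no trips appear as NaN.
mean_duration = trips.pivot_table(
    index="start_weekday",
    columns="member_gender",
    values="duration_minutes",
    aggfunc="mean",
)
print(mean_duration)
```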