%run -i appendix.ipynb

df = data_acquisition_preparation("Trips_2018.csv")

distributions_visualisation(df)

# Remove outliers: trips longer than 120 minutes and riders born before 1948
df = df.drop(df.index[df.tripduration_min > 120])
df = df.drop(df.index[df.birth_year < 1948])

trip_duration_analysis(df)

distribution_of_rides(df)

most_popular_trips(df)

predict_df = data_preparation(df)

train, test = split_data(predict_df,'2018-04-01', '2018-05-31', '2018-06-01', '2018-06-07')
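The split above is chronological: the test window starts right after the training window ends, so the model is never evaluated on data that precedes its training period. The helper `split_data` is defined in the appendix and not shown here. The following is only a minimal sketch of the idea, assuming inclusive date bounds and using a hypothetical function `split_by_date` on toy rows (the pickup counts are made up for illustration).

```python
from datetime import date, datetime


def split_by_date(rows, train_start, train_end, test_start, test_end, key="date"):
    """Hypothetical stand-in for split_data: keep rows whose date falls
    inside the given train and test windows (both bounds inclusive)."""
    def parse(text):
        return datetime.strptime(text, "%Y-%m-%d").date()

    train_lo, train_hi = parse(train_start), parse(train_end)
    test_lo, test_hi = parse(test_start), parse(test_end)
    train = [r for r in rows if train_lo <= r[key] <= train_hi]
    test = [r for r in rows if test_lo <= r[key] <= test_hi]
    return train, test


# Toy rows, not taken from Trips_2018.csv
rows = [
    {"date": date(2018, 3, 31), "pickups": 120},
    {"date": date(2018, 4, 1), "pickups": 150},
    {"date": date(2018, 5, 31), "pickups": 310},
    {"date": date(2018, 6, 1), "pickups": 290},
    {"date": date(2018, 6, 8), "pickups": 305},
]

train_rows, test_rows = split_by_date(
    rows, "2018-04-01", "2018-05-31", "2018-06-01", "2018-06-07"
)
```

Rows outside both windows (here 31 March and 8 June) are simply dropped.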

X_train, y_train, X_test, y_test = get_target_features(train,test)

# Define the models used to predict the pickups
models = [
    LinearRegression(fit_intercept=False),
    SVR(gamma='auto', kernel='linear'),
    RandomForestRegressor(random_state=0, n_estimators=300),
]

compare_models(X_train, y_train, X_test, y_test, models)

print_scatterplots(X_train, y_train, X_test, y_test, models)

polynomial_regression(X_train, y_train, X_test, y_test)

train2, test2 = split_data(predict_df,'2018-01-01', '2018-10-31', '2018-11-01', '2018-12-31')

X_train2, y_train2, X_test2, y_test2 = get_target_features(train2,test2)

# The models are the same as in part 1
compare_models(X_train2, y_train2, X_test2, y_test2, models)

print_scatterplots(X_train2, y_train2, X_test2, y_test2, models)

polynomial_regression(X_train2, y_train2, X_test2, y_test2)

weather = get_weather_festivity_data('NYCweather2018.xlsx')

reg_data = data_preparation_with_weather_festivity(df, weather)

train, test = split_data(reg_data,'2018-01-01', '2018-10-31', '2018-11-01', '2018-12-31')

X_train, y_train, X_test, y_test = get_target_features(train,test)

compare_models(X_train, y_train, X_test, y_test, models)

polynomial_regression(X_train, y_train, X_test, y_test)

Another interesting aspect for the bike-sharing operator is whether a pickup results in a long or a short rental. This information serves two business purposes. First, combined with the number of pickups per hour, it helps predict how many bikes will be available in upcoming time slots. Second, if the trip duration is classified correctly at pickup, the system can make the customer a personalized offer. The objective of our classification is therefore to predict, at pickup, whether the ride will be longer or shorter than 15 minutes. We used the following features: start_station_id, start_station_latitude, start_station_longitude, usertype, birth_year, gender, weekday, month, hour. The target variable is based on the trip duration in minutes: trips under 15 minutes are assigned to class 1 and all others to class 2. We ran a Logistic Regression and an ANN on our data; the Logistic Regression performed better, and its results are shown below.

features = ['start_station_id', 'start_station_latitude', 'start_station_longitude',
            'usertype', 'birth_year', 'gender', 'hour', 'weekday', 'month']
X = get_features(df, features)

# Select the range of minutes that defines the classes
bins = [0, 15, max(df["tripduration_min"])]  # set threshold
labels = [1, 2]

y = get_target(df, bins, labels)
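The helper `get_target` lives in the appendix, so how it maps durations to classes is not visible here. As a rough sketch, assuming it bins durations into right-closed intervals (the default convention of `pandas.cut`), the mapping from `bins` and `labels` to a class could look like the hypothetical function `assign_class` below. The upper bound of 120 minutes is an illustrative stand-in for `max(df["tripduration_min"])`.

```python
from bisect import bisect_left


def assign_class(duration_min, bins, labels):
    """Hypothetical sketch: return the label of the right-closed interval
    (bins[i], bins[i + 1]] that contains duration_min, or None if the
    duration falls outside all intervals."""
    if duration_min <= bins[0] or duration_min > bins[-1]:
        return None
    idx = bisect_left(bins, duration_min) - 1
    return labels[idx]


bins = [0, 15, 120]   # 120 is an assumed maximum trip duration
labels = [1, 2]       # 1 = short ride, 2 = long ride

classes = [assign_class(d, bins, labels) for d in (5, 15, 15.5, 120)]
```

Under this assumed convention a trip of exactly 15 minutes falls into class 1, so whether the boundary value counts as short or long depends on how `get_target` is actually implemented.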

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=2/3, shuffle=True)

classification_logistic_regression(x_train,x_test,y_train,y_test, features)

The confusion matrix suggests that the classifier does not separate the two classes well. The classes are clearly imbalanced: around 68.6% of rides are shorter than 15 minutes. Thus, even predicting every observation as a short trip would yield an accuracy of 68.6%. The F1 score is a better measure of model performance in our case because it combines precision and recall and therefore reflects how well each class is predicted. Since the F1 score is low (< 0.4), our classification is not very useful from a business point of view.
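To make the point about imbalance concrete, the sketch below evaluates a trivial classifier that always predicts the short class on a synthetic sample with the same 68.6% share of short rides. The sample, the helper names `accuracy` and `f1_for_class`, and the per-class view of F1 are illustrative assumptions. This is not the evaluation code from the appendix, and the averaging used for the F1 score reported above may differ.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true class."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """F1 score (harmonic mean of precision and recall) for one class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Synthetic sample: 686 short rides (class 1), 314 long rides (class 2)
y_true = [1] * 686 + [2] * 314
y_pred = [1] * len(y_true)  # baseline: always predict "short"

baseline_accuracy = accuracy(y_true, y_pred)   # 0.686
f1_short = f1_for_class(y_true, y_pred, 1)
f1_long = f1_for_class(y_true, y_pred, 2)      # 0.0, no long ride is found
```

The baseline reaches 68.6% accuracy without identifying a single long ride, which is exactly what the F1 score for the long class exposes.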

stations, total_pickups = data_preparation_cluster(df)

df_clusters, k = create_clusters(stations, 5, total_pickups)

plot_clusters_distribution(df_clusters, k, 5 )

plot_stations(df_clusters, df, 5)

Conclusion

The size of the dataset makes it a good basis for an insightful analysis. Using polynomial regression, we could predict the pickups per hour with satisfactory results. As seen in the exploratory part, adding further data such as the weather forecast improves the prediction. For actual use in a business environment, we would recommend including detailed weather data for the different stations, information about tourist flows in the city, and data about the number of people working on a given day.

To create real value for the bike-sharing operator, it is necessary to accurately predict both the pickups and the dropoffs per station, so that the number of available bikes at each station can be adjusted throughout the day and empty stations are avoided. Bringing bikes to the right locations will increase revenue and avoid unhappy customers. This could be achieved by giving free riding time to people who drop off their bikes at a station that is often empty.

The results of the trip duration classification are poor, but the coefficients of the Logistic Regression show that the feature subscriber has the greatest influence on the predicted class. Female subscribers are most likely to do long bike rides, so it could make sense to use this in the calculation of available bikes and to make this group a personalized offer to win more customers. To improve the prediction, it would be useful to collect individual data for every subscriber in the database, so that a long ride can be predicted from the user's previous rides.