import calendar
import datetime
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import scipy.stats as stats
from scipy.stats import ttest_ind, linregress
import seaborn as sns
import shap
from sklearn.cluster import KMeans
from sklearn.compose import make_column_transformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import statsmodels.api as sm
from statsmodels.formula.api import ols
import xgboost
from xgboost import XGBRegressor, plot_importance
import warnings
warnings.filterwarnings('ignore')

The bike sharing dataset.

(yeah baby)

Having done mostly simple ML projects lately, I thought I'd brush up on some DS skills. So, as more of a personal note, this is one of those projects where variables don't operate alone. For example, the best time for renting a bike could be 2pm, but the day, month and season (plus possibly a host of other variables) will likely have to be factored in. That makes this more than a simple "cut through the noise and find the data points" ML project: there will be some complex(ish) relationships in the data. For that reason I will use XGBoost for the final model. This kills two birds with one stone because I will want to use SHAP analysis at the end of the project, and XGB likes SHAP (or vice versa). There is no introduction; this is a basic dataset downloaded from one of the usual data repos, so there was no need to build a dataframe or scrape the webz. I am simply on holiday and wanted some BI action.

df = pd.read_csv("/work/hour.csv")

Let's be lazy...

And, in case anyone out there doesn't already know, I'll ask AI why XGB works better for this type of data as opposed to my usual weapon of choice, LightGBM:

XGBoost is often considered favorable for datasets with complex relationships and interactions between features because of the following key reasons:

1. **Robust Handling of Nonlinear Relationships**: XGBoost relies on boosting decision trees, which are inherently good at capturing nonlinear interactions between features. This makes XGBoost highly effective for datasets like yours where relationships (e.g., time, day, weather, etc.) are not simply linear but hierarchical or interdependent.
2. **Built-in Regularization**: XGBoost incorporates L1 (Lasso) and L2 (Ridge) regularization to prevent overfitting. This can make it more robust to noise in a complex dataset compared to LightGBM.
3. **Feature Importances & Explainers**: XGBoost is particularly compatible with SHAP (SHapley Additive exPlanations), making it easier to explain the contribution of each feature, especially useful when analyzing features' impact in scenarios like bike rentals, where temporal and environmental aspects drive variability.
4. **Customized Loss Functions**: XGBoost allows for easier implementation of custom loss functions compared to LightGBM, helpful when dealing with specific business metrics.
5. **Handling Sparsity**: XGBoost has automated handling of missing values and sparse data structures. When features are highly sparse, due to one-hot encoding or otherwise, it can efficiently split around the missing or zero data.
6. **Tree Pruning**: XGBoost uses a "max depth" and other regularization parameters, combined with a "pruning" technique based on anticipated gain, ensuring trees do not over-split during the boosting process. This is crucial when relationships between variables are multilevel or compounding.

### Comparison to LightGBM:

LightGBM, while fast (especially on large datasets), optimizes for speed and memory usage with histogram-based learning. While it also models complex relationships well, its search techniques, like leaf-wise splits, can sometimes lead to tricky overfitting if proper tuning isn't applied. This is particularly risky when the dataset is relatively small or has highly nuanced interactions.

Ultimately, both are strong libraries, but for nuanced problems like this one, XGBoost's interpretability tools, robust handling of sparsity, and regularization options might better cater to understanding complex driver relationships.

.... AI is definitely American.

Also: "Something something, AI hallucinations", but that's a conversation for another time. Anyhooo:

# len(df) counts rows; the number of features is the column count.
print(f"The dataset consists of {len(df)} rows and {df.shape[1]} features.")

df.head()

df.rename(columns={'dteday': 'date', 'yr': 'year', 'mnth': 'month', 'hr': 'hour', 'workingday': 'work_day', 'weathersit': 'weather_sit', 'hum': 'humidity', 'windspeed': 'wind_speed', 'cnt': 'rental_count'}, inplace=True)

df.info()

Creating some new date-time features including weekday and week_num:

df['weekday'] = df.date.apply(lambda dateString: calendar.day_name[datetime.datetime.strptime(dateString,"%Y-%m-%d").weekday()])

df['week_num'] = df.date.apply(lambda dateString: datetime.datetime.strptime(dateString,"%Y-%m-%d").isocalendar()[1])
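The same two columns can also be built without a per-row lambda. This is a minimal sketch, assuming the dates are ISO-formatted strings as in the calls above; the three-row frame is a hypothetical stand-in for the real data. It also shows a quirk worth knowing: ISO week numbers for the first days of January can belong to the final week of the previous year.

```python
import pandas as pd

# Hypothetical stand-in for the 'date' column of the real dataframe.
demo = pd.DataFrame({"date": ["2011-01-01", "2011-01-03", "2012-07-04"]})

# Parse once, then use the vectorised .dt accessors.
parsed = pd.to_datetime(demo["date"], format="%Y-%m-%d")
demo["weekday"] = parsed.dt.day_name()
demo["week_num"] = parsed.dt.isocalendar().week.astype(int)

print(demo)
```

Here 1 January 2011 comes back as ISO week 52, so a week_num feature does not increase monotonically within a calendar year.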

Inserting 'week_num' at a preferred location in the dataframe:

df.insert(6, 'week_num', df.pop('week_num'))

df = df.drop(['instant', 'date'], axis=1)

df.head()

Of the four seasons, we see a mean of 2.5 which tells us that one or two of the later seasons are the most popular for rentals by a small margin. Of the two years in the dataset, the mean is slightly leaning toward year #1. Of the months, once again because season, the mean is leaning a little into the later months, but not by much. Weather_sit sees a low-ish average which should signify a trend towards finer weather (with weather_sit 1 being the clearest of the 4 weather conditions). The average casual user figure is around 36 and the average registered user figure is around 154:

df.describe()

Analysis.

Correlations.

The strongest-correlated data to casual users would be the temperature features, followed by the hourly features. Humidity and work_day are the most negatively correlated features for casual users, a strong indicator that a rise in humidity will have an adverse effect on casual bike rentals. And rightly so - why sweat when you don't have to?

The strongest correlations to registered users are similar, but with the hourly variables featuring a tad higher than the temperatures. Work_day sees a higher correlation to the registered users, possibly due to ride sharing during the commute. Humidity doesn't have as much of a negative correlation to registered users, something which may be another indicator of registered users riding the bikes during the commute (get pedalling, your boss doesn't care if you sweat, lolz).

What I want to do here is look at what constitutes the best conditions for casual use, because casual users could be future registered users. Although these casual users might simply prefer to keep their bike rental casual (weekend use etc.) rather than register, it will be interesting to see where the differences lie, if nothing else.

df_2 = df.select_dtypes(include=[np.number])

First things first - binning some data. Creating new columns for the average daylight hours in Washington, D.C., the city where this data was collected. This is applied separately to each season:

df['s1_daylight_hrs'] = df.apply(lambda x: 1 if (x['hour'] > 7 and x['hour'] < 19 and x['season'] == 1) else 0, axis=1)
df['s2_daylight_hrs'] = df.apply(lambda x: 1 if (x['hour'] > 6 and x['hour'] < 20 and x['season'] == 2) else 0, axis=1)
df['s3_daylight_hrs'] = df.apply(lambda x: 1 if (x['hour'] > 5 and x['hour'] < 21 and x['season'] == 3) else 0, axis=1)
df['s4_daylight_hrs'] = df.apply(lambda x: 1 if (x['hour'] > 7 and x['hour'] < 19 and x['season'] == 4) else 0, axis=1)

As well as creating new binary values for daytime features:

# between(23, 2) can never match because 23 > 2, so midnight covers hours 0-2 instead
# (hour 23 already falls into late_evening below).
df['midnight'] = np.where(df['hour'].between(0, 2), 1, 0)
df['early_morning'] = np.where(df['hour'].between(2, 6, inclusive='right'), 1, 0)
df['morning'] = np.where(df['hour'].between(6, 9, inclusive='right'), 1, 0)
df['late_morning'] = np.where(df['hour'].between(9, 12, inclusive='right'), 1, 0)
df['afternoon'] = np.where(df['hour'].between(12, 16, inclusive='right'), 1, 0)
df['late_afternoon'] = np.where(df['hour'].between(16, 17, inclusive='right'), 1, 0)
df['early_evening'] = np.where(df['hour'].between(17, 19, inclusive='right'), 1, 0)
df['evening'] = np.where(df['hour'].between(19, 21, inclusive='right'), 1, 0)
df['late_evening'] = np.where(df['hour'].between(21, 23, inclusive='right'), 1, 0)
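As an alternative to nine separate flag columns, the same right-closed bins could come from a single pd.cut call. This is only a sketch on a hypothetical 24-row frame, and it assumes hours 0 to 2 are meant to land in midnight while hour 23 stays in late_evening:

```python
import pandas as pd

# Hypothetical stand-in: one row per hour of the day.
demo = pd.DataFrame({"hour": range(24)})

# Right-closed edges: (-1, 2] -> midnight, (2, 6] -> early_morning, and so on.
edges = [-1, 2, 6, 9, 12, 16, 17, 19, 21, 23]
labels = ["midnight", "early_morning", "morning", "late_morning", "afternoon",
          "late_afternoon", "early_evening", "evening", "late_evening"]
demo["time_of_day"] = pd.cut(demo["hour"], bins=edges, labels=labels)

# One 0/1 column per bin, matching the flag-per-bin layout.
flags = pd.get_dummies(demo["time_of_day"]).astype(int)
print(flags.sum())
```

Because the edges are shared, every hour falls into exactly one bin, which is easy to verify by summing the flags across each row.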

Casual rentals by season.

With the sum total of use for both registered and casual use visualised, we see the figures for registered use in the colder months outweigh that of the casual use, picking up for the registered users in seasons 3 and 4. Season 2 sees around a 20% hike in casual use over registered use for the same season:

season_registered = df.groupby(['season'])['registered'].sum()
season_casual = df.groupby(['season'])['casual'].sum()
# Convert the totals to each group's share of its own yearly total.
season_registered = season_registered / season_registered.sum()
season_casual = season_casual / season_casual.sum()
season_registered = season_registered.reset_index()
season_casual = season_casual.reset_index()

season_registered

season_casual

Casual rentals by season and year.

The resulting data of the casual users grouped by season and year. As above, we see the 3rd season in year 1 holding the max sum and average values, followed by the 2nd season also in year 1:

df_casual_avg = df.groupby(['season', 'year']).agg({'casual': 'mean'}).reset_index()
df_casual_avg.rename(columns={'casual': 'casual_avg'}, inplace=True)
df_casual_avg

Casual rentals by hour and month.

There is more visible use after office hours than I initially expected. We see a good spread of use between the hours of 0800 and 1700 for most months except November through to February, where, if we think back, the registered users take the lion's share of the data.

Creating a pivot table to get a better insight into the above data, with the addition of weekday to help return some more helpful information from the time features:

df_casual_hour_month_weekday = df.groupby(['hour', 'month', 'weekday']).agg({'casual': 'sum'}).reset_index()
df_casual_hour_month_weekday.rename(columns={'casual': 'casual_sum'}, inplace=True)
df_casual_hour_month_weekday

Sorting by value:

df_casual_hour_month_weekday = df_casual_hour_month_weekday.sort_values(by='casual_sum', ascending=False)
df_casual_hour_month_weekday

Returning the three most valuable months:

Along with the most valuable times, which equate to:

• May (Month 5): 1 PM (1867 rentals) on Sunday.
• May (Month 5): 3 PM (1780 rentals) on Saturday.
• May (Month 5): 3 PM (1733 rentals) on Sunday.
• June (Month 6): 3 PM (1776 rentals) on Saturday.
• June (Month 6): 1 PM (1752 rentals) on Saturday.
• June (Month 6): 2 PM (1744 rentals) on Saturday.
• July (Month 7): 1 PM (1638 rentals) on Sunday.
• July (Month 7): 2 PM (1548 rentals) on Sunday.
• July (Month 7): 3 PM (1456 rentals) on Saturday.
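The cell that produced this list isn't reproduced here, so the following is only a sketch of one way to get "top months, then top slots within each month". The toy rows and the names top_months and slots are hypothetical:

```python
import pandas as pd

# Hypothetical toy rows standing in for the real hourly data.
demo = pd.DataFrame({
    "month":   [5, 5, 5, 6, 6, 7, 7, 1],
    "weekday": ["Sunday", "Saturday", "Sunday", "Saturday",
                "Saturday", "Sunday", "Sunday", "Monday"],
    "hour":    [13, 15, 13, 15, 13, 13, 14, 8],
    "casual":  [90, 80, 70, 85, 75, 60, 55, 5],
})

# Step 1: the three months with the largest casual totals.
top_months = demo.groupby("month")["casual"].sum().nlargest(3).index

# Step 2: within those months, total by (month, hour, weekday)
# and keep the three busiest slots per month.
slots = (demo[demo["month"].isin(top_months)]
         .groupby(["month", "hour", "weekday"], as_index=False)["casual"].sum()
         .sort_values(["month", "casual"], ascending=[True, False])
         .groupby("month")
         .head(3))
print(slots)
```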

Casual rentals by holiday.

Casual rentals aren't too popular during the holidays, with only 3.61% of total casual use residing in these days:

Casual rentals by work day.

A shade over half (51.1%) of the casual rentals are not on workdays. That leaves 48.9% of rentals being spread out over five workdays, so let's take a look at casual use by weekday to make sure most of the use is indeed on the weekends...

Saturday and Sunday see the highest amount of casual use. But as a sum of use vs. the combined weekdays....

df_weekday_casual = df.groupby(['weekday']).agg({'casual': 'sum'}).reset_index()
sat_sun_casual_sum = df_weekday_casual[df_weekday_casual['weekday'].isin(['Saturday', 'Sunday'])]['casual'].sum()
weekdays_casual_sum = df_weekday_casual[~df_weekday_casual['weekday'].isin(['Saturday', 'Sunday'])]['casual'].sum()
percentage_difference = ((sat_sun_casual_sum - weekdays_casual_sum) / weekdays_casual_sum) * 100
sat_sun_casual_sum, weekdays_casual_sum, percentage_difference

The total casual rentals for Saturday and Sunday combined are 294,373, while for Monday to Friday combined the figure is 325,644. This is a -9.60% difference, meaning the two-day weekend total is slightly lower than the five-day weekday total. On a day-by-day basis, however, Saturday and Sunday are head-and-shoulders above any other day.
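The arithmetic behind that comparison, worked through with the two totals quoted above (the per-day averages simply divide by 2 weekend days and 5 weekdays):

```python
# Figures quoted in the text above.
weekend_total = 294_373   # Saturday + Sunday
weekday_total = 325_644   # Monday to Friday

pct_diff = (weekend_total - weekday_total) / weekday_total * 100
weekend_share = weekend_total / (weekend_total + weekday_total) * 100

# Average per day type: 2 weekend days vs. 5 weekdays.
per_weekend_day = weekend_total / 2
per_weekday = weekday_total / 5

print(f"difference: {pct_diff:.2f}%")
print(f"weekend share: {weekend_share:.1f}%")
print(f"per-day ratio: {per_weekend_day / per_weekday:.2f}x")
```

Per day, a weekend day carries a little over twice the casual rentals of an average weekday, which is why the weekend still dominates despite the lower combined total.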

Casual rentals by weather, humidity and temperature.

Casual usage seems to rise in unison with outdoor ambient temperatures and weather situation 1 (clear, sunny or slightly cloudy & dry weather conditions). A lot of the rentals - over 200 - appear to depend solely on weather situation 1, seeing a drop in rentals in humid weather conditions greater than 0.7:

Casual rentals by temperature and wind speed.

And casual rentals decline as wind speeds rise:

Casual vs. registered rentals per temperature value.

And here is the clearer picture, with only 12% of the 86K sum of registered users (in blue) being casual use in the lower temperature scale which could be a reflection of registered users renting the bikes in the winter months. Rentals in ambient temperatures above 0.5 follow an almost identical trend:

Casual vs. registered rentals per humidity value.

Humidity follows an interesting pattern for casual and registered users, with casual users more likely to rent in low humidity conditions and registered users more likely to rent in the higher humidity scale:

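The T-statistics and the linear regression slope / intercept back this data up, with a high positive T for temperature and a negative T for humidity. The LR slope signifies an increase of 117.69 rentals per unit increase in temperature, and a decrease of 88.69 rentals per unit increase in humidity (both features sit on a normalised scale in this dataset, so one unit spans roughly the whole range):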

• Temperature T-Stat: 54.22
• Temperature Linear Regression Slope: 117.69
• Temperature Linear Regression Intercept: -22.81

• Humidity T-Stat: -39.74
• Humidity Linear Regression Slope: -88.69
• Humidity Linear Regression Intercept: 91.30
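The cell that produced these figures isn't shown, so the exact T-stat calculation is unknown. As one common reading, here is a sketch using scipy.stats.linregress on synthetic stand-in data, with the t-statistic taken as the slope divided by its standard error (an assumption, not necessarily what was run for the numbers above):

```python
import numpy as np
from scipy.stats import linregress

# Synthetic stand-in data: rentals rise with a normalised temperature in [0, 1].
rng = np.random.default_rng(0)
temp = rng.uniform(0, 1, 500)
rentals = 100 * temp - 20 + rng.normal(0, 10, 500)

fit = linregress(temp, rentals)

# t-statistic of the slope: the estimate divided by its standard error.
t_stat = fit.slope / fit.stderr

print(f"slope={fit.slope:.2f}, intercept={fit.intercept:.2f}, t={t_stat:.2f}")
```

A negative relationship, like the humidity one, would show up as a negative slope and therefore a negative t-statistic.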

Variation in registered rentals using ANOVA.

The one-way ANOVA results show a significant variation in registered rentals across different seasons (p-value < 0.05). The F-stat is very high (298.97), with large differences in means among the season groups:

model = ols('registered ~ C(season)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=1)
anova_table

Further analysis of the ANOVA results.

# Row 0 is the C(season) term and row 1 is the residual; .iloc keeps the
# lookup positional, since the table is indexed by term name.
anova_f_stat = anova_table['F'].iloc[0]
anova_p_value = anova_table['PR(>F)'].iloc[0]
significant = anova_p_value < 0.05
ss_between = anova_table['sum_sq'].iloc[0]
ss_within = anova_table['sum_sq'].iloc[1]
total_variance = ss_between + ss_within
eta_squared = ss_between / total_variance
{
    'anova_f_stat': anova_f_stat,
    'anova_p_value': anova_p_value,
    'significant': significant,
    'eta_squared': eta_squared
}

• F-Stat (298.97): A strong variance between seasonal data.

• P-Value (2.93e-189): This (extremely) low value means the results are statistically significant.

• Significance: Variation in registered users across seasons is statistically significant.

• Effect Size (Eta-Squared, 4.91%): While the result is significant, only 4.91% of the variation in registered rentals can be attributed to differences between seasons, which makes sense due to the registered users primarily seeing action in the winter months vs. the casual users. Winter will not be the reason that registered users pick up a bike, however, it will be the reason for a casual user to not pick up a bike.
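As a sanity check, eta-squared can be recovered from the F-statistic and the degrees of freedom alone. The sketch below assumes four season groups and the 17,379 rows of the standard hour.csv; the row count is my assumption, since it isn't printed above.

```python
# F-statistic reported by the ANOVA above.
f_stat = 298.97

# Assumptions: 4 season groups and the 17,379 rows of the standard hour.csv.
n_groups = 4
n_rows = 17_379
df_between = n_groups - 1
df_within = n_rows - n_groups

# eta^2 = SS_between / (SS_between + SS_within), rewritten in terms of F.
eta_squared = (f_stat * df_between) / (f_stat * df_between + df_within)
print(f"eta squared: {eta_squared:.4f}")
```

Under those assumptions the result lands on the same 4.91%, so the very large F is mostly a product of the sample size, not of a large seasonal effect.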

PCA / Clustering.

One way to look at the overlap between the casual and registered users is clustering combined with PCA (using 2 components for the sake of this analysis).

df_combined = pd.concat([numeric_df, df_pca], axis=1) cluster_characteristics = df_combined.groupby('Cluster').mean() cluster_characteristics
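The cell that builds numeric_df and df_pca isn't reproduced here. A minimal sketch of what such a step could look like, on synthetic data, is below. The scaling step, clustering on the two components (rather than on the scaled features), n_clusters=3 and the column names PCA1 / PCA2 are all assumptions on my part:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric columns of the real dataframe.
rng = np.random.default_rng(42)
numeric_df = pd.DataFrame({
    "temp": rng.uniform(0, 1, 300),
    "humidity": rng.uniform(0, 1, 300),
    "casual": rng.poisson(35, 300),
    "registered": rng.poisson(150, 300),
})

# Scale first so no single column dominates the components.
scaled = StandardScaler().fit_transform(numeric_df)

# Two principal components.
components = PCA(n_components=2).fit_transform(scaled)
df_pca = pd.DataFrame(components, columns=["PCA1", "PCA2"])

# Three clusters, fitted on the two components (an assumption).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df_pca["Cluster"] = kmeans.fit_predict(components)

df_combined = pd.concat([numeric_df, df_pca], axis=1)
print(df_combined.groupby("Cluster").mean())
```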

The characteristics of the three clusters:

Cluster 0:
• Lower activity in terms of casual and rental counts.
• Season is closer to early months (around season 1).
• Hour is mostly distributed around midday and in the early day range.
• Moderate work_day presence (close to average).
• This cluster has lower temperature, a-temp, and daylight hours.
• Contains more casual rentals during later evening hours.
• PCA indicates data that is less extreme in its separation (negative PCA1, higher PCA2 values).

Cluster 1:
• Peak activity for casual and registered rentals, concentrated mainly in favourable weather conditions.
• Associated primarily with season 2 and 3 (spring and summer), where temperatures are higher and preferable for rentals.
• Work-related characteristics, such as higher activity during mid-morning to late afternoon.
• Strong correlation to favorable temperatures.
• Higher casual and registered rental counts overall compared to other clusters.
• Strong positive values in PCA1; peak favourable conditions.

Cluster 2:
• Contains a mix of seasons but leans toward later months in the year (season 3-4).
• Hourly activities fall earlier in the day compared to cluster 1 but have more scattered activity than cluster 0.
• Slightly lower work_day influence compared to cluster 1.
• Characteristics suggest decline or less favourable conditions; lower temperatures but higher humidity and windy conditions.
• PCA sees significant separation in PCA2 with highly negative values, indicating unfavourable scenarios for usage.

df_combined.groupby('Cluster')[['casual', 'registered']].mean()

The breakdown of casual and registered users in the clusters: Cluster 1 has the highest activity for both casual and registered users, where clusters 0 and 2 have lower activity figures.

Cluster 0:
• Average casual rentals: 14.98
• Average registered rentals: 93.27

Cluster 1:
• Average casual rentals: 82.91
• Average registered rentals: 297.62

Cluster 2:
• Average casual rentals: 12.15
• Average registered rentals: 79.96

Modeling.

df['hour_sin'] = np.sin(df.hour*(2. * np.pi / 24))
df['hour_cos'] = np.cos(df.hour*(2. * np.pi / 24))
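The point of the sin / cos pair is that the hour becomes a position on a circle, so 23:00 and 00:00 end up as neighbours rather than 23 units apart. A small sketch to illustrate (encode_hour is a hypothetical helper, not part of the notebook):

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) onto the unit circle."""
    angle = hour * (2.0 * np.pi / 24)
    return np.array([np.sin(angle), np.cos(angle)])

# Distances between encoded hours on the circle.
d_23_to_0 = np.linalg.norm(encode_hour(23) - encode_hour(0))
d_0_to_1 = np.linalg.norm(encode_hour(0) - encode_hour(1))
d_0_to_12 = np.linalg.norm(encode_hour(0) - encode_hour(12))

print(d_23_to_0, d_0_to_1, d_0_to_12)
```

The first two distances are identical, while midnight and midday sit at opposite ends of the circle.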

One-hot-encoding some of the categorical values:

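# sparse_threshold=0 forces a dense array, so the result can go straight into pd.DataFrame.
transformer = make_column_transformer(
    (OneHotEncoder(), ['weekday', 'month', 'season', 'weather_sit']),
    remainder='passthrough',
    sparse_threshold=0)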

Thinking about some strategies for additional grouping of the already-binned features here. The daylight hours can all go together because there's a good spread of rentals between the hours of 0900 and 2000, especially in the months with longer days. This would be a good opportunity to bin some "time of day" and workday (etc.) features and check for other correlations, with a view to collapsing any irrelevant features for the end model.

Calculating daylight, time and day feature correlations:

daylight_columns = ['s1_daylight_hrs', 's2_daylight_hrs', 's3_daylight_hrs', 's4_daylight_hrs']
daylight_corr = df[daylight_columns + ['casual', 'registered']].corr()
time_of_day_bins = ['midnight', 'early_morning', 'morning', 'late_morning', 'afternoon', 'late_afternoon', 'early_evening', 'evening', 'late_evening']
time_of_day_corr = df[time_of_day_bins + ['casual', 'registered']].corr()
workday_summary = df.groupby('work_day')[['casual', 'registered']].mean()
holiday_summary = df.groupby('holiday')[['casual', 'registered']].mean()
{
    'daylight_corr': daylight_corr[['casual', 'registered']],
    'time_of_day_corr': time_of_day_corr[['casual', 'registered']],
    'workday_summary': workday_summary,
    'holiday_summary': holiday_summary
}

Here we see a low negative correlation between s1_daylight_hrs (spring) and both casual and registered rentals.

1. Daylight hour bins:

• Summer and Autumn show the strongest correlations for casual users at 0.29 and 0.31, respectively. Winter sees the third-strongest correlation at 0.14 for casual and, as seen earlier, a higher rate (0.24 here) for registered users in this season. So I will combine s2_daylight_hrs and s3_daylight_hrs due to their close correlations.

2. Time of day bins:

• Early morning shows a negative correlation with both casual and registered users.

• Late afternoon and early evening have moderate positive correlations with registered users.

• Afternoon has the highest positive correlation with casual users, making this the most important bin of all.

3. Workday bins:

• On workdays, casual usage averages less than half of that on non-workdays.

• On holidays, casual rentals are slightly higher, though registered users are lower.

So as well as combining summer and autumn as "combined_daylight_hrs", I will place the time of day bins with minimal relevance into a column named "low_usage". Then I will combine afternoon and late_afternoon into an 'active_afternoon' bin / column.

df['combined_daylight_hrs'] = df['s2_daylight_hrs'] + df['s3_daylight_hrs']
df['low_usage'] = df['midnight'] + df['early_morning']
df['active_afternoon'] = df['afternoon'] + df['late_afternoon']
y = df.casual
# No inplace=True here: with inplace=True, drop() returns None and X would be empty.
X = df.drop(['afternoon', 'late_afternoon', 's1_daylight_hrs', 's2_daylight_hrs', 's3_daylight_hrs', 's4_daylight_hrs', 'midnight', 'early_morning', 'casual', 'rental_count', 'hour'], axis=1)

transformed = transformer.fit_transform(X)
X = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
best_params = {
    'n_estimators': 500,
    'learning_rate': 0.05,
    'max_depth': 10,
    'min_child_weight': 5,
    'subsample': 0.9,
    'colsample_bytree': 0.7,
    'gamma': 0.1,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
    'random_state': 42
}
xgb_model_2 = XGBRegressor(**best_params)
xgb_model_2.fit(X_train, y_train)
# R2 on the held-out 20%.
xgb_r2 = xgb_model_2.score(X_test, y_test)
print(f"R2 score: {round(xgb_r2, 3)}")
# RMSE on the same held-out set, in rentals per hour.
y_pred = xgb_model_2.predict(X_test)
xgb_mse = mean_squared_error(y_test, y_pred)
xgb_rmse = np.sqrt(xgb_mse)
print(f"RMSE: {round(xgb_rmse, 3)}")

Scatterplot of predicted values:

SHAP explainer:

Results.

The SHAP analysis reveals the top 10 influential features on the model's output and their average impact:

• 1: registered (14.30): This feature, which represents the registered users, has the highest influence on predicting casual rentals. It likely captures indirect relationships or overlaps between casual and registered usage patterns.
• 2: work_day (9.26): Whether or not it's a workday has a high impact on casual bike rentals. As seen so far in the EDA, casual rentals are generally higher during weekends and non-workdays.
• 3: temp (5.55): Temperature has a strong positive influence on casual rental counts, as pleasant weather encourages leisure biking.
• 4: hour_sin (4.92). Hour sin & cos are the result of cyclical treatment so we will have to refer to the EDA for more in-depth information re: exact rental times. Time on its own will be important because every rental has a time attached to it, although with the introduction of more variables such as day, month etc., the time variable on its own begins to hold a little less weight. The pivot table in the EDA returns the times of the month for better insights into time ++ [other features].
• 5: hour_cos (4.30).
• 6: a-temp (3.27): 'Feels-like' temperature has a notable effect on predicting rentals, as it combines temperature and perceived comfort. This is a very informative nugget which begs further investigation (TBC).
• 7: active_afternoon (2.40): Active afternoon hours (afternoon + late_afternoon) obviously influence casual rentals since these times align with leisure activities and have had the benefit of further data binning, due to their importance / influence.
• 8: combined_daylight_hrs (2.34): Daylight availability during seasons plays a moderate role in determining rental counts.
• 9: week_num (2.28): The week number in the year has some influence, potentially indicating seasonal effects or trends over time.
• 10: humidity (2.27): Humidity negatively affects casual rentals, as higher levels of humidity make cycling less comfortable.
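The "average impact" figures in a ranking like this are typically the mean absolute SHAP value per feature, which is what SHAP's bar-style summary plot displays. I'm assuming that is how the numbers above were produced. A sketch of the calculation on a tiny, made-up SHAP matrix (the values are purely illustrative, not taken from the model):

```python
import numpy as np
import pandas as pd

# Hypothetical SHAP matrix: one row per prediction, one column per feature.
feature_names = ["registered", "work_day", "temp"]
shap_values = np.array([
    [ 12.0,  -9.0,  4.0],
    [-16.0,   8.0, -6.0],
    [ 15.0, -11.0,  7.0],
])

# Global importance: average magnitude of each feature's contribution.
mean_abs = pd.Series(np.abs(shap_values).mean(axis=0), index=feature_names)
ranking = mean_abs.sort_values(ascending=False)
print(ranking)
```

Taking the absolute value first means a feature that pushes predictions strongly in both directions still ranks highly, so the ranking says nothing about the sign of the effect.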

Saturday, Sunday and Friday are important features as also seen in the EDA.

Recommendations.

It's doubtful that there exists great potential for flipping casual users, but there is some. The overlap in casual vs. registered use witnessed here doesn't require much investigation and the resulting push for further registrations could be focussed in one or two small areas for maximum impact, such as long-term commuting (this is a hobby project so I won't go into a lot of detail here, but this is one instance that *pops* to me, personally).

Alternatively, if this data were used to forecast conditions for either the placement or removal of bikes in certain sharing locations, there is plenty of information in the clusters, pivot tables and SHAP plot by which to make a few sound decisions. Cheers.
