We import the necessary libraries and load a CSV of basic NBA team stats. The data come from https://www.basketball-reference.com/leagues/NBA_2020.html

The original dataset has asterisks next to playoff team names. To avoid any confusion, the asterisks will be removed. Additionally, the last row holds the league average of each statistic, so that row will be dropped as well.
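A minimal sketch of this cleanup, assuming the team names live in a `Team` column and the final row is the league average (the stand-in frame below is illustrative, not the real export):

```python
import pandas as pd

# Tiny stand-in for the Basketball-Reference export (structure assumed)
df = pd.DataFrame({
    "Team": ["Milwaukee Bucks*", "Golden State Warriors", "League Average"],
    "PTS": [118.7, 106.3, 111.8],
})

# Strip the playoff asterisks from team names
df["Team"] = df["Team"].str.replace("*", "", regex=False)

# Drop the final row, which holds the league average of each statistic
df = df.iloc[:-1].reset_index(drop=True)
```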

According to data scientists who work with basketball data, the ideal predictors of a good team are effective field goal percentage, turnover percentage, free throw factor, offensive rebounding percentage, and defensive rebounding percentage. Of those five, three (effective field goal percentage, turnover percentage, and free throw factor) are created here as new columns in the dataframe.
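The three derived columns can be computed from box-score totals. The formulas below follow the common Four Factors definitions; the column names (`FG`, `3P`, `FGA`, `FT`, `FTA`, `TOV`) match Basketball-Reference's headers but are assumptions here:

```python
import pandas as pd

# Single illustrative row of box-score totals (values are made up)
df = pd.DataFrame({
    "FG": [40.7], "3P": [13.7], "FGA": [88.4],
    "FT": [17.4], "FTA": [23.8], "TOV": [14.9],
})

# Effective field goal percentage: weights made 3-pointers 1.5x
df["eFG%"] = (df["FG"] + 0.5 * df["3P"]) / df["FGA"]

# Turnover percentage: turnovers per estimated possession
df["TOV%"] = df["TOV"] / (df["FGA"] + 0.44 * df["FTA"] + df["TOV"])

# Free throw factor: free throws made per field goal attempt
df["FT Factor"] = df["FT"] / df["FGA"]
```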

To bring in the other reportedly important statistics, we import a second CSV and drop its effective field goal percentage column, since we already created that column ourselves. The turnover percentage column was also created, but another reason to drop it, along with opponent turnover percentage, is that Basketball-Reference's calculation differs slightly from the generic turnover percentage statistic (Basketball-Reference doesn't include assists in the denominator).
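A sketch of that column pruning; the second frame and its column names (`eFG%`, `TOV%`, `Opp TOV%`, `ORB%`, `DRB%`) are stand-ins for the real advanced-stats CSV:

```python
import pandas as pd

# Stand-in for the second CSV (advanced stats); columns are assumed
misc = pd.DataFrame({
    "Team": ["Milwaukee Bucks"],
    "eFG%": [0.550], "TOV%": [12.0], "Opp TOV%": [13.1],
    "ORB%": [23.3], "DRB%": [80.2],
})

# Drop the columns we already derived ourselves, plus the turnover
# columns whose Basketball-Reference formula differs from the generic one
misc = misc.drop(columns=["eFG%", "TOV%", "Opp TOV%"])
```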

Before stripping the asterisks from the playoff team names, we create a playoff-teams variable that keeps the names of teams that made the playoffs and stores NaN for those that didn't. The non-playoff-teams variable does the exact opposite, keeping non-playoff teams and turning playoff teams into NaN. Finally, we save the league-average row in its own variable in case we need it later and drop it from the dataframe.
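One way to sketch this with pandas `where`, assuming the asterisk suffix still marks playoff teams at this point:

```python
import pandas as pd

teams = pd.Series(["Milwaukee Bucks*", "Golden State Warriors", "League Average"])

# Teams whose name still ends in "*" made the playoffs
made_playoffs = teams.str.endswith("*")

# Keep playoff teams, NaN elsewhere -- and vice versa
playoff_teams = teams.where(made_playoffs)
non_playoff_teams = teams.where(~made_playoffs)

# Set aside the league-average row before dropping it
league_average = teams.iloc[-1]
teams = teams.iloc[:-1]
```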

Here we merge the two dataframes on the Team column.
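The merge itself is a one-liner; the two small frames below are illustrative stand-ins for the basic and advanced stats:

```python
import pandas as pd

basic = pd.DataFrame({"Team": ["Milwaukee Bucks", "Golden State Warriors"],
                      "PTS": [118.7, 106.3]})
advanced = pd.DataFrame({"Team": ["Milwaukee Bucks", "Golden State Warriors"],
                         "ORB%": [20.9, 22.7]})

# Inner merge on the shared Team column
merged = basic.merge(advanced, on="Team")
```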

Creating a column for playoff teams and replacing NaN values with 0 in preparation for converting the column to binary.

This for loop replaces every nonzero value, which is a playoff team name, with a 1.
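The two steps above (filling NaN with 0, then mapping remaining team names to 1) can be sketched as follows; the column name `Playoff Teams` is an assumption:

```python
import pandas as pd

merged = pd.DataFrame({
    "Team": ["Milwaukee Bucks", "Golden State Warriors"],
    "Playoff Teams": ["Milwaukee Bucks", float("nan")],
})

# Replace NaN (non-playoff teams) with 0
merged["Playoff Teams"] = merged["Playoff Teams"].fillna(0)

# Replace any remaining nonzero value (a team name) with 1
for i in range(len(merged)):
    if merged.loc[i, "Playoff Teams"] != 0:
        merged.loc[i, "Playoff Teams"] = 1
```

A vectorized equivalent would be `(merged["Playoff Teams"] != 0).astype(int)`, which avoids the explicit loop.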

Importing the regression tool. Also, setting the predictors to be points per game, effective field goal percentage, turnover percentage, free throw factor, net rating, offensive rebounding percentage, and defensive rebounding percentage. The response is the newly created playoff team column, with 1 for a playoff team and 0 for a non-playoff team.

This code computes the precision, recall, and F1 score of the model. These metrics describe different aspects of classification performance: precision is the share of predicted playoff teams that actually made the playoffs, recall is the share of actual playoff teams the model found, and F1 is their harmonic mean.
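A small illustration with made-up labels, assuming scikit-learn's metrics functions:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Illustrative true labels (1 = playoff team) and model predictions
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

precision = precision_score(y_true, y_pred)  # of predicted 1s, how many were right
recall = recall_score(y_true, y_pred)        # of actual 1s, how many were found
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
```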

Now, we split the data into training and validation sets. The validation set gives an estimate of how well the model performs on data it was not trained on.
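The split is typically done with scikit-learn's `train_test_split`; the frame below and the 25% holdout fraction are assumptions for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame({"PTS": [110, 105, 118, 101, 112, 108, 115, 99],
                  "ORB%": [22, 20, 24, 19, 23, 21, 25, 18]})
y = [1, 0, 1, 0, 1, 0, 1, 0]

# Hold out 25% of teams as a validation set; random_state pins the split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)
```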

We now fit the logistic regression model on the training set and calculate F1 scores on both the training and validation sets in order to gauge model performance.
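A minimal sketch of that fit-and-score step using synthetic, separable data in place of the real NBA predictors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic 1-D data: label is 1 when the feature is >= 10
X = np.arange(20).reshape(-1, 1)
y = (X[:, 0] >= 10).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Fit on the training set only, then score both splits
model = LogisticRegression().fit(X_train, y_train)
train_f1 = f1_score(y_train, model.predict(X_train))
val_f1 = f1_score(y_val, model.predict(X_val))
```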

Because rerunning those lines of code every time new predictors are tried is tedious, these functions take the necessary inputs and return the F1 score.
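One plausible shape for such a helper (the name `fit_and_score` is hypothetical; the notebook's own function may differ):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def fit_and_score(X_train, y_train, X_eval, y_eval):
    """Fit a logistic regression on the training data and
    return the F1 score on the evaluation data."""
    model = LogisticRegression().fit(X_train, y_train)
    return f1_score(y_eval, model.predict(X_eval))

# Usage with tiny illustrative data
score = fit_and_score([[0], [1], [2], [3]], [0, 0, 1, 1], [[0], [3]], [0, 1])
```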

Now we extend the fit_model_to function to standardize the predictors before fitting, sort the resulting coefficients in descending order, and print them. The function also takes the predictor columns as an input so they can be changed.
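A sketch of what that enhanced function might look like, assuming scikit-learn's `StandardScaler` is used for the standardization (the exact signature in the notebook may differ):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def fit_model_to(df, predictors, response):
    """Standardize the predictor columns, fit a logistic regression,
    and print the coefficients sorted in descending order."""
    X = StandardScaler().fit_transform(df[predictors])
    model = LogisticRegression().fit(X, df[response])
    coefs = pd.Series(model.coef_[0], index=predictors)
    coefs = coefs.sort_values(ascending=False)
    print(coefs)
    return coefs

# Usage with a tiny illustrative frame
demo = pd.DataFrame({"a": [0, 1, 2, 3, 4, 5],
                     "b": [5, 3, 4, 1, 2, 0],
                     "y": [0, 0, 0, 1, 1, 1]})
coefs = fit_model_to(demo, ["a", "b"], "y")
```

Because the predictors are standardized, the coefficient magnitudes are directly comparable across statistics.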

This model shows that a lower defensive rebounding percentage is associated with better odds of being a playoff team, which doesn't make sense. The next model will remove defensive rebounding percentage since it seems to be counterintuitive.

The training F1 improves slightly, but the validation F1 declines significantly. With so few data points and such high F1 scores, some overfitting is likely.

Let's try making a model with just the 5 biggest factors in predicting winning NBA teams according to data scientists (mentioned earlier).

Once again, the training F1 is slightly higher and the validation F1 is even lower.

We will now go back to the original dataset.

As mentioned, there are only 30 teams, so the dataset is small and the results vary noticeably from run to run. After rerunning the notebook a few times, the original model with 7 predictor variables remains the best at predicting NBA playoff teams.

We will now save the dataframe to a pickle (.pkl) file and score the model without changing it by reading the pickle file back and calling the functions.
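The round trip uses pandas' pickle helpers; the filename below is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"Team": ["Milwaukee Bucks", "Golden State Warriors"],
                   "Playoff Teams": [1, 0]})

# Save the dataframe to a pickle file, then read it back unchanged
df.to_pickle("nba_teams.pkl")  # hypothetical filename
restored = pd.read_pickle("nba_teams.pkl")
```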