Predicting the 2023 NBA Draft Using Decision Trees

by Austin Jeong (austinsehojeong@gmail.com) and Thomas Kim (thomasekim10@gmail.com)

Every year, the NBA hosts a draft to recruit the next generation of talent into the league. From the 100-200 prospects who declare for the draft, only 60 players are ultimately selected. With every draft comes an element of unpredictability due to the possibility of players being overrated, known as "busts", or underrated, known as "steals". Since teams cannot always predict a player's potential before they are drafted, they must carefully strategize and choose players who will enhance their roster. So how do they do that?

The drafting process does not neccessarily come down to an exact science. However, according to an article on Bleacher Report, NBA draft scouts often look for physical attributes and athleticism. This includes height, wingspan, speed, and verticality. The NBA is a very competitive league, and having vertical reach and horizontal mobility is undoubtedly an important asset to have. They also seek players with solid work ethic and coachability. And of course, they look for a saavy basketball player. Someone who excels at offense or defense, and in the best case, both. And if a team cannot acquire a generational talent, they try to secure a role player who can seemlessly fit into and contribute to a team's system. There is no qualitative data that can encapsulate all these factors, and the selection often comes from a scout's knowledge and expertise. However, we believe that a prospect's draftability is reflected by their on-court productivity in college.

We will employ a Decision Tree Classifier to predict the 2023 draft order. This approach emulates the decision making process an NBA scout could make when selecting players. Our model will be trained on data containing various college statistics as well as height. An important thing to note is that we will remove players who did not attend US college from our data. While many of the most famous picks in the NBA draft came from the non-college route, such as Kobe Bryant, Lebron James, and most recently Victor Wembenyama, we want to maintain a level of uniformity in our data because the competition in high school and international leagues can be argued to be different versus that in college. As a result, our 2023 draft predictions will not include non-college prospects.

Import Libraries and Datasets

First, import necessary libraries. We will primarily be using Pandas for data manipulation as well as Numpy, Matplotlib, and Sklearn for doing mathematical calculations, plotting figures, and modeling, respectively.

import pandas as pd import numpy as np from sklearn.linear_model import LinearRegression import matplotlib.pyplot as plt from sklearn.tree import DecisionTreeClassifier from sklearn.model_selection import train_test_split from sklearn import metrics

Next, import the necessary datasets. We used the Beautiful Soup Library to webscrape statistics from Basketball Reference and Sports Reference. These are the datasets we found:

nba_draft: drafted NBA players from 2000-2021 and their career statistics

college_stats: drafted NBA players from 2000-2021 and their college statistics

height_wingspan: drafted NBA players from 2000-2021 and their height and wingspan

names_2023: names of the 2023 NBA draft prospects

WS_2023: WS/40 for the 2023 NBA draft prospects

heights_2023: heights for the 2023 NBA draft prospects

stats_2023: college statistics for the 2023 NBA draft prospects

actual_draft: 2023 NBA draft results (excluding draftees who did not attend US college)

nba_draft = pd.read_csv('nba_draft_data.csv') college_stats = pd.read_csv('college_stats.csv') height_wingspan = pd.read_csv('height_wingspan.csv') names_2023 = pd.read_csv('2023_college.csv') WS_2023 = pd.read_csv('2023_college_statistics_ws40.csv') heights_2023 = pd.read_csv('2023_college_heights.csv') stats_2023 = pd.read_csv('2023_college_statistics.csv') actual_draft = pd.read_csv('actual_draft.csv') nba_draft

college_stats = college_stats.drop(['Unnamed: 0'], axis=1) college_stats

height_wingspan

Data Cleaning

We start by removing unwanted rows and columns from the nba_draft dataset and adding respective draft classes for visual purposes, although we will not be using players' draft classes in our model.

# drop filler rows - rows 92, 154, and 279 say "Minnesota Timberwolves forfeited their First Round pick" nba_draft_clean = nba_draft.dropna(how='all').drop([92 ,154, 279]).reset_index().drop(columns=['index']) #remove columns regarding NBA career statistics, since we are using college statistics nba_draft_clean = nba_draft_clean[['Pk','Player', 'College']].reset_index().drop(columns=['index']) #add draft year for each player pick = nba_draft_clean['Pk'] years = [] year = 1999 for i in pick: if (i == '1'): year = year + 1 years.append(year) nba_draft_clean['draft_class']=years nba_draft_clean

Remove players who did not attend college, or where 'College' is nan. This process results in removing 294 players from our dataset.

#remove players who did not attend college nba_draft_clean = nba_draft_clean[nba_draft_clean['College'].notna()] nba_draft_clean

We then merge datasets of NBA draftees, college stats, and their heights.

*Note that college_stats has fewer rows than nba_draft_clean because of problems we ran into while web scraping, such as not being able to find the player or some stats were missing. As a result, when we merge these two datasets, we exclude the players in nba_draft_clean that are not in college_stats.

nba_draft_college = pd.merge(nba_draft_clean, college_stats, how="right", on = ['Player']) nba_draft_college = pd.merge(nba_draft_college, height_wingspan, how='left', on = ['Player']).drop(['College_y'], axis=1) nba_draft_college = nba_draft_college[nba_draft_college['College_x'].notna()] nba_draft_college

Next, we abbreviate column names for better readability and usability. We also add a column called 'STRT%', which is the number of college games started divided by the number of college games played for each player.

nba_draft_college = nba_draft_college.rename(columns={'College_x' : 'College', 'College_Season_x' : 'Season', 'College_Games_Played' : 'GP', 'College_Games_Started' : 'GS', 'College_Field_Goals_Made_Per_Game' : 'FGMPG', 'College_Field_Goals_Attempted_Per_Game' : 'FGAPG', 'College_FG%': 'FG%', 'College_2PT_Field_Goals_Made_Per_Game' : '2FGMPG', 'College_2PT_Field_Goals_Attempted_Per_Game' : '2FGAPG', 'College_2PT_FG%': '2FG%', 'College_3PT_Field_Goals_Made_Per_Game': '3FGMPG', 'College_3PT_Field_Goals_Attempted_Per_Game' : '3FGAPG', 'College_3PT_FG%' : '3FG%', 'College_Free_Throws_Made_Per_Game' : 'FTMPG', 'College_Free_Throws_Attempted_Per_Game' : 'FTAPG', 'College_FT%' : 'FT%', 'Offensive_Rebounds_pergame' : 'ORPG', 'Defensive_Rebounds_pergame' : 'DRPG', "Total_Rebounds_pergame" : 'RPG', "Assists_pergame": "APG", "Steals_pergame": "SPG", "Blocks_pergame": "BPG", "Turnovers_pergame": "TPG", "Fouls_pergame" : "FPG", "Points_pergame": "PPG", "Team_strength_of_schedule" : "SOS", 'Offensive_Win_Shares' : 'OWS', 'Defensive_Win_Shares' : 'DWS', 'Win_Shares' : 'WS', 'Win_Shares_Per_40' : 'WS/40'}) nba_draft_college['STRT%'] = nba_draft_college['GS']/nba_draft_college['GP'] nba_draft_college

We now proceed with dropping unnecessary columns. We drop features that will not be used in the training data, such as 'Player', 'College' and 'draft_class'. We also drop columns that can be explained better in another singular column. For example, FG% (field goal percentage) is a result of the field goals attempted and made, so we can drop columns like 'FGMPG' (field goals made per game) and 'FGAPG' (field goals attempted per game).

We also fill nan with 0 for every column except for height and remove rows with nan in the height column.

to_drop = ['Player', 'College', 'draft_class', 'Season', 'GP', 'GS', 'FGMPG', 'FGAPG', '2FGMPG', '2FGAPG', '3FGMPG', '3FGAPG', 'FTMPG', 'FTAPG', 'ORPG', 'DRPG', 'OWS', 'DWS', 'WS','Wingspan'] nba_draft_college_clean = nba_draft_college.drop(to_drop, axis=1) nba_draft_college_clean[['FG%', '2FG%', '3FG%', 'FT%', 'RPG', 'APG', 'SPG', 'BPG', 'TPG', 'FPG', 'PPG', 'SOS', 'WS/40', 'STRT%']] = nba_draft_college_clean[['FG%', '2FG%', '3FG%', 'FT%', 'RPG', 'APG', 'SPG', 'BPG', 'TPG', 'FPG', 'PPG', 'SOS', 'WS/40', 'STRT%']].fillna(0) nba_draft_college_clean = nba_draft_college_clean.dropna() nba_draft_college_clean

Now, the data is organized and void of any nan values. nba_draft_college_clean will be used as the training/validation data and contains 831 players' pick number and college statistics. We can proceed with the model.

Implementing the Decision Tree Model

The features that we decided to use in our model are:

FG%: the percentage of field goal attempts made from any distance, excluding free throws. This includes 2-point and 3-point field goal percentages, so those are excluded. FG% indicates a player's efficiency and quality of shot selection which are important for a productive player.

FT%: the percentage of free throw shots made. A high percentage free throw shot is a good indication of great shooter. It can also represent a player's focus and work ethic.

RPG: the number of rebounds per game. Represents a player's activity on the board and their physicality. Having a high number of rebounds per game indicates the player is securing the ball and can help contribute to a team's win.

APG: the number of assists per game. Assists are recorded when a player passes the ball to a teammate which leads directly to a made shot. Displays a player's playmaking and decision-making skills. Represents a player's offensive ability.

SPG: the number of steals per game. Steals are recorded when a player intercepts a pass or dislodges the ball from an opposing player. Steals disrupt the opponent's offense and often creates offensive opportunities through fastbreaks. Represents a player's defensive ability.

BPG: the number of blocks per game. Represents a player's defensive ability to stop shots close to the rim. Having a player who is good at blocking can pose as a difficulty for teams to score in the paint.

PPG: the number of points per game. Represents a player's shotmaking ability and efficiency on offense. Players with high points per game often show consistent scoring ability and shot creation. Being efficient and high scoring are traits that are coveted in the NBA.

WS/40: the number of wins shares per 40 minutes played. Win shares are an advanced statistic that estimates the number of wins an individual contributes to a team and is calculated using a multitude of statistics (more information here). We use win shares per 40 minutes over win shares to account for the variations in playing times between players. WS/40 estimates the number of win shares a player contributes if they played a full 40 minute game.

Height: the height of a player in centimeters. Height is not the be-all end-all for NBA players, but it does improve a player's ability to do things on the court, from shooting at a higher point, passing over defenders, to grabbing rebounds and blocking.

For the Decision Tree model, we referred to this tutorial from Datacamp. We first create an 80/20 split and then we run the model for 1000 iterations and save the best performing classifier into best_clf. We determine the performace of a single model by getting the mean of the differences between our predicted pick and actual pick and save the the lowest mean difference into best_diff. We use the absolute differences as our performance evaluator as opposed to accuracy because with 60 different classes, the accuracy will tend to be low for all cases. For instance, if our model predicts a player to be one position above or below their actual pick, this means the model performed well despite not being completely accurate. We then store the array of differences into best_abs_diff.

#following tutorial at: https://www.datacamp.com/tutorial/decision-tree-classification-python feature_cols = ['FG%', 'FT%', 'RPG', 'APG', 'SPG', 'BPG', 'PPG', 'WS/40', 'Height'] X = nba_draft_college_clean[feature_cols] y = nba_draft_college_clean.Pk best_diff = 60 for i in range(1000): X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1) clf = DecisionTreeClassifier() clf = clf.fit(X_train, y_train) y_pred = clf.predict(X_test) abs_diff = np.absolute(pd.to_numeric(y_pred) - pd.to_numeric(y_test)) if (best_diff > np.mean(abs_diff)): best_clf = clf best_diff = np.mean(abs_diff) best_abs_diff = abs_diff plt.hist(best_abs_diff, bins=59, range = (0,59)) plt.title(label="Distribution of difference in predicted pick # vs. actual pick #") plt.xlabel('difference in pick #') plt.ylabel('frequency') plt.axvline(best_diff, color='red', linestyle='dashed', linewidth=1) print("Average difference in predicted pick # vs. actual pick #: ", best_diff)

best_clf.get_depth()

Above, we can see the distribution of differences between our predictions and actual pick numbers in our test data. The distribution is skewed right, which is a good sign for us since we want the differences to be as small as possible. The red dotted line indicates the mean difference.

We visualized part of the decision tree here to show how the tree comes to its classifications.

from sklearn import tree fig = plt.figure(figsize=(25,25)) _ = tree.plot_tree(best_clf, max_depth=2, feature_names=feature_cols, class_names=nba_draft_college_clean.Pk, impurity=False, filled=True)

Now, we can use our model to predict the picks for the 2023 NBA draft. First, we merge and manipulate the 2023 draft datasets so that we get a dataset that is the same format as nba_draft_college_clean.

*Note that the players listed in the class_2023 dataset are the top 60 players that attended a US college from this list (this link now shows the best available players after the 2023 Draft took place). In other words, we are simulating a draft without non-college players.

#merge datasets and clean class_2023 = names_2023.join(stats_2023) to_drop = ['FNAME', 'Unnamed: 0', 'College_Games_Started', 'College_Games_Played', 'College_2PT_FG%', 'College_3PT_FG%', 'Fouls_pergame', "Team_strength_of_schedule", "Turnovers_pergame"] class_2023 = class_2023.drop(to_drop, axis=1).rename(columns={'College_FG%': 'FG%', 'College_FT%' : 'FT%', "Total_Rebounds_pergame" : 'RPG', "Assists_pergame": "APG", "Steals_pergame": "SPG", "Blocks_pergame": "BPG", "Turnovers_pergame": "TPG", "Points_pergame": "PPG", "Team_strength_of_schedule" : "SOS"}).fillna(0) class_2023 = pd.merge(class_2023, WS_2023, how="left", on = ['Player']).drop(['Unnamed: 0'], axis=1) class_2023 = pd.merge(class_2023, heights_2023, how="left", on = ['Player']) class_2023 = class_2023.drop(['Player'], axis=1) class_2023

We now use our model to make predictions on the 2023 draft prospects.

predictions = best_clf.predict(class_2023) pick_order = [] labels = [] for i in range(1,61): pick_order.append(i) labels.append(str(i)) class_2023_predictions = names_2023.drop(['FNAME'], axis=1) class_2023_predictions['Projected Pick'] = pick_order class_2023_predictions['Predicted Pick'] = predictions print("Average difference in predicted pick # vs. actual pick #: ", np.mean(np.absolute(class_2023_predictions['Projected Pick'] - pd.to_numeric(class_2023_predictions['Predicted Pick'])))) class_2023_predictions

abs_diff = np.abs(class_2023_predictions['Projected Pick'] - pd.to_numeric(class_2023_predictions['Predicted Pick'])) plt.hist(abs_diff, bins=59, range = (0,59)) plt.title(label="Distribution of difference in predicted pick # vs. projected pick #") plt.xlabel('difference in pick #') plt.ylabel('frequency') plt.axvline(np.mean(abs_diff), color='red', linestyle='dashed', linewidth=1) print("Average difference in predicted pick # vs. actual pick #: ", np.mean(abs_diff))

This table shows what our model predicts to be the order of the 2023 draft.

class_2023_predictions['Predicted Pick'] = pd.to_numeric(class_2023_predictions['Predicted Pick']) class_2023_predictions.sort_values(by='Predicted Pick', ascending=True)

*This part was done after the 2023 NBA Draft took place.

At the time of writing, we now know the outcomes of the 2023 draft, so we can evaluate our model based on the real outcomes as opposed to mock drafts. Here, the 'Predicted Pick' is our model's predictions, the 'Actual Pick' are the results of the 2023 draft, and the 'difference' is the absolute difference between the predicted pick and the actual pick. The players are sorted in ascending orders on 'Actual Pick'.

actual_draft = pd.read_csv('actual_draft.csv') final = pd.merge(class_2023_predictions, actual_draft, how='left', on = ['Player']) final.rename(columns={'Pk': 'Actual Pick'}, inplace=True) final = final.dropna().drop(['Projected Pick'], axis=1) final['difference'] = np.abs(final['Actual Pick'] - final['Predicted Pick']) final = final[['Player', 'Actual Pick', 'Predicted Pick', 'difference']].sort_values(by='Actual Pick', ascending=True) final

The result shows only 36 rows, because 24 players from the mock draft we used did not end up getting drafted.

print('average difference in actual pick # vs. predicted pick #: ', np.mean(final['difference'])) plt.hist(final['difference'], bins=59, range = (0,59)) plt.title(label="Distribution of difference in predicted pick # vs. actual pick #") plt.xlabel('difference in pick #') plt.ylabel('frequency') plt.axvline(np.mean(final['difference']), color='red', linestyle='dashed', linewidth=1)

Takeaways

Our model was able to predict the 2023 draft with a difference error of around 15 picks. Considering that there are 60 picks in the draft where the difference can vary from 0-59, roughly being in the lower quartile of that range is a decent result for us. Furthermore, as stated in the introduction, the NBA draft has some quantitative analytical aspects to it, but it is widely a qualitative decision making progress. That said, some improvements we can make if we were to do this project again are:

Find stronger and more diverse features that can encapsulate a player's draft stock better than college statistics and height. Using better features can improve the performance of our model.

Include non-college athletes into the model. As seen in the 2023 draft which featured 5 non-college prospects in the top 10, more and more players are coming into the NBA internationally or from domestic leagues (G League, Overtime Elite)

Discover a way to fix the issue where our predictions exhibit duplicate pick numbers while maintaining model performance. Ideally, we want our model to rank the players more so than just classifying a suitable pick number. However, we couldn't find a solution for this at the time of writing.

Lastly, we would like to give credit to this article by Saadin Mir which served as an inspiration and a helpful guide to kickstart our project.

.css-15w88e5{color:var(--chakra-colors-fg-neutral-primary);font-weight:inherit;letter-spacing:-0.09px;}Predicting the 2023 NBA Draft Using Decision Trees

Import Libraries and Datasets

Data Cleaning

Implementing the Decision Tree Model

Takeaways

Predicting the 2023 NBA Draft Using Decision Trees