Final Project, MA346, Fall 2025

Lucas Aguiar & Liam Harrigan

Files:

Report

https://deepnote.com/workspace/lucas-workspace-b007-5e603550-5fe7-4d66-bc5f-f6fcdf27adcf/project/Final-Project-9631ccb6-63ae-41a6-a64d-bfaa15749ecc/MA%20346%20Final%20Project%20Report.docx?utm_content=9631ccb6-63ae-41a6-a64d-bfaa15749ecc

NBA Advanced Stats 2023

https://deepnote.com/workspace/lucas-workspace-b007-5e603550-5fe7-4d66-bc5f-f6fcdf27adcf/project/Final-Project-9631ccb6-63ae-41a6-a64d-bfaa15749ecc/nba_2023_advanced.csv?utm_content=9631ccb6-63ae-41a6-a64d-bfaa15749ecc

NBA Per Game Stats w/ Salaries 2023

https://deepnote.com/workspace/lucas-workspace-b007-5e603550-5fe7-4d66-bc5f-f6fcdf27adcf/project/Final-Project-9631ccb6-63ae-41a6-a64d-bfaa15749ecc/cleaned_pergame_and_advance_2023_w_salaries.csv.xls?utm_content=9631ccb6-63ae-41a6-a64d-bfaa15749ecc

Merged Dataset

https://deepnote.com/workspace/lucas-workspace-b007-5e603550-5fe7-4d66-bc5f-f6fcdf27adcf/project/Final-Project-9631ccb6-63ae-41a6-a64d-bfaa15749ecc/merged_dataset.csv?utm_content=9631ccb6-63ae-41a6-a64d-bfaa15749ecc

Brief Dataset Overview:

Our first dataset includes the advanced stats for all active NBA players in the 2022-23 season. The data comes from Basketball Reference, a trusted source for NBA statistics. The key columns we use are Player, Team, USG%, WS, and PER.

The second dataset covers the same active players for 2022-23 and adds their salaries and basic stats (PPG, RPG, etc.). This dataset comes from a Kaggle user who collected player salary information from Hoopshype, a website for NBA news.

Our plan is to test whether higher values of three advanced metrics (USG%, WS, and PER) correlate with higher player salaries. For context, USG% measures how much a player is being used on offense, WS (win shares) measures a player's contribution to their team's wins by crediting them with a portion of those wins, and PER (player efficiency rating) measures a player's per-minute productivity by weighing positive and negative stats. One would expect players with higher values in these stats to be paid more on average, but we want to check that against the data.
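As a quick illustration of what "correlate with higher salaries" means in practice, the sketch below computes the Pearson correlation between each metric and salary. The `toy` table and every number in it are made up for illustration; they are not taken from either dataset.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "USG%": [18.0, 22.5, 30.1, 15.2, 27.4],
    "WS": [3.1, 5.4, 9.8, 1.9, 7.2],
    "PER": [12.0, 15.5, 24.3, 10.1, 19.8],
    "Salary": [2.0e6, 8.5e6, 35.0e6, 1.8e6, 22.0e6],
})

# Pearson correlation of each metric column with the salary column.
salary_corr = toy[["USG%", "WS", "PER"]].corrwith(toy["Salary"])
print(salary_corr)
```

A value near 1 would indicate that salary tends to rise with the metric, a value near 0 would indicate little linear relationship. The regression later in the notebook goes one step further by considering all three metrics at once.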

# First Dataset
import pandas as pd

url = "https://www.basketball-reference.com/leagues/NBA_2023_advanced.html"
tables = pd.read_html(url)
df = tables[0]  # The first table is the advanced stats, including key columns USG% and TS%
df.to_csv("nba_2023_advanced.csv", index=False)

# Drop League Average row
df = df[df['Player'] != 'League Average']

# Keep only rows where team == "2TM", or rows where the player has only 1 row
df_clean = df[
    (df['Team'] == '2TM') |  # Keep 2TM rows
    (df.groupby('Player')['Player'].transform('count') == 1)  # Keep players with only 1 row
]
df_clean


# Second Dataset
df2 = pd.read_csv('cleaned_pergame_and_advance_2023_w_salaries.csv.xls')
df2 = df2[df2['Season Type'] == "Regular"]

# Keep only rows where team == "2TM", or rows where the player has only 1 row
df2_clean = df2[
    (df2['Team'] == '2TM') |  # Keep 2TM rows
    (df2.groupby('Player')['Player'].transform('count') == 1)  # Keep players with only 1 row
].copy()  # Copy so the name fix below edits an independent frame, not a slice of df2

# Find rows in df_clean (stats) that don't have a match in df2_clean (salary)
diff_row = df_clean[~df_clean['Player'].isin(df2_clean['Player'])]
print(diff_row)

# There is one mismatch, with df_clean including "Jeff Dowtin Jr." while df2_clean is "Jeff Dowtin"
# Replace player name
df2_clean['Player'] = df2_clean['Player'].replace("Jeff Dowtin", "Jeff Dowtin Jr.")
df2_clean


Merge the two data sets

# Perform inner join on 'Player'
merged_df = pd.merge(df_clean, df2_clean, on='Player', how='inner')
merged_df

# Filter by minimum minutes requirement
filtered_df = merged_df[merged_df['MP_x'] >= 1500].copy()
filtered_df
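The `_x` and `_y` endings on names such as `MP_x` and `PER_y` come from the merge itself: when both tables share a non-key column name, pandas appends its default suffixes `_x` (left table) and `_y` (right table). A minimal sketch with two hypothetical tables (the player names and values are made up):

```python
import pandas as pd

# Hypothetical two-row tables; both carry a 'PER' column (made-up values).
left = pd.DataFrame({
    "Player": ["A Guard", "B Center"],
    "PER": [15.1, 20.4],
    "WS": [4.0, 6.2],
})
right = pd.DataFrame({
    "Player": ["A Guard", "B Center"],
    "PER": [15.1, 20.4],
    "Salary": [5_000_000, 12_000_000],
})

# 'PER' exists on both sides, so the merged frame holds PER_x and PER_y.
merged = pd.merge(left, right, on="Player", how="inner")
print(merged.columns.tolist())
```

Columns that appear in only one table, such as `WS` and `Salary` here, keep their original names.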


Verify the merge

We know that each dataset should have the same number of rows, since both represent the same set of active 22-23 players. In the cleaning phase, we noticed that because some players get traded in the middle of the season, each traded player appears in the data at least three times: a totals row, a 1st team row (pre-trade), a 2nd team row (post-trade), and so on if they were traded more than once. For these players, the totals row is the one we wanted, and it was always the first row in the group of rows for that player. We wrote code that kept only this totals row, but then noticed that one of the datasets still had one extra row. This was because that dataset had a "League Average" row at the bottom. We dropped it and finally had the same number of rows in each dataset. We also noticed that for one player, Jeff Dowtin Jr., the stats dataset included "Jr." while the salaries dataset did not. This was a simple fix using .replace. We then verified that the salaries corresponding to each player seemed correct and performed the merge.
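The checks described above can also be automated. The sketch below is one possible approach, not the code used in this notebook: it keeps the first row per player (relying on the totals row coming first), uses an outer join with `indicator=True` to list unmatched names, and then merges with `validate="one_to_one"` so pandas raises an error if any player appears twice. The tables, player names, and team codes are hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (made-up rows).
stats = pd.DataFrame({
    "Player": ["A Guard", "A Guard", "A Guard", "B Center", "C Wing Jr."],
    "Team": ["2TM", "BOS", "LAL", "MIA", "TOR"],
    "WS": [4.0, 1.5, 2.5, 6.2, 0.3],
})
salaries = pd.DataFrame({
    "Player": ["A Guard", "B Center", "C Wing"],
    "Salary": [5_000_000, 12_000_000, 1_000_000],
})

# The totals row comes first within each player's group, so keeping the
# first occurrence of each name leaves exactly one row per player.
stats_one = stats.drop_duplicates(subset="Player", keep="first")

# An outer join with indicator=True adds a '_merge' column marking each
# row as 'both', 'left_only', or 'right_only'.
check = stats_one.merge(salaries, on="Player", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Player", "_merge"]])

# Fix the name mismatch, then merge; validate="one_to_one" raises a
# MergeError if either side has duplicate player names.
salaries["Player"] = salaries["Player"].replace("C Wing", "C Wing Jr.")
merged = stats_one.merge(salaries, on="Player", how="inner", validate="one_to_one")
print(len(merged))
```

Here the unmatched rows are the two spellings of the same player, which mirrors the "Jeff Dowtin" case found in the real data.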

Multiple Linear Regression

To fit a multiple linear regression model, we will install and import the statsmodels package.

!pip install statsmodels==0.14.5


Now, we can fit the model.
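Written out, the model fit in this step has the form below, where $i$ indexes players. The symbols $\beta_0, \dots, \beta_3$ (intercept and slopes) and $\varepsilon_i$ (error term) are notation introduced here for illustration and do not appear elsewhere in the notebook.

```latex
\text{Adjusted Salary}_i
  = \beta_0
  + \beta_1 \,\text{USG\%}_i
  + \beta_2 \,\text{WS}_i
  + \beta_3 \,\text{PER}_i
  + \varepsilon_i
```

Each slope describes the expected change in adjusted salary for a one-unit increase in that metric while the other two are held fixed.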

import statsmodels.api as sm

# Independent variables
X = filtered_df[['USG%', 'WS', 'PER_y']]
X = sm.add_constant(X)  # adds intercept

# Dependent variable
y = filtered_df['Adjusted Salary']

# Fit model
model = sm.OLS(y, X).fit()

# View summary
print(model.summary())
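To show what the intercept column and the least-squares fit are doing, here is a minimal sketch that uses NumPy's `np.linalg.lstsq` instead of statsmodels. The predictor values and the coefficients in `true_beta` are made up, and the response is built without noise so the fit should recover those coefficients.

```python
import numpy as np

# Hypothetical predictor values (columns: USG%, WS, PER); made up for illustration.
X = np.array([
    [18.0, 3.1, 12.0],
    [22.5, 5.4, 15.5],
    [30.1, 9.8, 24.3],
    [15.2, 1.9, 10.1],
    [27.4, 7.2, 19.8],
    [20.3, 4.0, 16.7],
])

# Made-up coefficients: intercept first, then one slope per predictor.
true_beta = np.array([1.0, 0.5, 2.0, -0.3])

# Prepend a column of ones so the first coefficient acts as the intercept
# (the role sm.add_constant plays in the notebook code).
X_design = np.column_stack([np.ones(len(X)), X])

# Noise-free response, so least squares should recover true_beta.
y = X_design @ true_beta

# Solve the least-squares problem min ||X_design @ beta - y||^2.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)
```

With real salary data the response is noisy, so the estimates come with standard errors and p-values, which is what the statsmodels summary reports.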


We can then create scatterplots to see what player salary looks like relative to each stat.

# Visualizations
import matplotlib.pyplot as plt

# One figure with three side-by-side panels, so the plots do not overlap
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: USG% vs Salary
axes[0].scatter(filtered_df["USG%"], filtered_df["Adjusted Salary"], alpha=0.6)
axes[0].set_xlabel("USG%")
axes[0].set_ylabel("Salary ($)")
axes[0].set_title("Salary vs USG%")

# Plot 2: PER vs Salary
axes[1].scatter(filtered_df["PER_y"], filtered_df["Adjusted Salary"], alpha=0.6)
axes[1].set_xlabel("PER")
axes[1].set_ylabel("Salary ($)")
axes[1].set_title("Salary vs PER")

# Plot 3: Win Shares vs Salary
axes[2].scatter(filtered_df["WS"], filtered_df["Adjusted Salary"], alpha=0.6)
axes[2].set_xlabel("Win Shares (WS)")
axes[2].set_ylabel("Salary ($)")
axes[2].set_title("Salary vs WS")

plt.tight_layout()
plt.show()

