Final Project, MA346, Fall 2025
Lucas Aguiar & Liam Harrigan
Files:
Report
NBA Advanced Stats 2023
NBA Per Game Stats w/ Salaries 2023
Merged Dataset
Brief Dataset Overview:
Our first dataset contains the advanced stats for all active NBA players in the 2022-23 season. The data comes from Basketball Reference, a trusted source for NBA statistics. The key columns we will use are Player, Team, USG%, WS, and PER.
The second dataset also covers all active players for 2022-23 and adds their salaries and basic per-game stats (PPG, RPG, etc.). It comes from a Kaggle user who collected the salary information from HoopsHype, a website for NBA news.
Our plan is to test whether higher values of three advanced metrics (USG%, WS, and PER) are associated with higher player salaries. For context, USG% (usage percentage) measures how much a player is used on offense, WS (win shares) measures a player's contribution to their team's success by crediting them with a portion of the team's wins, and PER (player efficiency rating) measures a player's per-minute productivity by weighting positive and negative stats. One would expect that, on average, players with higher values in these stats are paid more, but we do not know this for sure.
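As a quick first look before any modeling, the pairwise correlation between each metric and salary can be computed directly. This is only a sketch: the data below is synthetic, and the column name `Salary` is an assumption about how the merged dataset labels salaries.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged dataset (values are made up for illustration).
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "USG%": rng.normal(20, 5, n),
    "WS": rng.normal(3, 2, n),
    "PER": rng.normal(15, 4, n),
})
# Assumed column name "Salary"; here salary is driven mostly by WS plus noise.
df["Salary"] = 2e6 + 1.5e6 * df["WS"] + rng.normal(0, 1e6, n)

# Pearson correlation of each metric with salary.
correlations = df[["USG%", "WS", "PER"]].corrwith(df["Salary"])
print(correlations)
```

A correlation near zero for a metric would suggest little linear association with salary on its own, though it says nothing about the metric's effect once the other two are controlled for.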
Merge the two datasets
Verify the merge
We expected both datasets to have the same number of rows, one per active 2022-23 player. In the cleaning phase, we noticed that players who were traded in the middle of the season appear in the data at least three times: a totals row, a row for the first team (pre-trade), a row for the second team (post-trade), and so on for any further trades. For these players, the totals row is the one we wanted, and it was always the first row in the group of rows for that player. We wrote code that kept only this totals row, but one of the datasets still had one extra row. This turned out to be a "League Average" row at the bottom of the data. After dropping it, both datasets had the same number of rows. We also found one player, Jeff Dowtin Jr., whose name included "Jr." in the stats dataset but not in the salaries dataset, which was a simple fix using .replace. We then verified that the salary listed for each player looked correct and performed the merge.
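The cleaning steps described above can be sketched roughly as follows. This is not necessarily the exact code we ran: the two small DataFrames hold made-up values, the team codes and the `Salary` column name are placeholders, and `validate="one_to_one"` is an extra safety check added for illustration.

```python
import pandas as pd

# Toy stand-in for the advanced stats table (all numbers are made up).
advanced = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player A", "Player B",
               "Jeff Dowtin Jr.", "League Average"],
    "Team": ["TOT", "AAA", "BBB", "CCC", "DDD", None],
    "WS": [4.0, 2.5, 1.5, 6.1, 0.3, 2.0],
})

# Toy stand-in for the salaries table ("Salary" is an assumed column name).
salaries = pd.DataFrame({
    "Player": ["Player A", "Player B", "Jeff Dowtin"],
    "Salary": [5_000_000, 12_000_000, 500_000],
})

# 1. Keep only the first row per player, which is the totals row for traded players.
advanced = advanced.drop_duplicates(subset="Player", keep="first")

# 2. Drop the "League Average" summary row.
advanced = advanced[advanced["Player"] != "League Average"]

# 3. Align the one mismatched name so the merge key matches.
advanced["Player"] = advanced["Player"].replace("Jeff Dowtin Jr.", "Jeff Dowtin")

# 4. Check the row counts agree, then merge; validate raises if a name repeats.
assert len(advanced) == len(salaries)
merged = pd.merge(advanced, salaries, on="Player", how="inner",
                  validate="one_to_one")
print(merged)
```

One caveat with step 1: two different players who share a name would be collapsed into a single row, so duplicated names are worth checking by hand.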
Multiple Linear Regression
To fit a multiple linear regression model, we will install and import the statsmodels package.
Now, we can fit the model.
We can then create scatterplots to see what player salary looks like relative to each stat.
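One way to draw these plots with matplotlib is sketched below, with one panel per stat and a shared salary axis. The helper name `plot_salary_vs_stats`, the `Salary` column name, and the data are all hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_salary_vs_stats(df, stats=("USG%", "WS", "PER"), salary_col="Salary"):
    """Draw one scatterplot of salary against each stat, side by side."""
    fig, axes = plt.subplots(1, len(stats), figsize=(5 * len(stats), 4),
                             sharey=True, squeeze=False)
    axes = axes[0]  # squeeze=False returns a 2D array; take the single row
    for ax, stat in zip(axes, stats):
        ax.scatter(df[stat], df[salary_col], alpha=0.6)
        ax.set_xlabel(stat)
        ax.set_title(f"{salary_col} vs. {stat}")
    axes[0].set_ylabel(salary_col)
    fig.tight_layout()
    return fig, axes

# Synthetic stand-in for the merged dataset (values are made up for illustration).
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "USG%": rng.normal(20, 5, n),
    "WS": rng.normal(3, 2, n),
    "PER": rng.normal(15, 4, n),
})
df["Salary"] = 2e6 + 1.5e6 * df["WS"] + rng.normal(0, 2e6, n)

fig, axes = plot_salary_vs_stats(df)
```

Sharing the y-axis keeps the salary scale identical across panels, which makes it easier to compare how tightly each stat tracks salary.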