Fantasy Football Project
The goal of this project is to apply my basic understanding of machine learning: fit a model to some training data, and use it to score the model on some test data.
Read in 2020 football data.
The year2020_football dataset contains performance data for NFL players during the 2020 season.
Read in 2019 football data.
The year2019_football dataset contains performance data for NFL players during the 2019 season.
Read in 2018 football data.
The year2018_football dataset contains performance data for NFL players during the 2018 season.
Read in 2017 football data.
The year2017_football dataset contains performance data for NFL players during the 2017 season.
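As a sketch of this step, each year's file can be read with `pandas.read_csv`. The file name and columns below are illustrative stand-ins, not the project's actual files; a `StringIO` substitutes for the CSV so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical sample of one year's raw export; the real project
# would read the full CSV, e.g. pd.read_csv("2020_football.csv").
sample_2020 = io.StringIO(
    "Player,Tm,G,Tgt,Rec,Yds,TD\n"
    "Alvin Kamara*,NOR,15,107,83,756,5\n"
    "Davante Adams*+,GNB,14,149,115,1374,18\n"
)

year2020_football = pd.read_csv(sample_2020)
print(year2020_football.head())
```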
Upon looking at the first few rows of our data, one can notice a lot of extra columns. Some of these columns are not necessary for this project, so we can use a function to drop the ones we will not use in the analysis.
Use the drop function to drop those columns in the datasets for all years.
In the output, the columns we identified in the function have been dropped.
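A minimal sketch of such a drop function. The column names (`Fmb`, `2PM`) are hypothetical stand-ins, not necessarily the columns the project drops; `errors="ignore"` keeps the function safe if a year's file lacks one of them.

```python
import pandas as pd

def drop_unused_columns(df, cols=("Fmb", "2PM")):
    """Drop columns we will not use in the analysis."""
    return df.drop(columns=list(cols), errors="ignore")

df = pd.DataFrame({"Player": ["A"], "Fmb": [1], "Yds": [100]})
cleaned = drop_unused_columns(df)
print(cleaned.columns.tolist())
```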
Create a function to clean up player names, which currently contain stray asterisks from the source data.
Create a function to rename columns so it is easier for the reader to interpret the data in the columns.
No output means success
With the output, the names in the datasets now look normal, and a reader who does not know much about American football could look up the column names to interpret the data.
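A sketch of the two cleanup functions. It assumes the asterisks (and plus signs) are appended to the names, as in common football data exports, and the column mapping is illustrative rather than the project's actual mapping.

```python
import pandas as pd

def clean_player_names(df, col="Player"):
    # Strip the stray '*' and '+' markers appended to some names.
    df[col] = df[col].str.replace(r"[*+]", "", regex=True)
    return df

def rename_columns(df):
    # Hypothetical mapping from terse headers to readable ones.
    return df.rename(columns={"Tm": "Team_Played_For", "G": "Games_Played",
                              "Yds": "Receiving_Yards", "TD": "Touchdowns"})

df = pd.DataFrame({"Player": ["Alvin Kamara*", "Davante Adams*+"],
                   "Tm": ["NOR", "GNB"]})
df = rename_columns(clean_player_names(df))
print(df)
```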
Create a new column called year within each dataset so it is easier to see what year the football data comes from, especially once we join the datasets.
With the output one can notice the new year column. Now adjust the column order in each dataset so year comes right after the player and team columns.
No output means success
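The year-tagging and reordering steps above could be sketched like this; the column names are illustrative stand-ins for the project's actual headers.

```python
import pandas as pd

df = pd.DataFrame({"Player": ["A", "B"], "Team_Played_For": ["X", "Y"],
                   "Receiving_Yards": [500, 700]})

# Tag every row with its season, then move Year right after
# the player and team columns.
df["Year"] = 2020
front = ["Player", "Team_Played_For", "Year"]
df = df[front + [c for c in df.columns if c not in front]]
print(df.columns.tolist())
```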
It is important to join datasets so we can see the data for multiple players and years they played in the NFL all in one dataframe.
We will use the pandas concatenation function to join the dataframes from each year of football data. The datasets share the same column headers, which is why concatenation is the right tool.
Convert year columns to integers
With the output, all of the datasets are now concatenated into one dataframe which will allow us to begin prepping for our analysis portion.
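A minimal sketch of the concatenation and year conversion on toy two-year dataframes:

```python
import pandas as pd

d2019 = pd.DataFrame({"Player": ["A"], "Year": ["2019"], "PPR": [110.0]})
d2020 = pd.DataFrame({"Player": ["A"], "Year": ["2020"], "PPR": [90.0]})

# Same column headers in every year, so a row-wise concat works.
all_years = pd.concat([d2019, d2020], ignore_index=True)
all_years["Year"] = all_years["Year"].astype(int)
print(all_years)
```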
For the Data Exploration portion:
These lines of code export the concatenated dataframe to a CSV file so I could use that file for my dashboard.
Let's take a brief look at the data before we dive into the analysis portion of this project. To do that, we will create a table.
Streamlit has an st.write function which allows a user to explore the rows and columns of a dataset. One can first explore the fantasy football data and then use the sidebar called user inputs to pick players and compare their touchdown performance on a bar chart.
The link to my published Heroku dashboard - https://fantasyfootball2311.herokuapp.com/
This link can also be found on my GitHub repository
For the analysis portion we are going to use machine learning classification to predict which players had a good season based on the current data.
Points per reception (PPR) is one way fantasy football is scored. It is a good indicator of whether a player had a good or bad season because it incorporates several important football statistics into a fantasy-game-style score.
We are going to need a Boolean response column for our logistic regression. Let's make a column called good season and give a 1 to players who had a good season and a 0 to players who did not. To decide whether a player had a good season, let's say PPR > 95 counts as a good season.
Create good season variable and bad season variable
Create good season column on the dataframe and give the value 1 to all players that met a good season requirement
Add the value 0 to players who fit the bad-season requirements so we have a Boolean column of 1s and 0s.
Combine both bad and good season columns so we can perform our machine learning analysis.
With the output one can notice the new Good_Season column in the concatenated dataframe.
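The three steps above (good-season rows, bad-season rows, combine) can also be sketched as a single vectorized comparison, using the PPR > 95 threshold from the text:

```python
import pandas as pd

df = pd.DataFrame({"Player": ["A", "B", "C"], "PPR": [120.0, 40.0, 96.0]})

# 1 = good season (PPR > 95), 0 = bad season.
df["Good_Season"] = (df["PPR"] > 95).astype(int)
print(df)
```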
We need numpy to identify the NaN values so we can convert them to 0. Import numpy.
Convert NaN values to zero in the columns we will use for the analysis portion; otherwise the logistic regression function will not run.
Upon output, all NaN values are converted to 0.
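A sketch of the NaN cleanup on toy data; the predictor column names are illustrative stand-ins.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Receiving_Yards": [756.0, np.nan],
                   "Touchdowns": [5.0, np.nan]})

# Logistic regression cannot handle NaN inputs, so zero them out
# in the predictor columns.
predictors = ["Receiving_Yards", "Touchdowns"]
df[predictors] = df[predictors].fillna(0)
print(df)
```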
Load scikit-learn's logistic regression tools and use them to fit a model to your training data. Use all variables in this model.
No output means success
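A minimal sketch of the fitting step, run on synthetic stand-in data (the project fits on the football predictor columns instead):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-in predictors and a Good_Season-style 0/1 label driven
# by a linear combination of two of them.
X = rng.normal(size=(200, 4))
y = (X[:, 2] + 0.5 * X[:, 3] > 0).astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)
print(model.score(X, y))
```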
Write formulas that compute the precision, recall, and F1 score for your model.
With the output we get precision, recall, and F1 scores on all the data. Refer to the report in my repository for more on what precision, recall, and F1 tell us about our model.
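The formulas can be written directly from the confusion-matrix counts: precision = TP/(TP+FP), recall = TP/(TP+FN), and F1 is their harmonic mean. Here they are computed on hypothetical predictions:

```python
# Hypothetical true labels and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```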
Split your training data into two parts, 60% and 40%. But instead of calling them training and testing data, call them training and validation data.
Train the model on just the training data, and compute its F1 score on just the validation data.
I was having trouble using .loc['column name'], so I print the validation dataset to confirm that the predictor columns I defined in the first line of the analysis map to the correct positional indices for .iloc.
Make sure these match
The F1 score combines the precision and recall of our model; here we report it on both the training and validation dataframes. Refer to the report in my repository for more on what precision, recall, and F1 tell us about our model.
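A sketch of the 60/40 split and validation-only F1, again on synthetic stand-in data rather than the project's dataframe:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))            # stand-in predictors
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in Good_Season label

# 60% training / 40% validation split.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Train on the training data only; score F1 on the validation data only.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_f1 = f1_score(y_val, model.predict(X_val))
print(val_f1)
```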
Now we can abstract all of our work into two functions, so that we can do some model fitting and scoring with two function calls.
same output as prior code just in a function now
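The two functions might look like this; the names `fit_model` and `score_model` are my own stand-ins, not necessarily the project's, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def fit_model(X, y):
    """Fit a logistic regression to the predictors and label."""
    return LogisticRegression(max_iter=1000).fit(X, y)

def score_model(model, X, y):
    """Return the model's F1 score on the given data."""
    return f1_score(y, model.predict(X))

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

m = fit_model(X, y)
print(score_model(m, X, y))
```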
- Fit a separate model to the standardized predictors, so that we can look at that second model's coefficients.
- Print out the coefficients for that second model. (You will still return the original model; this second model is only for printing standardized coefficients.)
These coefficients tell us which columns had the most effect on our model. It seems receiving yards affected the model the most, while games played had little effect.
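A sketch of the standardized-coefficients step: scale the predictors, fit a second model on the scaled values, and print its coefficients so their magnitudes are comparable across columns. The predictor names and synthetic data are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Predictors on very different scales; the third one drives the label.
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
y = (X[:, 2] > 0).astype(int)

# Standardize, then fit a second model just to inspect coefficients.
X_std = StandardScaler().fit_transform(X)
std_model = LogisticRegression(max_iter=1000).fit(X_std, y)

names = ["Games_Played", "Targets", "Receiving_Yards"]  # illustrative
for name, coef in zip(names, std_model.coef_[0]):
    print(name, round(coef, 3))
```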
In conclusion, my model scored about 0.9287 at predicting whether a player would have a good season on the current-year training data, and about 0.9444 on the current-year validation data.