# Fantasy Football Project

The goal of this project is to apply my basic understanding of machine learning: fitting a model to some training data, and using that model to score against some test data.

Read in 2020 football data.

The year2020_football dataset contains data on NFL players' performance during the 2020 season.

Read in 2019 football data.

The year2019_football dataset contains data on NFL players' performance during the 2019 season.

Read in 2018 football data.

The year2018_football dataset contains data on NFL players' performance during the 2018 season.

Read in 2017 football data.

The year2017_football dataset contains data on NFL players' performance during the 2017 season.
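The four seasons were all loaded the same way with `pd.read_csv`. A minimal sketch, using an in-memory stand-in for one season's file (the real CSV file names and columns are assumptions):

```python
import pandas as pd
from io import StringIO

# Stand-in for one season's CSV; the real files (e.g. a hypothetical
# "2020_football.csv") are read the same way with pd.read_csv.
sample_csv = StringIO("Player,Tm,Age\nTom Brady*,TAM,43\nDerrick Henry*+,TEN,26\n")
year2020_football = pd.read_csv(sample_csv)
```

Each season gets its own dataframe (`year2020_football`, `year2019_football`, and so on) before cleaning.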

# Data Cleaning

Looking at the first outputs of our data, one can notice a lot of extra columns. Some of these columns are not necessary for this project, so we can write a function to drop the ones we will not use in the analysis.

Use the drop function to remove those columns from the datasets for all years.

In the output, one can see that the columns identified in the function have been dropped.
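The drop step might look like this sketch (the column names here are placeholders, not the project's actual drop list):

```python
import pandas as pd

# Demo frame: two columns we keep, one we discard.
df = pd.DataFrame({"Player": ["A", "B"], "Tm": ["DAL", "GNB"], "2PM": [0, 1]})

def drop_unused(df, cols=("2PM",)):
    # errors="ignore" lets the same helper run on every season,
    # even if a column is missing from one year's file.
    return df.drop(columns=list(cols), errors="ignore")

df = drop_unused(df)
```

Calling the same helper on each year's dataframe keeps the cleaning consistent across seasons.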

Create a function to clean up player names, which currently contain stray asterisks carried over from the source data.

Create a function to rename the columns so the data is easier for the reader to interpret.
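Both cleaning helpers can be sketched as follows; the asterisk-stripping pattern and the rename mapping are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({"Player": ["Tom Brady*", "Derrick Henry*+"], "Yds": [4633, 2027]})

def clean_names(df):
    # Strip the trailing marker characters the source data appends to names.
    out = df.copy()
    out["Player"] = out["Player"].str.replace(r"[*+]", "", regex=True)
    return out

def rename_columns(df):
    # Mapping is an assumption; mirror whatever renames the project uses.
    return df.rename(columns={"Yds": "PassingYDs"})

df = rename_columns(clean_names(df))
```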

No output means success

With this output, the names in the datasets now look normal, and a reader who does not know much about American football could look up the column names to interpret the data.

Create a new column called Year within each dataset so it is easier to see what year the football data comes from, especially once we join the datasets.

The output shows the new Year column. Now reorder the columns in each dataset so Year comes right after the player and team columns.

No output means success
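The year-tagging and reordering steps might look like this sketch (the team column name `Tm` is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"Player": ["A"], "Tm": ["DAL"], "TotalTD": [10]})

# Tag every row of this season's frame with its year.
df["Year"] = 2020

# Move Year right after the player and team columns.
cols = ["Player", "Tm", "Year"] + [c for c in df.columns if c not in ("Player", "Tm", "Year")]
df = df[cols]
```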

Joining the datasets lets us see the data for every player, across all the NFL seasons they played, in a single dataframe.

We will use pandas' concatenation function to join the dataframes from each year of football data. Because the datasets share the same column headers, concatenation is the right tool for the join.
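The concatenation step described above can be sketched like this:

```python
import pandas as pd

a = pd.DataFrame({"Player": ["A"], "Year": [2019], "TotalTD": [5]})
b = pd.DataFrame({"Player": ["B"], "Year": [2020], "TotalTD": [7]})

# Same column headers in every season, so a simple row-wise concat works;
# ignore_index gives the combined frame a fresh 0..n-1 index.
combined = pd.concat([a, b], ignore_index=True)
```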

Convert the Year column to integers.

```
Player object
Team object
Year int64
TotalTD int64
Points_Per_Reception float64
RushingAtt int64
Receiving_yards_per_Reception float64
RushingYDs int64
PassingAtt int64
Fumbles_Lost_by_player int64
Rushing_Yards_per_Attempt float64
FantPos object
PassingYDs int64
Games_Played int64
Pass_Targets int64
ReceivingYDs int64
Receptions int64
ReceivingTD int64
Interceptions_Thrown int64
Age int64
Passes_Completed int64
RushingTD int64
PassingTD int64
dtype: object
```

With the output, all of the datasets are now concatenated into one dataframe, which allows us to begin prepping for the analysis portion.

# Dashboard Exploration

For the data exploration portion, these lines of code export the concatenated dataframe to a CSV file so that file can feed my dashboard.
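The export is a single `to_csv` call; the output file name here is an assumption (a temp path is used so the sketch runs anywhere):

```python
import os
import tempfile
import pandas as pd

combined = pd.DataFrame({"Player": ["Dak Prescott"], "Year": [2020]})

# index=False keeps pandas' row index out of the CSV the dashboard reads.
out_path = os.path.join(tempfile.gettempdir(), "fantasy_football_combined.csv")
combined.to_csv(out_path, index=False)
```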

Let's take a brief look at the data before we dive into the analysis portion of this project. To do that, we will create a table.

Streamlit has an st.write function which allows a user to explore the rows and columns of a dataset. One can first explore the fantasy football data and then use the sidebar, called User Inputs, to pick players and compare their touchdown performance on a bar chart.

The link to my published Heroku dashboard - https://fantasyfootball2311.herokuapp.com/

This link can also be found in my GitHub repository.

# Analysis Preparation

For the analysis portion we are going to use machine learning classification to predict which players had a good season based on the current data.

Points per reception (PPR) is one way fantasy football is scored. It is a good indicator of whether a player had a good or bad season, because it incorporates several important football statistics into a fantasy-game-style scoring system.

Our machine learning logistic regression needs a Boolean response column. Let's call this column Good_Season and assign a 1 to players who had a good season and a 0 to players who did not. As the cutoff, let's say a PPR score above 95 counts as a good season.

Create the good-season and bad-season variables.

Create the Good_Season column on the dataframe and assign the value 1 to all players who met the good-season requirement.

```
/shared-libs/python3.7/py-core/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
```

Assign the value 0 to players who fit the bad-season requirement, so we have a Boolean column of 1s and 0s.

```
/shared-libs/python3.7/py-core/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
```

Combine both bad and good season columns so we can perform our machine learning analysis.

The output shows the new Good_Season column on the concatenated dataframe.
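The SettingWithCopyWarning above comes from assigning into a slice of the dataframe. A warning-free sketch builds the whole column in one vectorized assignment on the frame itself, using the project's 95-point cutoff (column name taken from the dtype listing above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Player": ["A", "B"], "Points_Per_Reception": [120.0, 40.0]})

# np.where fills both the 1s and 0s in a single step, so no separate
# good-season and bad-season assignments (and no chained-indexing warning).
df["Good_Season"] = np.where(df["Points_Per_Reception"] > 95, 1, 0)
```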

We need NumPy to convert the NaN values to 0. Import numpy.

Convert NaN values to zero in the columns we will use for the analysis. This is needed for the logistic regression function to run.

The output shows all NaN values converted to 0.
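A minimal sketch of the NaN replacement (column names are placeholders):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"RushingYDs": [100.0, np.nan], "Receptions": [np.nan, 12.0]})

# fillna(0) replaces every NaN so scikit-learn's estimators will accept
# the predictor columns (they reject NaN inputs).
cols = ["RushingYDs", "Receptions"]
df[cols] = df[cols].fillna(0)
```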

# Analysis

Load scikit-learn's logistic regression tools and use them to fit a model to your training data. Use all variables in this model.

```
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:765: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
```

No output means success
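A minimal sketch of the fit, on synthetic stand-in data (the real predictors are the cleaned football columns and the response is Good_Season); `max_iter` is raised here because the default of 100 is what triggers the lbfgs convergence warning shown above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the predictor matrix and Good_Season labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# More iterations (or standardized predictors) lets lbfgs converge cleanly.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
```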

Write formulas that compute the precision, recall, and F1 score for your model.

The output shows precision, recall, and F1 scores on all the data. Refer to the report in my repository for more on what precision, recall, and F1 tell us about the model.
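The three formulas can be written from the confusion-matrix counts directly:

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
    precision = tp / (tp + fp)                   # of predicted good seasons, how many were
    recall = tp / (tp + fn)                      # of actual good seasons, how many we caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

p, r, f1 = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])  # each 0.5 here
```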

Split your training data into two parts, 60% and 40%. But instead of calling them training and testing data, call them training and validation data.

Train the model on just the training data, and compute its F1 score on just the validation data.

I was having trouble using .loc['column name'], so I print the validation dataset to make sure the predictor columns defined in the first line of the analysis map to the right positions for .iloc.

Make sure these match

```
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:765: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
```

The F1 output combines the precision and recall of our model, reported on both the training and validation dataframes. Refer to the report in my repository for more on what precision, recall, and F1 tell us about our model.
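The split-and-score step can be sketched with `train_test_split` on synthetic stand-in data (60/40 as described; we call the held-out 40% "validation" because it is used to check the model during development, not for a final test):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# test_size=0.4 yields the 60% training / 40% validation split.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_f1 = f1_score(y_val, model.predict(X_val))
```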

Now we can abstract all of our work into two functions, so that we can do some model fitting and scoring with two function calls.

```
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:765: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
```

Same output as the prior code, just wrapped in functions now.
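The two-function abstraction might look like this; the function names `fit_model` and `score_model` are assumptions, not necessarily the names used in the notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def fit_model(X, y):
    # Fit and return a logistic regression on the given predictors/labels.
    return LogisticRegression(max_iter=1000).fit(X, y)

def score_model(model, X, y):
    # F1 on whichever frame (training or validation) is passed in.
    return f1_score(y, model.predict(X))

# Demo on synthetic stand-in data.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)
model = fit_model(X, y)
```

With these two calls, refitting on a new split or rescoring on a different frame is a one-liner each.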

- Fit a separate model to the standardized predictors, so that we can look at that second model's coefficients.
- Print out the coefficients for that second model. (You will still return the original model; this second model is only for printing standardized coefficients.)

```
ReceivingYDs 3.294086
Receptions 3.016573
RushingYDs 2.217588
PassingTD 1.697942
TotalTD 1.483499
Passes_Completed 1.396565
PassingYDs 1.297549
PassingAtt 1.269441
ReceivingTD 1.210126
RushingAtt 1.075806
Pass_Targets 1.042798
RushingTD 0.756769
Interceptions_Thrown 0.746373
Fumbles_Lost_by_player -0.255302
Games_Played 0.212680
Rushing_Yards_per_Attempt 0.163699
Receiving_yards_per_Reception 0.053851
dtype: float64
0.9287410926365796 0.9443561208267091
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:765: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
```

These coefficients tell us which columns had the greatest effect on our model. Receiving yards affected the model the most, while games played had comparatively little effect.
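The standardized-coefficient comparison can be sketched like this (two stand-in columns; standardizing puts every predictor on the same scale so the coefficient magnitudes are directly comparable):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = pd.DataFrame({"ReceivingYDs": rng.normal(800, 300, 200),
                  "Games_Played": rng.normal(12, 3, 200)})
y = (X["ReceivingYDs"] > 800).astype(int)  # label driven by one predictor

# Second model on standardized predictors, used only for its coefficients.
X_std = StandardScaler().fit_transform(X)
std_model = LogisticRegression(max_iter=1000).fit(X_std, y)
coefs = pd.Series(std_model.coef_[0], index=X.columns)
```

Here `ReceivingYDs` drives the label, so its standardized coefficient dominates, mirroring the coefficient ranking printed above.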

In conclusion, my model scored 0.929 (F1) in predicting whether a player would have a good season on the current-year training data, and 0.944 on the validation data.