Capstone Project - Reel Analytics
Authors:
Part 0: Introduction
0.1 Project Overview
0.2 How it started
0.3 Reel Analytics Overview
Figure 0.3.1 - RA Grading Card (Rights and Source: https://reel-analytics.net/)
0.4 Project Scope
0.5 Reel Analytics Data
0.6 Pro Football Focus (PFF) Data
0.7 CollegeFootballData (CFBdata)
Part 1: Labeling the Data
1.0 PFF Data import & processing
Figure 1.0.1 - College Divisions Performance Factors
1.1 Player Ranking
Figure 1.1.1 - Ranking of both Fred Biletnikoff and John Mackey Winners - 2014 to 2022
1.2 UMAP dimensionality reduction
Figure 1.2.1 - UMAP 2D projection of PFF data
1.3 KMeans Cluster Analysis
Figure 1.3.1 - Number of Cluster Evaluation
Figure 1.3.2 - Number of Cluster Evaluation
For each of the selected clusters, we analyzed its overall ranking distribution, as seen in Figure 1.3.3. Then, to gain a deeper understanding of each cluster's characteristics, we extracted its prominent features, which allowed us to identify the key factors influencing its ranking. The respective analyses follow (a code sketch of the feature-extraction step appears after the figures below):
Cluster 3 - Elite Wide Receivers
Cluster 7 - Penalty Magnets/Disruptors
Cluster 5 - Stereotypical & Efficient Tight Ends
Figure 1.3.3 - Cluster Analysis: Ranking Distribution
Figure 1.3.4 - John Mackey Winners Through Clusters
Figure 1.3.5 - Fred Biletnikoff Winners Through Clusters
Figure 1.3.6 - Cluster Analysis - Labeled Receivers
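Below is a minimal sketch of how the per-cluster feature extraction described above might look, assuming a DataFrame pff_df that holds standardized PFF features alongside a cluster label and an overall rank column; the column names are illustrative, not the actual PFF schema.

```python
import pandas as pd

# Assumes pff_df holds standardized PFF features plus the KMeans labels in a
# "cluster" column and the overall ranking in a "rank" column (names assumed).
def prominent_features(pff_df: pd.DataFrame, cluster_id: int, top_n: int = 5) -> pd.Series:
    """Return the features whose cluster mean deviates most from the overall mean."""
    feature_cols = pff_df.columns.difference(["cluster", "rank"])
    overall_mean = pff_df[feature_cols].mean()
    overall_std = pff_df[feature_cols].std()
    cluster_mean = pff_df.loc[pff_df["cluster"] == cluster_id, feature_cols].mean()
    # Standardized gap between this cluster and the full population.
    z_gap = (cluster_mean - overall_mean) / overall_std
    return z_gap.abs().sort_values(ascending=False).head(top_n)

# Per-cluster ranking distribution (the quantity Figure 1.3.3 visualizes).
rank_summary = pff_df.groupby("cluster")["rank"].describe()
```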
Part 2: Modeling the Data
Our objective was to create a logistic regression model to predict whether a player from the RA database will be a successful Wide Receiver in college.
2.0 Combining and Cleaning
To do this, we started with the PFF dataset and applied the cluster label from Part 1 as our target. Because the sparse RA data lacks features that would differentiate players with the same name, we used the CollegeFootballData recruiting data to add a few additional players to the data frame used for the model.
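A rough sketch of this combining step is shown below; the frame names pff_labeled, ra_df, and cfb_recruits, the join on player name, and the column names are all illustrative assumptions rather than our exact pipeline.

```python
import pandas as pd

# pff_labeled: PFF data with the cluster-derived target label from Part 1.
# ra_df: Reel Analytics athletic scores. cfb_recruits: CFBdata recruiting info.
# Frame names, join keys, and column names here are illustrative.
model_df = ra_df.merge(pff_labeled[["name", "label"]], on="name", how="inner")

# Names matching more than one player cannot be resolved with the features
# available in the RA data, so they are dropped.
model_df = model_df[~model_df["name"].duplicated(keep=False)]

# Recover a few additional players via the CFBdata recruiting table.
extra = cfb_recruits.merge(pff_labeled[["name", "label"]], on="name", how="inner")
model_df = pd.concat([model_df, extra], ignore_index=True)
```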
The next step in cleaning the data frame was to apply thresholds based on RA's documentation (image below). Many data points fell outside the documented min and max range for each measured athletic score. To address this, we replaced any point above or below its threshold with a percentile-based value.
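A sketch of the percentile replacement, assuming a bounds mapping taken from RA's documentation and a 5th/95th-percentile replacement rule (the column names, bound values, and percentile choice are all placeholders):

```python
import numpy as np

# Min/max per athletic score, per RA's documentation (values here are placeholders).
bounds = {"speed_score": (0, 100), "agility_score": (0, 100)}

for col, (lo, hi) in bounds.items():
    # Percentile-based replacement values computed from the observed data.
    p_lo, p_hi = np.percentile(model_df[col].dropna(), [5, 95])
    model_df.loc[model_df[col] < lo, col] = p_lo
    model_df.loc[model_df[col] > hi, col] = p_hi
```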
2.1 Challenging Data
Figure 2.1.1 - Reel Analytics Data by Year
Additionally, as mentioned before, the RA dataset lacks player features that would help match it to the other datasets we're using; for example, it records neither the school a player committed to or attended nor anything about where he played his high school football. These gaps also shrank our usable data pool, as many players with common names could not be matched uniquely.
The funnel below visualizes the fallout we experienced with the dataset used for modeling.
Figure 2.1.2 - Data Fallout
2.2 Modeling the Data
With our smaller-than-desired data set of ~2.3k wide receivers, we used a 70/30 split for training and testing. Because the labels for successful versus not-successful college players are unbalanced, we applied oversampling to assist the model: first RandomOverSampler, then SMOTE (Synthetic Minority Over-sampling Technique). The two techniques combat class imbalance in different ways, but both add samples to produce a better balance. Notably, without these two steps the model returned no positive predictions.
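The split and oversampling steps might look roughly like this; the random seeds, the intermediate sampling ratio, and the exact feature columns are assumptions, with X and y standing for the feature matrix and the cluster-derived label.

```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Feature matrix and cluster-derived label from the combined frame above
# (the exact columns are an assumption).
X = model_df.drop(columns=["name", "label"])
y = model_df["label"]

# 70/30 split, stratified on the imbalanced success label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Random duplication of the minority class first, then SMOTE's synthetic
# interpolation; the intermediate 0.5 ratio is an illustrative assumption.
X_res, y_res = RandomOverSampler(sampling_strategy=0.5, random_state=42).fit_resample(X_train, y_train)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_res, y_res)
```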
The logistic regression model was then created with an L2 penalty and a class weight of ‘balanced’ to further address the class imbalance (shown in Figure 2.2.1). From there, we fed this model into a grid search with five-fold cross-validation to identify the best value of C (0.01).
Figure 2.2.1 - Mathematical Formula of our Logistic Model
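A sketch of that model and grid search is below. The L2 penalty, balanced class weight, five-fold cross-validation, and the selected C of 0.01 come from our setup; the candidate grid for C and the F1 scoring choice are assumptions (F1 is consistent with Section 2.3).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# L2-penalized logistic regression with balanced class weights.
log_reg = LogisticRegression(penalty="l2", class_weight="balanced", max_iter=1000)

# Grid search over C with five-fold cross-validation; 0.01 was the value selected.
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10]}  # candidate grid is illustrative
search = GridSearchCV(log_reg, param_grid, cv=5, scoring="f1")
search.fit(X_res, y_res)
best_model = search.best_estimator_
```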
2.3 Judging the Data
With our logistic regression model fit and generating predictions for the test data, it was time to evaluate its performance. We settled on the F1-score as our main metric because it captures both precision and recall, which matters given our class imbalance.
Below, we've printed a classification report to show the metrics for each outcome; an F1-score of 0.62 points toward an acceptable model given all of the constraints we faced.
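A minimal sketch of that evaluation, assuming scikit-learn's classification_report and the fitted best_model from the previous sketch:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 on the held-out 30% test split.
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred, digits=2))
```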
Figure 2.3.1 - Predicted and Actual Labels by Year
Part 3: Recreating the IGA Score
3.1 Feature Weighting
With the weights now linear, the next step was to normalize them so that the final score would also land on a 100-point scale.
We've included the Reel Analytics weights for direct comparison. Our weighting is more balanced across the five features than RA's.
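A minimal sketch of the normalization step; the raw_weights values are placeholders standing in for the linear feature weights derived above, not the actual numbers.

```python
import numpy as np

# Placeholder values standing in for the five linear feature weights.
raw_weights = np.array([1.8, 1.2, 0.9, 0.6, 0.5])

# Normalize so the weights sum to 1, which keeps the final score on a
# 100-point scale once the weighted percentile ranks are summed and x100'd.
norm_weights = raw_weights / raw_weights.sum()
```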
3.2 UM IGA Score
Next, we converted each player's measurement to a percentile rank and multiplied it by the corresponding normalized feature weight above. The last step was to sum the five weighted scores for each player and multiply by 100 to arrive at our unbiased UM IGA Score.
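A sketch of that scoring step, assuming the five measurement columns below (names are placeholders) and the normalized weights from Section 3.1:

```python
# The five measured athletic scores; column names are placeholders.
features = ["feat_1", "feat_2", "feat_3", "feat_4", "feat_5"]

# Percentile-rank each measurement (0 to 1), weight it, sum, and scale to 100.
# (For timed drills where lower is better, the rank direction would be flipped.)
pct_ranks = model_df[features].rank(pct=True)
model_df["um_iga_score"] = (pct_ranks * norm_weights).sum(axis=1) * 100

# Sort so the highest UM IGA scores appear first (the preview shown below).
leaderboard = model_df.sort_values("um_iga_score", ascending=False)
```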
A preview of our output is below, with the highest scores shown first.
Part 4: Comparing IGA Scores
Figure 4.1.1 - IGA Score Distributions
Figure 4.1.2 - Prediction Variance Distribution