Kaggle NCAAM 

by Abid, Apr 9, 2021
  1. NCAAM Spread of Score Difference Prediction
  2. Data preparation
    1. Seeds
    2. Season results
      1. Features
    3. Tourney results
    4. Ratings
      1. Massey Ordinals
  3. Feature Engineering
    1. Train data
      1. Seeds
      2. Season Stats
      3. Ratings
      4. Add symmetrical
      5. Differences
    2. Test Data
      1. Preparing
      2. Seeds
      3. Season Stats
      4. Ratings
      5. Differences
    3. Target
  4. Modeling
      1. Cross Validation
      2. Submission

NCAAM Spread of Score Difference Prediction


import os
import re
import sklearn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.metrics import *
from sklearn.linear_model import *
from sklearn.model_selection import *
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from lightgbm import LGBMClassifier, LGBMRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.semi_supervised import LabelSpreading
from catboost import CatBoostClassifier, CatBoostRegressor

pd.set_option('display.max_columns', None)

DATA_PATH = '/work/NCAAMData/'
for filename in os.listdir(DATA_PATH):
    print(filename)

Data preparation

Seeds

This file identifies the seeds for all teams in each NCAA® tournament, for all seasons of historical data. Thus, there are between 64 and 68 rows for each year, depending on whether there were any play-in games and how many there were. In recent years the structure has settled at 68 total teams, with four "play-in" games leading to the final field of 64 teams entering Round 1 on Thursday of the first week (by definition, that is DayNum=136 each season). We will not know the seeds of the respective tournament teams, or even exactly which 68 teams it will be, until Selection Sunday on March 15, 2020 (DayNum=132).

The seed is a 3- or 4-character identifier:

  • First character : Region (W, X, Y, or Z)
  • Next two digits : Seed within the region (01 to 16)
  • Last character (optional): Distinguishes the two teams in a play-in game (a or b)
df_seeds = pd.read_csv(DATA_PATH + "MNCAATourneySeeds.csv")
# df_seeds = pd.read_csv(DATA_PATH + "WNCAATourneySeeds.csv")
df_seeds.head()
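
The seed format described above can be split into its parts with a regular expression. This is only an illustrative sketch: `parse_seed` is a hypothetical helper that does not appear in the notebook, which later keeps just the numeric part of the seed.

```python
import re

# Hypothetical helper (not part of the notebook): split a seed string such as
# "W01" or "X16a" into region, seed number and optional play-in marker.
SEED_PATTERN = re.compile(r"^([WXYZ])(\d{2})([ab]?)$")


def parse_seed(seed):
    match = SEED_PATTERN.match(seed)
    if match is None:
        raise ValueError(f"Unexpected seed format: {seed!r}")
    region, number, play_in = match.groups()
    # An empty play-in marker means the team did not play a play-in game.
    return region, int(number), play_in or None


print(parse_seed("W01"))   # ('W', 1, None)
print(parse_seed("X16a"))  # ('X', 16, 'a')
```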

Season results

This file identifies the game-by-game results for many seasons of historical data, starting with the 1985 season (the first year the NCAA® had a 64-team tournament). For each season, the file includes all games played from DayNum 0 through 132. It is important to realize that the "Regular Season" games are simply defined to be all games played on DayNum=132 or earlier (DayNum=132 is Selection Sunday, and there are always a few conference tournament finals actually played early in the day on Selection Sunday itself). Thus a game played on or before Selection Sunday will show up here whether it was a pre-season tournament, a non-conference game, a regular conference game, a conference tournament game, or whatever.

df_season_results = pd.read_csv(DATA_PATH + "MRegularSeasonCompactResults.csv")
# df_season_results = pd.read_csv(DATA_PATH + "WRegularSeasonCompactResults.csv")
df_season_results.drop(['NumOT', 'WLoc'], axis=1, inplace=True)

df_season_results['ScoreGap'] = df_season_results['WScore'] - df_season_results['LScore']

df_season_results.head()

Features

For each team in each season, I compute:

  • Number of wins
  • Number of losses
  • Average score gap of wins
  • Average score gap of losses

From these, I derive the following features:

  • Win Ratio
  • Average score gap
# Number of wins per team and season
num_win = df_season_results.groupby(['Season', 'WTeamID']).count()
num_win = num_win.reset_index()[['Season', 'WTeamID', 'DayNum']].rename(
    columns={"DayNum": "NumWins", "WTeamID": "TeamID"}
)

# Number of losses per team and season
num_loss = df_season_results.groupby(['Season', 'LTeamID']).count()
num_loss = num_loss.reset_index()[['Season', 'LTeamID', 'DayNum']].rename(
    columns={"DayNum": "NumLosses", "LTeamID": "TeamID"}
)

# Average score gap in games won
gap_win = df_season_results.groupby(['Season', 'WTeamID']).mean().reset_index()
gap_win = gap_win[['Season', 'WTeamID', 'ScoreGap']].rename(
    columns={"ScoreGap": "GapWins", "WTeamID": "TeamID"}
)

# Average score gap in games lost
gap_loss = df_season_results.groupby(['Season', 'LTeamID']).mean().reset_index()
gap_loss = gap_loss[['Season', 'LTeamID', 'ScoreGap']].rename(
    columns={"ScoreGap": "GapLosses", "LTeamID": "TeamID"}
)

Merge

# One row per (Season, TeamID), whether the team appears as a winner or a loser
df_features_season_w = (
    df_season_results.groupby(['Season', 'WTeamID']).count().reset_index()
    [['Season', 'WTeamID']].rename(columns={"WTeamID": "TeamID"})
)
df_features_season_l = (
    df_season_results.groupby(['Season', 'LTeamID']).count().reset_index()
    [['Season', 'LTeamID']].rename(columns={"LTeamID": "TeamID"})
)

# axis must be passed by keyword in recent pandas versions
df_features_season = (
    pd.concat([df_features_season_w, df_features_season_l], axis=0)
    .drop_duplicates()
    .sort_values(['Season', 'TeamID'])
    .reset_index(drop=True)
)

df_features_season = df_features_season.merge(num_win, on=['Season', 'TeamID'], how='left')
df_features_season = df_features_season.merge(num_loss, on=['Season', 'TeamID'], how='left')
df_features_season = df_features_season.merge(gap_win, on=['Season', 'TeamID'], how='left')
df_features_season = df_features_season.merge(gap_loss, on=['Season', 'TeamID'], how='left')

# External 538 ratings, joined on the same keys
rating = pd.read_csv("/work/External Data/538ratingsMen.csv")
df_features_season = df_features_season.merge(rating, on=['Season', 'TeamID'], how='left')

# Teams with no wins, no losses or no rating get 0
df_features_season.fillna(0, inplace=True)

Compute features

df_features_season['WinRatio'] = (
    df_features_season['NumWins']
    / (df_features_season['NumWins'] + df_features_season['NumLosses'])
)
df_features_season['GapAvg'] = (
    (df_features_season['NumWins'] * df_features_season['GapWins']
     - df_features_season['NumLosses'] * df_features_season['GapLosses'])
    / (df_features_season['NumWins'] + df_features_season['NumLosses'])
)

df_features_season.drop(['NumWins', 'NumLosses', 'GapWins', 'GapLosses'], axis=1, inplace=True)
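
As a worked check of the two formulas above, the following sketch computes the win ratio and the average score gap for a single team on a tiny table. The games and team IDs are made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy regular-season results (made-up values, notebook column names).
games = pd.DataFrame({
    "Season":  [2020, 2020, 2020],
    "WTeamID": [1, 1, 2],
    "LTeamID": [2, 3, 1],
    "WScore":  [80, 64, 70],
    "LScore":  [70, 60, 64],
})
games["ScoreGap"] = games["WScore"] - games["LScore"]

team = 1
won = games[games["WTeamID"] == team]    # two wins, by 10 and by 4
lost = games[games["LTeamID"] == team]   # one loss, by 6

num_wins, num_losses = len(won), len(lost)
gap_wins = won["ScoreGap"].mean()        # 7.0
gap_losses = lost["ScoreGap"].mean()     # 6.0

win_ratio = num_wins / (num_wins + num_losses)
# Losses count negatively, so this is the signed average margin per game.
gap_avg = (num_wins * gap_wins - num_losses * gap_losses) / (num_wins + num_losses)

print(round(win_ratio, 3), round(gap_avg, 3))  # 0.667 2.667
```

Team 1 outscored its opponents by 10 + 4 - 6 = 8 points over three games, which matches the 8/3 returned by the formula.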

Tourney results

This file identifies the game-by-game NCAA® tournament results for all seasons of historical data. The data is formatted exactly like the MRegularSeasonCompactResults data. All games will show up as neutral site (so WLoc is always N). Note that this tournament game data also includes the play-in games (which always occurred on day 134/135) for those years that had play-in games. Thus each season you will see between 63 and 67 games listed, depending on how many play-in games there were.

df_tourney_results = pd.read_csv(DATA_PATH + "MNCAATourneyCompactResults.csv")
# df_tourney_results = pd.read_csv(DATA_PATH + "WNCAATourneyCompactResults.csv")
df_tourney_results.drop(['NumOT', 'WLoc'], axis=1, inplace=True)

The DayNum feature can be improved by replacing it with the corresponding tournament round.

def get_round(day):
    # round_dic = {134: 0, 135: 0, 136: 1, 137: 1, 138: 2, 139: 2, 143: 3, 144: 3, 145: 4, 146: 4, 152: 5, 154: 6}
    # Probably wrong, but I don't use it anyway.
    round_dic = {137: 0, 138: 0, 139: 1, 140: 1, 141: 2, 144: 3, 145: 3,
                 146: 4, 147: 4, 148: 4, 151: 5, 153: 5, 155: 6}
    try:
        return round_dic[day]
    except KeyError:
        # Days missing from the mapping fall back to round 0
        print(f'Unknown day : {day}')
        return 0
df_tourney_results.head()

Ratings

  • Ratings are only available for the men's data.

Massey Ordinals

This file lists out rankings (e.g. #1, #2, #3, ..., #N) of teams going back to the 2002-2003 season, under a large number of different ranking system methodologies.

  • Season - this is the year of the associated entry in MSeasons.csv (the year in which the final tournament occurs)
  • RankingDayNum - First day that it is appropriate to use the rankings for predicting games. Use 133 for the tournament.
  • SystemName - this is the (usually) 3-letter abbreviation for each distinct ranking system.
  • TeamID - this is the ID of the team being ranked, as described in MTeams.csv.
  • OrdinalRank - this is the overall ranking of the team in the underlying system. Most systems from recent seasons provide a complete ranking from #1 through #351, but sometimes there are ties and sometimes only a smaller set of rankings is provided, as with the AP's top 25. This year and last year they will typically go up to #353 because two new teams were added to Division I last year.
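
The notebook later refers to an `avg_ranking` table in commented-out code, but its construction is not shown. One plausible way to build it, sketched here as an assumption rather than the author's method, is to keep the rankings published for RankingDayNum 133 and average OrdinalRank across systems. The rows below are made up.

```python
import pandas as pd

# Toy stand-in for the Massey Ordinals file (made-up values, real column names).
ordinals = pd.DataFrame({
    "Season":        [2019, 2019, 2019, 2019, 2019],
    "RankingDayNum": [133, 133, 133, 133, 120],
    "SystemName":    ["POM", "SAG", "POM", "SAG", "POM"],
    "TeamID":        [1101, 1101, 1102, 1102, 1101],
    "OrdinalRank":   [10, 14, 50, 40, 30],
})

# Keep only the rankings intended for tournament prediction (day 133),
# then average over ranking systems for each team and season.
final = ordinals[ordinals["RankingDayNum"] == 133]
avg_ranking = final.groupby(["Season", "TeamID"], as_index=False)["OrdinalRank"].mean()

print(avg_ranking)
```

A table of this shape (Season, TeamID, OrdinalRank) could then be merged on the team columns in the same way as the seeds.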

Feature Engineering

Train data

df = df_tourney_results.copy()
df = df[df['Season'] >= 2003].reset_index(drop=True)
df.head()

  • Each row corresponds to a match between WTeamID and LTeamID, which was won by WTeamID.
  • I only keep matches from 2003 onward since I don't have the ratings for the older ones.
  • I start by aggregating features corresponding to each team.

Seeds

  • SeedW is the seed of the winning team
  • SeedL is the seed of the losing team
df = pd.merge(
    df,
    df_seeds,
    how='left',
    left_on=['Season', 'WTeamID'],
    right_on=['Season', 'TeamID']
).drop('TeamID', axis=1).rename(columns={'Seed': 'SeedW'})

df = pd.merge(
    df,
    df_seeds,
    how='left',
    left_on=['Season', 'LTeamID'],
    right_on=['Season', 'TeamID']
).drop('TeamID', axis=1).rename(columns={'Seed': 'SeedL'})

def treat_seed(seed):
    # Keep only the digits, e.g. "W16a" -> 16
    return int(re.sub("[^0-9]", "", seed))

df['SeedW'] = df['SeedW'].apply(treat_seed)
df['SeedL'] = df['SeedL'].apply(treat_seed)

df.head()

Season Stats

  • WinRatioW is the win ratio of the winning team during the season
  • WinRatioL is the win ratio of the losing team during the season
df = pd.merge(
    df,
    df_features_season,
    how='left',
    left_on=['Season', 'WTeamID'],
    right_on=['Season', 'TeamID']
).rename(columns={
    'NumWins': 'NumWinsW',
    'NumLosses': 'NumLossesW',
    'GapWins': 'GapWinsW',
    'GapLosses': 'GapLossesW',
    'WinRatio': 'WinRatioW',
    'GapAvg': 'GapAvgW',
}).drop(columns='TeamID')

df = pd.merge(
    df,
    df_features_season,
    how='left',
    left_on=['Season', 'LTeamID'],
    right_on=['Season', 'TeamID']
).rename(columns={
    'NumWins': 'NumWinsL',
    'NumLosses': 'NumLossesL',
    'GapWins': 'GapWinsL',
    'GapLosses': 'GapLossesL',
    'WinRatio': 'WinRatioL',
    'GapAvg': 'GapAvgL',
}).drop(columns='TeamID')

df.head()

Ratings

  • OrdinalRankW is the average Massey Ranking of the winning team
  • OrdinalRankL is the average Massey Ranking of the losing team

Add symmetrical

  • Right now our data only consists of won matches, seen from the winner's side.
  • We duplicate the data with the two teams swapped, and replace the winner/loser (W/L) columns with neutral team A / team B columns.

def add_loosing_matches(win_df):
    win_rename = {
        "WTeamID": "TeamIdA",
        "WScore": "ScoreA",
        "LTeamID": "TeamIdB",
        "LScore": "ScoreB",
        "SeedW": "SeedA",
        "SeedL": "SeedB",
        'WinRatioW': 'WinRatioA',
        'WinRatioL': 'WinRatioB',
        'GapAvgW': 'GapAvgA',
        'GapAvgL': 'GapAvgB',
        # "OrdinalRankW": "OrdinalRankA",
        # "OrdinalRankL": "OrdinalRankB",
    }

    lose_rename = {
        "WTeamID": "TeamIdB",
        "WScore": "ScoreB",
        "LTeamID": "TeamIdA",
        "LScore": "ScoreA",
        "SeedW": "SeedB",
        "SeedL": "SeedA",
        'GapAvgW': 'GapAvgB',
        'GapAvgL': 'GapAvgA',
        'WinRatioW': 'WinRatioB',
        'WinRatioL': 'WinRatioA',
        # The merged 538 ratings carry _x (winner) and _y (loser) suffixes.
        # Swap them too, so that _x always belongs to team A as in the test data.
        '538rating_x': '538rating_y',
        '538rating_y': '538rating_x',
        # "OrdinalRankW": "OrdinalRankB",
        # "OrdinalRankL": "OrdinalRankA",
    }

    win_df = win_df.copy()
    lose_df = win_df.copy()

    win_df = win_df.rename(columns=win_rename)
    lose_df = lose_df.rename(columns=lose_rename)

    # axis must be passed by keyword in recent pandas versions
    return pd.concat([win_df, lose_df], axis=0, sort=False)

df = add_loosing_matches(df)
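
To make the effect of this duplication concrete, here is a minimal sketch on a two-game table with made-up scores. It uses only the ID and score columns; the notebook's function applies the same idea to every per-team feature.

```python
import pandas as pd

# Toy winner/loser table (made-up values).
wins = pd.DataFrame({
    "WTeamID": [1, 3], "LTeamID": [2, 4],
    "WScore": [70, 65], "LScore": [60, 64],
})

# View 1: the winner is team A. View 2: the loser is team A.
as_a = wins.rename(columns={"WTeamID": "TeamIdA", "WScore": "ScoreA",
                            "LTeamID": "TeamIdB", "LScore": "ScoreB"})
as_b = wins.rename(columns={"WTeamID": "TeamIdB", "WScore": "ScoreB",
                            "LTeamID": "TeamIdA", "LScore": "ScoreA"})

# concat aligns columns by name, so the differing column order is harmless.
both = pd.concat([as_a, as_b], axis=0, sort=False, ignore_index=True)
both["ScoreDiff"] = both["ScoreA"] - both["ScoreB"]

print(both)
```

Every game now appears twice with opposite score differences, so the target is balanced around zero and a model cannot learn that "team A always wins".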

Differences

  • We compute the difference between the two teams for each feature.
  • This helps assess how much better (or worse) team A is than team B.

df['SeedDiff'] = df['SeedA'] - df['SeedB']
df['ratingDiff'] = df['538rating_x'] - df['538rating_y']
df['WinRatioDiff'] = df['WinRatioA'] - df['WinRatioB']
df['GapAvgDiff'] = df['GapAvgA'] - df['GapAvgB']
# df['OrdinalRankDiff'] = df['OrdinalRankA'] - df['OrdinalRankB']

df.head()

Test Data

Preparing

df_test = pd.read_csv(DATA_PATH + "MSampleSubmissionStage2.csv")
# df_test = pd.read_csv(DATA_PATH + "WSampleSubmissionStage1.csv")

# IDs have the form Season_TeamIdA_TeamIdB
df_test['Season'] = df_test['ID'].apply(lambda x: int(x.split('_')[0]))
df_test['TeamIdA'] = df_test['ID'].apply(lambda x: int(x.split('_')[1]))
df_test['TeamIdB'] = df_test['ID'].apply(lambda x: int(x.split('_')[2]))

df_test.head()
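
The three `apply` calls above can also be written as a single vectorized split. This is an optional alternative, shown on a made-up two-row sample; the IDs are illustrative and not taken from the competition files.

```python
import pandas as pd

# Made-up sample in the Season_TeamIdA_TeamIdB format used by the submission file.
sample = pd.DataFrame({"ID": ["2021_1101_1104", "2021_1101_1111"]})

# Split every ID once, convert the pieces to integers, and name the columns.
parts = sample["ID"].str.split("_", expand=True).astype(int)
parts.columns = ["Season", "TeamIdA", "TeamIdB"]

sample = pd.concat([sample, parts], axis=1)
print(sample)
```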

Seeds

df_test = pd.merge(
    df_test,
    df_seeds,
    how='left',
    left_on=['Season', 'TeamIdA'],
    right_on=['Season', 'TeamID']
).drop('TeamID', axis=1).rename(columns={'Seed': 'SeedA'})

df_test = pd.merge(
    df_test,
    df_seeds,
    how='left',
    left_on=['Season', 'TeamIdB'],
    right_on=['Season', 'TeamID']
).drop('TeamID', axis=1).rename(columns={'Seed': 'SeedB'})

df_test['SeedA'] = df_test['SeedA'].apply(treat_seed)
df_test['SeedB'] = df_test['SeedB'].apply(treat_seed)

Season Stats

df_test = pd.merge(
    df_test,
    df_features_season,
    how='left',
    left_on=['Season', 'TeamIdA'],
    right_on=['Season', 'TeamID']
).rename(columns={
    'NumWins': 'NumWinsA',
    'NumLosses': 'NumLossesA',
    'GapWins': 'GapWinsA',
    'GapLosses': 'GapLossesA',
    'WinRatio': 'WinRatioA',
    'GapAvg': 'GapAvgA',
}).drop(columns='TeamID')

df_test = pd.merge(
    df_test,
    df_features_season,
    how='left',
    left_on=['Season', 'TeamIdB'],
    right_on=['Season', 'TeamID']
).rename(columns={
    'NumWins': 'NumWinsB',
    'NumLosses': 'NumLossesB',
    'GapWins': 'GapWinsB',
    'GapLosses': 'GapLossesB',
    'WinRatio': 'WinRatioB',
    'GapAvg': 'GapAvgB',
}).drop(columns='TeamID')

Ratings

# df_test = pd.merge(
#     df_test,
#     avg_ranking,
#     how='left',
#     left_on=['Season', 'TeamIdA'],
#     right_on=['Season', 'TeamID']
# ).drop('TeamID', axis=1).rename(columns={'OrdinalRank': 'OrdinalRankA'})

# df_test = pd.merge(
#     df_test,
#     avg_ranking,
#     how='left',
#     left_on=['Season', 'TeamIdB'],
#     right_on=['Season', 'TeamID']
# ).drop('TeamID', axis=1).rename(columns={'OrdinalRank': 'OrdinalRankB'})

Differences

df_test['SeedDiff'] = df_test['SeedA'] - df_test['SeedB']
df_test['ratingDiff'] = df_test['538rating_x'] - df_test['538rating_y']
df_test['WinRatioDiff'] = df_test['WinRatioA'] - df_test['WinRatioB']
df_test['GapAvgDiff'] = df_test['GapAvgA'] - df_test['GapAvgB']
# df_test['OrdinalRankDiff'] = df_test['OrdinalRankA'] - df_test['OrdinalRankB']

df_test.head()

Target

df['ScoreDiff'] = df['ScoreA'] - df['ScoreB']
df['WinA'] = (df['ScoreDiff'] > 0).astype(int)

Modeling

features = [
    'SeedA',
    'SeedB',
    'WinRatioA',
    'GapAvgA',
    'WinRatioB',
    'GapAvgB',
    # 'OrdinalRankA',
    # 'OrdinalRankB',
    'SeedDiff', '538rating_x', '538rating_y', 'ratingDiff',
    'WinRatioDiff',
    'GapAvgDiff',
    # 'OrdinalRankDiff',
]

from sklearn.preprocessing import (
    StandardScaler, MaxAbsScaler, PolynomialFeatures, MinMaxScaler
)

def rescale(features, df_train, df_val, df_test=None):
    # min_ = df_train[features].min()
    # max_ = df_train[features].max()
    # Fit the scaler on the training data only, then reuse it everywhere
    scaler = MinMaxScaler()
    df_train[features] = scaler.fit_transform(df_train[features])
    df_val[features] = scaler.transform(df_val[features])
    # df_train[features] = (df_train[features] - min_) / (max_ - min_)
    # df_val[features] = (df_val[features] - min_) / (max_ - min_)

    if df_test is not None:
        # df_test[features] = (df_test[features] - min_) / (max_ - min_)
        df_test[features] = scaler.transform(df_test[features])

    return df_train, df_val, df_test

Cross Validation

  • Validate on season n, for each of the most recent seasons (the code uses seasons[13:]).
  • Train on the earlier seasons only.
  • The pipeline supports classification (predict the team that wins) and regression (predict the score gap).
def kfold_reg(df, df_test_=None, plot=False, verbose=0, mode="reg"):
    seasons = df['Season'].unique()
    cvs = []
    pred_tests = []
    target = "ScoreDiff" if mode == "reg" else "WinA"

    for season in seasons[13:]:
        if verbose:
            print(f'\nValidating on season {season}')

        df_train = df[df['Season'] < season].reset_index(drop=True).copy()
        df_val = df[df['Season'] == season].reset_index(drop=True).copy()
        # Guard against the default df_test_=None, which has no .copy()
        df_test = df_test_.copy() if df_test_ is not None else None

        df_train, df_val, df_test = rescale(features, df_train, df_val, df_test)

        if mode == "reg":
            # model = ElasticNet(alpha=1, l1_ratio=0.5)
            model = CatBoostRegressor(iterations=200, od_type="Iter", l2_leaf_reg=3,
                                      learning_rate=0.3, depth=13, verbose=0)
            # model = LGBMRegressor(learning_rate=0.001, n_estimators=2000,
            #                       random_state=33)
        elif mode == "lgbm":
            model = LGBMClassifier(learning_rate=0.005, n_estimators=1000,
                                   num_leaves=32, random_state=33)
        elif mode == "nb":
            model = MultinomialNB(alpha=0.5)
        elif mode == "cat":
            model = CatBoostClassifier(iterations=100, od_type="Iter", l2_leaf_reg=5,
                                       learning_rate=0.3, depth=13, verbose=0)
        elif mode == "ls":
            model = LabelSpreading(kernel='rbf', n_neighbors=8, alpha=0.01,
                                   max_iter=500, tol=0.003)
        else:
            model = LogisticRegression(C=10)

        model.fit(df_train[features], df_train[target])

        if mode == "reg":
            pred = model.predict(df_val[features])
            # pred = (pred - pred.min()) / (pred.max() - pred.min())
        else:
            pred = model.predict_proba(df_val[features])[:, 1]

        if df_test is not None:
            if mode == "reg":
                pred_test = model.predict(df_test[features])
                # pred_test = (pred_test - pred_test.min()) / (pred_test.max() - pred_test.min())
            else:
                pred_test = model.predict_proba(df_test[features])[:, 1]

            pred_tests.append(pred_test)

        if plot:
            plt.figure(figsize=(15, 6))
            plt.subplot(1, 2, 1)
            plt.scatter(pred, df_val['ScoreDiff'].values, s=5)
            plt.grid(True)
            plt.subplot(1, 2, 2)
            sns.histplot(pred)
            plt.show()

        if mode == "reg":
            # Regression output is a score difference, not a probability,
            # so score it with RMSE against ScoreDiff instead of log loss.
            loss = np.sqrt(mean_squared_error(df_val['ScoreDiff'].values, pred))
        else:
            loss = log_loss(df_val['WinA'].values, pred)
        cvs.append(loss)

        if verbose:
            print(f'\t -> Scored {loss:.3f}')

    print(f'\n Local CV is {np.mean(cvs):.3f}')

    return pred_tests

pred_tests = kfold_reg(df, df_test, plot=False, verbose=1, mode="reg")
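
The validation scheme described above can be isolated in a few lines. In this sketch `season_splits` is a hypothetical helper, and the 2003 to 2019 span is an assumption about the data rather than something stated in the notebook; with 17 seasons, starting at index 13 yields four validation seasons.

```python
def season_splits(seasons, first_val_index):
    """Yield (train_seasons, val_season) pairs in chronological order.

    Hypothetical helper: each validation season is preceded by a training
    set made of strictly earlier seasons only.
    """
    ordered = sorted(seasons)
    for val_season in ordered[first_val_index:]:
        train_seasons = [s for s in ordered if s < val_season]
        yield train_seasons, val_season


# Assumed season span, for illustration only.
seasons = list(range(2003, 2020))
splits = list(season_splits(seasons, 13))

for train, val in splits:
    print(f"train {train[0]}-{train[-1]} -> validate {val}")
```

Because each fold trains only on seasons before the one it validates on, no future game leaks into the local score.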

Submission

  • Note that this pipeline is leaky during the first stage of the competition: the LB score will be underestimated, since the last 4 models were trained on seasons that are part of the Stage 1 test set.
# Average the per-fold test predictions, then truncate to integers
pred_test = np.mean(pred_tests, axis=0).astype('int')

sub = df_test[['ID', 'Pred']].copy()
sub['Pred'] = pred_test
sub.to_csv('submission.csv', index=False)

_ = sns.histplot(sub['Pred'])

sub.head()
