NCAAM Spread of Score Difference Prediction
Data preparation
Seeds
This file identifies the seeds for all teams in each NCAA® tournament, for all seasons of historical data. Thus, there are between 64 and 68 rows for each year, depending on whether there were any play-in games and how many there were. In recent years the structure has settled at 68 total teams, with four "play-in" games leading to the final field of 64 teams entering Round 1 on Thursday of the first week (by definition, that is DayNum=136 each season). We will not know the seeds of the respective tournament teams, or even exactly which 68 teams it will be, until Selection Sunday on March 15, 2020 (DayNum=132).
The seed is a 3- or 4-character string:
- First character: region (W, X, Y, or Z)
- Next two digits: seed within the region (01 to 16)
- Last character (optional): distinguishes the two teams in a play-in game (a or b)
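The seed layout above can be split into its parts with a small regular expression. This is a minimal sketch, not the notebook's own code: the function name parse_seed and the returned dictionary keys are hypothetical.

```python
import re

# Pattern for the seed layout described above: region letter, two-digit seed,
# optional play-in suffix.
SEED_PATTERN = re.compile(r"^([WXYZ])(\d{2})([ab]?)$")


def parse_seed(seed):
    """Split a seed string such as 'W01' or 'X16a' into its parts."""
    match = SEED_PATTERN.match(seed)
    if match is None:
        raise ValueError(f"Unexpected seed format: {seed!r}")
    region, number, play_in = match.groups()
    # An empty suffix means the team did not go through a play-in game.
    return {"region": region, "seed": int(number), "play_in": play_in or None}


print(parse_seed("W01"))
print(parse_seed("X16a"))
```

Only the numeric part is usually needed as a model feature, so the region and play-in suffix can be dropped once the seed is parsed.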
Season results
This file identifies the game-by-game results for many seasons of historical data, starting with the 1985 season (the first year the NCAA® had a 64-team tournament). For each season, the file includes all games played from DayNum 0 through 132. It is important to realize that the "Regular Season" games are simply defined to be all games played on DayNum=132 or earlier (DayNum=132 is Selection Sunday, and there are always a few conference tournament finals actually played early in the day on Selection Sunday itself). Thus a game played on or before Selection Sunday will show up here whether it was a pre-season tournament, a non-conference game, a regular conference game, a conference tournament game, or whatever.
Features
For each team in each season, I compute:
- Number of wins
- Number of losses
- Average score gap of wins
- Average score gap of losses
From these, I derive the following features:
- Win Ratio
- Average score gap
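A minimal sketch of these season statistics with pandas, run on an invented three-game table that mimics the columns of MRegularSeasonCompactResults. The column names NumWins, GapWins, NumLosses, GapLosses and GapAvg are hypothetical, and the way the two gaps are combined into one average (a signed margin over all games) is an assumption, not something the text specifies.

```python
import pandas as pd

# Toy stand-in for MRegularSeasonCompactResults; the values are invented.
games = pd.DataFrame({
    "Season":  [2019, 2019, 2019],
    "WTeamID": [1101, 1101, 1102],
    "WScore":  [80, 70, 65],
    "LTeamID": [1102, 1103, 1101],
    "LScore":  [70, 60, 60],
})
games["ScoreGap"] = games["WScore"] - games["LScore"]

# Per-team, per-season counts and mean margins, seen from the winner's side...
wins = (games.groupby(["Season", "WTeamID"])["ScoreGap"]
        .agg(NumWins="count", GapWins="mean")
        .rename_axis(["Season", "TeamID"]))
# ...and from the loser's side.
losses = (games.groupby(["Season", "LTeamID"])["ScoreGap"]
          .agg(NumLosses="count", GapLosses="mean")
          .rename_axis(["Season", "TeamID"]))

# Outer join so that teams with no wins (or no losses) are kept, with zeros.
features = wins.join(losses, how="outer").fillna(0).reset_index()

n_games = features["NumWins"] + features["NumLosses"]
features["WinRatio"] = features["NumWins"] / n_games
# Assumed definition: signed average margin over all games of the season.
features["GapAvg"] = (features["NumWins"] * features["GapWins"]
                      - features["NumLosses"] * features["GapLosses"]) / n_games

print(features[["Season", "TeamID", "WinRatio", "GapAvg"]])
```

The outer join matters: an inner join would silently drop any team that finished a season without a win or without a loss.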
Merge
Compute features
Tourney results
This file identifies the game-by-game NCAA® tournament results for all seasons of historical data. The data is formatted exactly like the MRegularSeasonCompactResults data. All games will show up as neutral site (so WLoc is always N). Note that this tournament game data also includes the play-in games (which always occurred on day 134/135) for those years that had play-in games. Thus each season you will see between 63 and 67 games listed, depending on how many play-in games there were.
The DayNum feature can be improved by replacing it with the corresponding round.
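One way to do this replacement is a lookup table. The text above only fixes days 134/135 (play-in games) and day 136 (start of Round 1); the remaining entries below follow the usual tournament calendar and are an assumption to check against the data, as are the names ROUND_BY_DAYNUM and daynum_to_round.

```python
# Assumed DayNum-to-round schedule. Only the play-in days (134/135) and the
# start of Round 1 (136) come from the text; the other days are an assumption.
ROUND_BY_DAYNUM = {
    134: 0, 135: 0,  # play-in games
    136: 1, 137: 1,  # Round 1
    138: 2, 139: 2,  # Round 2
    143: 3, 144: 3,  # Round 3 (Sweet Sixteen)
    145: 4, 146: 4,  # Round 4 (Elite Eight)
    152: 5,          # Round 5 (Final Four)
    154: 6,          # Round 6 (national final)
}


def daynum_to_round(day_num):
    """Return the tournament round for a DayNum, or None if the day is not in the table."""
    return ROUND_BY_DAYNUM.get(day_num)


print(daynum_to_round(134))  # play-in
print(daynum_to_round(136))  # Round 1
```

On a pandas DataFrame the same table can be applied with df["DayNum"].map(ROUND_BY_DAYNUM). Any season played on an irregular calendar would need its own table.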
Ratings
- Ratings are only available for the men's competition.
Massey Ordinals
This file lists out rankings (e.g. #1, #2, #3, ..., #N) of teams going back to the 2002-2003 season, under a large number of different ranking system methodologies.
- Season - this is the year of the associated entry in MSeasons.csv (the year in which the final tournament occurs)
- RankingDayNum - First day that it is appropriate to use the rankings for predicting games. Use 133 for the tournament.
- SystemName - this is the (usually) 3-letter abbreviation for each distinct ranking system.
- TeamID - this is the ID of the team being ranked, as described in MTeams.csv.
- OrdinalRank - this is the overall ranking of the team in the underlying system. Most systems from recent seasons provide a complete ranking from #1 through #351, but sometimes there are ties and sometimes only a smaller set of rankings is provided, as with the AP's top 25. This year and last year they will typically go up to #353 because two new teams were added to Division I last year.
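Since the features further down use an average Massey ranking per team, here is a minimal sketch of that aggregation, assuming the average is taken over all systems at RankingDayNum 133. The system abbreviations and rank values in the toy table are placeholders, and the name avg_rank is hypothetical.

```python
import pandas as pd

# Toy stand-in for the Massey Ordinals file; systems and ranks are invented.
ordinals = pd.DataFrame({
    "Season":        [2019, 2019, 2019, 2019, 2019],
    "RankingDayNum": [133, 133, 133, 133, 128],
    "SystemName":    ["POM", "SAG", "POM", "SAG", "POM"],
    "TeamID":        [1101, 1101, 1102, 1102, 1101],
    "OrdinalRank":   [3, 5, 40, 44, 9],
})

# Keep only the rankings that are valid for predicting tournament games.
final = ordinals[ordinals["RankingDayNum"] == 133]

# Average across ranking systems, per team and season.
avg_rank = (final.groupby(["Season", "TeamID"])["OrdinalRank"]
            .mean()
            .reset_index())

print(avg_rank)
```

Averaging over every system treats a partial ranking such as a top 25 the same as a full one, so restricting the average to a few complete systems may be worth trying.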
Feature Engineering
Train data
- Each row corresponds to a match between WTeamID and LTeamID, which was won by WTeamID.
- I only keep matches after 2003 since I don't have the ratings for the older ones.
- I start by aggregating features corresponding to each team.
Seeds
- SeedW is the seed of the winning team
- SeedL is the seed of the losing team
Season Stats
- WinRatioW is the win ratio of the winning team during the season
- WinRatioL is the win ratio of the losing team during the season
Ratings
- OrdinalRankW is the average Massey Ranking of the winning team
- OrdinalRankL is the average Massey Ranking of the losing team
Add symmetrical
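- Right now our data only consists of won matches, i.e. every row is seen from the winner's side.
- We duplicate the data with the two teams swapped and get rid of the winner/loser naming, so each match also appears from the loser's side.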
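A minimal sketch of this duplication on a single invented match. The A/B column names (TeamIdA, SeedA, ScoreA, and so on) are hypothetical; the idea is simply to rename the winner/loser columns twice, once in each order, and stack the results.

```python
import pandas as pd

# One invented match, in the winner/loser layout.
train = pd.DataFrame({
    "Season":  [2019],
    "WTeamID": [1101], "LTeamID": [1102],
    "SeedW":   [1],    "SeedL":   [16],
    "WScore":  [80],   "LScore":  [60],
})

# View 1: team A is the winner, team B is the loser.
win_view = train.rename(columns={
    "WTeamID": "TeamIdA", "LTeamID": "TeamIdB",
    "SeedW": "SeedA", "SeedL": "SeedB",
    "WScore": "ScoreA", "LScore": "ScoreB",
})
# View 2: same match, roles swapped, so team A is the loser.
lose_view = train.rename(columns={
    "LTeamID": "TeamIdA", "WTeamID": "TeamIdB",
    "SeedL": "SeedA", "SeedW": "SeedB",
    "LScore": "ScoreA", "WScore": "ScoreB",
})

# concat aligns on column names, so the different column order is harmless.
sym = pd.concat([win_view, lose_view], ignore_index=True)
print(sym)
```

After this step every match appears twice, once per side, which gives a classifier both positive and negative examples.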
Differences
- We compute the difference between the two teams for each feature.
- This helps assess how much better (or worse) team A is than team B.
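A minimal sketch of the difference features, assuming each feature is stored once per side with an A or B suffix. The column names and the values in the toy table are hypothetical.

```python
import pandas as pd

# Two invented rows: the same match seen from each side.
pairs = pd.DataFrame({
    "SeedA":     [1, 16],
    "SeedB":     [16, 1],
    "WinRatioA": [0.75, 0.5],
    "WinRatioB": [0.5, 0.75],
})

# One difference column per feature: team A's value minus team B's value.
for name in ["Seed", "WinRatio"]:
    pairs[f"{name}Diff"] = pairs[f"{name}A"] - pairs[f"{name}B"]

print(pairs[["SeedDiff", "WinRatioDiff"]])
```

The two rows of a mirrored match get opposite differences, which is what one would expect from a symmetrical dataset.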
Test Data
Preparing
Seeds
Season Stats
Ratings
Differences
Target
Modeling
Cross Validation
- Validate on season n, for n in the 10 last seasons.
- Train on earlier seasons.
- The pipeline supports classification (predict the team that wins) and regression (predict the score gap).
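The season-based validation scheme can be sketched as a small generator. The function name season_splits and its n_val parameter are hypothetical; the season range in the example is only an illustration.

```python
def season_splits(seasons, n_val=10):
    """Yield (train_seasons, val_season) pairs for the last n_val seasons."""
    ordered = sorted(set(seasons))
    for val_season in ordered[-n_val:]:
        # Train only on seasons strictly before the validation season.
        train_seasons = [s for s in ordered if s < val_season]
        if train_seasons:  # skip a fold that would have nothing to train on
            yield train_seasons, val_season


splits = list(season_splits(range(2003, 2020), n_val=10))
for train_seasons, val_season in splits:
    print(f"validate on {val_season}, train on {train_seasons[0]}-{train_seasons[-1]}")
```

With a pandas DataFrame, the rows of a fold can then be selected with df[df["Season"].isin(train_seasons)] and df[df["Season"] == val_season]. Training only on earlier seasons keeps each fold free of information from the future.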
Submission
- Note that this pipeline is leaky during the first stage of the competition : the LB will be underestimated since the last 4 models were trained