Introduction.
I'll just say it: I don't know the first thing about football. I played for my hometown under-12s, but once I moved to the country from the city I grew up involved in almost every sport but football. I'm approaching the sport as a BI project because that's what I know! So I'm essentially working with black-box data in the hope that something other than the goal variable can help uproot some otherwise unseen value in a competitive football player. Because I'm doing this to teach myself (as opposed to an audience), some of the charts may seem technical (read: confusing; usually it's my job to make these things as easy for someone to read as humanly possible... but not this time) and the end model may or may not just return some insanely obvious insights, but I'm a footie noob and I need all the help I can get.
The data was collected from a football statistics site after a couple of days' worth of scraping efforts proved to be fruitless. I had tried several sites using Selenium and Beautiful Soup but current anti-scraping measures hampered all endeavours. I managed to grab the odd dataframe full of team info but the scraping limit was reached quite rapidly, and maxing-out the limit enough times to fill additional dataframes with the info I wanted would have taken over a week.
I am continually returning to this to add to the analysis around other projects, so it is likely to change (April 2024).
The data.
Cleaning the column labels.
Missing values.
There are a lot of columns here with 100% missing values. This is a recurring theme with several datasets I've worked with, and it's always a pity because a lot of these columns (and the ones in previous projects) could have been super insightful. As it stands with this dataset though, there is still a lot of rich information remaining.
Dropping 100% missing value columns.
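A minimal sketch of this step, assuming the data sits in a pandas DataFrame called `df`. The toy rows and the name of the empty column are made up for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values and column names are hypothetical).
df = pd.DataFrame({
    "full_name": ["Player A", "Player B", "Player C"],
    "goals_overall": [3, 0, 7],
    "shots_total_overall": [np.nan, np.nan, np.nan],  # a 100% missing column
})

# Percentage of missing values per column, handy for spotting the empty ones.
missing_pct = df.isna().mean() * 100

# Keep only the columns that contain at least one non-missing value.
df_clean = df.dropna(axis=1, how="all")
```

Using `how="all"` only removes columns that are entirely empty, so partially filled columns survive for later inspection.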
The remaining data.
Most of the column labels make plenty of sense, even to someone who doesn't know much about the sport. The exception is "clean_sheets_away", which I don't *think* represents the counts for a player soiling their bedsheets when they're not at home (personal opinions on the physical toughness of football players aside), so I will look that up.
There are 47 columns left to work with. That includes 7 categorical columns, which will ultimately be 6 once I've dropped 'birthday_gmt' after using its data to create a column representing each player's age.
Data cleaning / creation.
Adding that 'age' column and dropping some unnecessary columns. I will set the current year to 2019 so as to be a little more conservative with the age estimation / allow some headroom for players with birthdays later in the year of 2018.
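The age calculation could look something like the sketch below. The date format of 'birthday_gmt' is an assumption (the text does not show it), and the two rows are invented purely to show the arithmetic.

```python
import pandas as pd

CURRENT_YEAR = 2019  # the reference year chosen in the text

# Toy rows; the 'YYYY/MM/DD' format of 'birthday_gmt' is an assumption.
df = pd.DataFrame({
    "full_name": ["Player A", "Player B"],
    "birthday_gmt": ["1990/03/14", "1998/11/02"],
})

# Age is taken as a simple difference in years, ignoring month and day.
birth_year = pd.to_datetime(df["birthday_gmt"], format="%Y/%m/%d").dt.year
df["age"] = CURRENT_YEAR - birth_year

# The raw birthday column is no longer needed once 'age' exists.
df = df.drop(columns=["birthday_gmt"])
```

Because only the year is used, every player is treated as if their birthday has already passed in the reference year.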
Statistical descriptions.
• The average player age is 27.
• The average minutes played overall is 1317. At home: 658.9. Away: 658.6.
• The average appearances overall is 18.3. At home: 9.18. Away: 9.19.
• The average overall goals per player is 1.82. At home: 1.0. Away: 0.8.
• The average overall assists figure is 1.30. At home: 0.71. Away: 0.58.
• The average penalty goals figure is 0.14. The average penalty misses figure is 0.03 recurring.
• The average yellow cards overall is 2.2566. The average red cards overall is 0.082.
So already there are some visible differences as far as home and away play goes. Lending some credence to the old 'home advantage' maxim.
In the categorical description:
• Danny Ward has the highest player name frequency.
• All values are from the Premier League.
• Midfielder is the most common position.
• Huddersfield Town is the most common club, and England is the most common nationality.
Analysis.
Some correlations.
• The cards per 90 overall are more correlated to the yellow cards than red cards, so more yellow were handed out than red.
• The red cards and 'cards per 90' overall are more strongly correlated to the Midfielders, as are the 'conceded overall' and 'age' values.
• Penalties are more strongly correlated to home turf than away.
Nationality distribution.
• England holds the top nationality count here with 215 counts and 37% of the overall data.
• Spain is a distant second with 35 counts.
• France is third with 31.
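Counts and percentage shares like the ones above can be produced with `value_counts`. This is only a sketch on invented rows, assuming a 'nationality' column; the numbers it yields have nothing to do with the real figures.

```python
import pandas as pd

# Toy data; the real dataset has one row per player.
df = pd.DataFrame({
    "nationality": ["England", "England", "Spain", "France", "England"],
})

# Absolute counts per nationality, sorted from most to least common.
counts = df["nationality"].value_counts()

# The same counts expressed as a percentage of all rows.
share = df["nationality"].value_counts(normalize=True).mul(100).round(1)

summary = pd.DataFrame({"count": counts, "share_pct": share})
```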
Rank in league top Attackers, Defenders & Midfielders vs. goals overall.
The size of the points represents the goal counts, the depth of hue represents the Defenders, the x-axis represents the Midfielders and the y-axis (should) represent the Attackers. Phew.
I'm interested in the size of the points that aren't in the bottom left corner, as those in the bottom left corner will be #1 (or close to) ranked players, with what one would presume to be good overall goals counts.
It seems there are some low-ranked Midfielders doing quite well here, such as #249 ranked, #250 ranked, #305 ranked scoring as well as the #4 and #16 ranked Attackers. The low-ranking Defenders are also scoring well but I think this is more down to the fact that there aren't too many high-ranked Defenders here.
Sum of current club goals home vs. goals away.
• The greatest sum of away goals clearly belong to Man. City, followed by Liverpool, Spurs and Man. United.
• The greatest sum of goals at home belong to Man. City again, Liverpool, Arsenal and Chelsea.
Disparity between home and away goals (average).
There are quite large distances between the home & away goal figures for some of the leading teams, with teams such as Man. City, Chelsea, Arsenal and Everton scoring far better at home.
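One way to compute this kind of home vs. away disparity is to average both columns per club and subtract. The sketch assumes columns named 'current_club', 'goals_home' and 'goals_away', and uses made-up clubs and values.

```python
import pandas as pd

# Toy data with hypothetical clubs; one row per player.
df = pd.DataFrame({
    "current_club": ["Club A", "Club A", "Club B", "Club B"],
    "goals_home": [6, 2, 1, 3],
    "goals_away": [2, 2, 2, 2],
})

# Average home and away goals per player, grouped by club.
club_means = df.groupby("current_club")[["goals_home", "goals_away"]].mean()

# Positive values mean the club's players score more at home than away.
club_means["home_minus_away"] = club_means["goals_home"] - club_means["goals_away"]
club_means = club_means.sort_values("home_minus_away", ascending=False)
```

The same pattern would work for the assists columns by swapping in their names.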
Sum of current club assists home vs. assists away.
• Manchester City, Liverpool, Chelsea and Arsenal top the scale for assists at home.
• Man. City, Crystal Palace, Spurs, AFC Bournemouth, Liverpool and Arsenal hold the largest counts for assists away.
Disparity between home and away assists (average).
Besides the usual leading teams, Wolverhampton Wanderers, Watford and Leicester have made their way up this chart by making fewer away assists than home assists.
Current club by penalty goals & position.
From a quick glance of the chart it seems the Midfielders are responsible for the majority of the penalty goals, so I will include a table for those metrics. The Midfielders scored 43 penalties and the Forwards weren't too far off that mark with 41 penalties scored. Cool.
• Crystal Palace tips the scale for Midfielders here, followed by Man. Utd., West Ham and Liverpool.
• The Forwards for AFC Bournemouth scored the most penalties, followed by Arsenal, Spurs and Brighton.
Current club by penalty misses & position.
• The Midfielders for Everton, Man. Utd. and Newcastle Utd. were responsible for the most penalty misses.
• The Forwards for AFC Bournemouth missed 2 penalties where the rest of the Forwards' misses amount to one per team in this chart.
Penalties hit & missed ratio.
• Of three penalties taken for Newcastle Utd., one was scored.
• Following Newcastle Utd. are Everton, Fulham, Leicester and Man. Utd. (with Cardiff and Man. City equaling Manchester United's figure of 0.3r).
• Wolverhampton, West Ham and Watford actually did pretty well here.
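A hedged sketch of a per-club penalty ratio, taking "ratio" to mean penalties scored divided by penalties taken. The exact definition used for the chart is not stated, so that reading is an assumption, as are the column names 'penalty_goals' and 'penalty_misses' and the toy values.

```python
import pandas as pd

# Toy data with hypothetical clubs; one row per player.
df = pd.DataFrame({
    "current_club": ["Club A", "Club A", "Club B", "Club C"],
    "penalty_goals": [1, 0, 3, 0],
    "penalty_misses": [1, 1, 1, 0],
})

# Total penalties scored and missed per club.
pens = df.groupby("current_club")[["penalty_goals", "penalty_misses"]].sum()
pens["taken"] = pens["penalty_goals"] + pens["penalty_misses"]

# Clubs that took no penalties are dropped to avoid dividing by zero.
pens = pens[pens["taken"] > 0].copy()
pens["conversion"] = pens["penalty_goals"] / pens["taken"]
```

With this definition, one scored penalty out of three taken gives a conversion of one third.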
The names of the players with the best penalty hit / miss ratio for this season:
Appearances and goals home & away.
Some basic information here outlining the players' performance home vs. away, with the same amount of appearances for each, but a notable decline in the away goal count.
Average goals vs. min per assist overall.
There are flat spots here, specifically between the 6 and 10 goal values, and again between the 14 and 20 goal values. Above the 20 goal count though, the minutes per assist value skyrockets. So if we look at the run-up to the peaks around the 5, 13 and 22 goal mark, the case could be made that in general, assists do equate to a good outcome.
I will assume the shorter assist time is shared between the Attackers...
Average goals vs. min per assist overall per position for the 0-5 overall goal range.
As expected, the forwards are putting in the shortest assist time and accruing the largest overall goal count to boot (no pun intended), and the Midfielders are putting in average assist times / less goal counts, which I suppose is the nature of footie. There is plenty of action happening here so this assist range will be highly important for the end model.
Average goals vs. min per assist overall per position for the > 20 overall goal range.
This was all the work of the Forwards. As I'm sure there won't be too many Forwards with this level of success, I can get away with adding the club name to see who these Forwards play for. This returns Liverpool and Arsenal.
Positions in the dataset.
Position per nationality included in the data.
There are some nationalities with only one or two players which result in one or two positions, such as Ecuador, whose sole player in the dataset is a Defender. Based on that finding, I will look at some of the nationalities with all four positions present.
A tree map of the nationalities having a unique position count of four (all positions present).
• The data shows the Midfielder having the highest proportional position count for England, Belgium, Brazil, Germany, and Portugal.
• The ROI, Netherlands, Denmark and Wales have the highest proportional Defender counts.
• Spain, Argentina and Denmark's Defender value is equal to their Midfielder value for their highest position count.
• Interestingly, Belgium and Brazil's Forwards have a higher count than Defenders. The best sort of defence.....
• Denmark has more goalkeepers than Forwards.
Average age per position, per nationality.
It looks as though goalies are the players with the highest age judging by a quick check of the table above, so I will list the average age per position and see what's up with that.
It does appear that goalkeepers are the eldest overall, and midfielders have the lowest average age. From a tactical efficiency POV this makes quite a bit of sense because you'd want your most experienced in goal / defence, plus there isn't a huge amount of physical exertion involved in that role when compared to that of a Midfielder. That might not be the case though; as I said, I'm useless with football, just good with strategy.
But with an average age of +4 for goalies on the average Midfielder age, that's a lot of experience in sports terms.
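The per-position average can be listed with a simple group-by. This is a sketch on invented ages, assuming columns named 'position' and 'age'; the toy numbers are only picked so that the gap is easy to read.

```python
import pandas as pd

# Toy data; ages are made up for illustration.
df = pd.DataFrame({
    "position": ["Goalkeeper", "Goalkeeper", "Midfielder", "Midfielder", "Forward"],
    "age": [32, 30, 26, 28, 29],
})

# Mean age per position, oldest first.
age_by_pos = df.groupby("position")["age"].mean().sort_values(ascending=False)

# Gap between the average Goalkeeper and the average Midfielder.
gk_mid_gap = age_by_pos["Goalkeeper"] - age_by_pos["Midfielder"]
```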
Average age by overall goal count.
I read earlier that recent research showed footballers of a higher age are slightly more productive than normally assumed. There will be a really obvious reason for this and I've no idea why it's taken 50+ years for someone to figure out -.- so I will group the age by some other metrics, because we have already seen the Forwards having a higher average age than the Midfielders, plus the Forwards will be scoring the most goals overall... I think. Let's not forget the whole "often(correlation != causation)" issue here; we can only do some educated logical reasoning.
That being said, the Forwards have scored the most goals overall (almost 3x that of the Midfielders, to be expected) and they do have a slightly higher average age than the Midfielders.
Average age by overall assist count.
And to be fair to the goalkeepers, I know they have scored no goals but I would like to search through some other variables to see where they may have helped out, and 'assists' strikes me as a logical consideration.
Although the assist value is nowhere near as high as that of the forwards et. al., the goalies are making some assists. There is no doubt that the lower-aged Midfielders will be putting in an insane amount of effort running around the pitch, but if we were to look at the goals saved, assists, the average age by distinct count of position favoured per nationality, as well as some other information not included here, we could begin to see how the productivity by average age would climb higher than what 'social norms' would have us assume.
Average age by min per goal overall.
Another helping hand for the Forwards of slightly higher average age: They are closer to the opponent's goal so they hold a much lower time figure for offensive manoeuvres. Based on that, these figures will raise per position distance from the opponent's goal, resulting in the home defender's min per goal figure being the largest overall. If we're assessing this from a productivity / efficiency POV then we would also have to consider the fact that the Forwards are always going to score higher than the other positions in this respect, pushing the productivity vs. average age up a bit without too much interference, thanks to the nature of that position.
I thought another telltale piece of info to back up the other research article would be to introduce the red card count per position per age. Can't be very productive if you're sitting on the bench after all! The lower-aged Midfielders are responsible for the most red cards here. But let's not also forget that the Midfield is the position with the largest player count for some of the most common countries.
Whether or not these attributes were analysed by the other research team is unknown, but it's good to see that there are visible patterns to back up their findings in this single dataset, let alone a fully compiled dataset with hundreds of columns. Many analyses I've seen over the years don't get down to the nuts & bolts of a dataset and favour a more big-picture generalisation of a subject, which still happens to this day with the best of data analysts / scientists, so a little extra situation-specific consideration goes a long way in this day & age if you don't happen to be an SME (like me, who really does suck at football info).
Average age per rank in club top scorer and goals overall.
• Of the #1 top scoring players, the majority consists of Forwards with an average age of 27.8 and an average goal count of 14.13. Additionally, within that cluster are the Midfielders with an average age of 26.4 and an average goal count of 10.6.
• In the #2 top scoring bar we see the introduction of Defenders assisting in the goal count. The Forwards here are made up of an average age of 27.6 and have an average overall goal count of 12.5. Its Midfielders see an average age of 27 and they have an average overall goal count of 8.37. Then the Defenders come in with an average age of 28 and a goal count of 4.
• From there, we see a stable run of Forwards, Midfielders & Defenders scoring well until rank 9, and then we witness a jump up to 110 in the combined position age from rank 13 onwards.
• We have to visit the number 4 and number 6 ranked players to see a slightly younger Forward average age vs. the Midfielder average age.
So if goals overall are the order of the day, more experienced (read: 'slightly older than school age') Forwards and younger Midfielders seems to be quite a good recipe for success.
Average rank per position.
The Defenders have the highest average rank in club followed by Midfielders and Forwards.
Average age per goals home, goals away and position.
I will focus on the eye-catching deep blue hues representing the count of goals away first. As seen earlier, a good home player doesn't always equate to a good away player, but they're quite close, and the age aspect here:
The three highest-scoring away players:
• Forwards with an average age of 28.5, Forwards with an average age of 28.6, Forwards with an average age of 31.
The three highest-scoring home players:
• Forwards with an average age of 27, Forwards with an average age of 31, Forwards with an average age of 28.5.
This means, 2 out of 3 times in the highest-performing groups, Forwards of a higher average age are performing better away than they are at home.
Maybe three more to see if there's anything else to back this up...
• The 12 home goals cluster includes 5 away goals and Forwards of an average age of 25.
• The 11 home goals cluster includes 5 away goals and an average age of 28 including Midfielders and Forwards.
So for the 12 and 11 home goals clusters, the youngest average age group consists purely of Forwards and holds the highest count for home goals, which goes some way to back up the above findings. From there, the average age climbs to 28, we see one less home goal and an equal amount of away goals, so the older players are indeed holding their own away while not doing as well as their younger team mates at home. A younger player's propensity towards making a name for themselves on home turf vs. the elder's experience abroad, perhaps(?).
Age by average appearances and minutes played overall.
These figures are slightly more obvious, with the peak ages for football (26-30) making the most appearances and putting in the most minutes played overall.
There is a mild 'outlier' here at the age 39 mark, with a Defender and Goalie being brought in for an average of 26 appearances between them.
The Goalkeepers' values will overshadow the rest due to the smaller Goalkeeper count; sharing a similar amount of game time as the rest of the team but split between 1-3 Goalies (as opposed to 2x-3x that amount of players in other positions) will result in a thicker bar, so it would be wise to ignore the size of the green bar in this case. Still quite impressive to witness a 34 year old Goalkeeper having a very similar amount of appearances / minutes played as a 25 year old Goalkeeper though.
Age by average goals per 90 minutes overall and position.
Here we should see mostly Forwards in the upper age bracket, if past evidence is anything to go by. Besides the 20 year old whippersnapper cooking up a storm with around +50% more goals per 90 minutes than the next best performer. Who is this wizard...
That looks like an 'Edward Nketiah' playing for Arsenal. Well done to the talent scout who discovered that chap!
True to form, not including 'Sir Edward of Arsenal', the majority groups here belong to the players aged 26 and above, with the deeper blue hues representing the more average & below average aged Midfielders.
Again, the older Defender is visible here, let's check him out... This is a Defender for Everton by the name of 'Phil Jagielka'. Looking at the chart above, he's put in 23% of the next youngest Defender's field time and made 41% of that Defender's appearances which doesn't look too great until you add the goals overall feature. He's made very good use of his time on-pitch.
Conceded home & away per position.
Age by average of conceded home and average of goals overall.
The positions and ages conceding the most at home, in order, are (most likely to be Defenders):
• 30, 31, 26 and 28 year old Defenders.
The positions and ages conceding the least at home, in order, are:
• 37, 19, 21 year old Defenders and a 20 year old Midfielder.
Those with a good blend of conceded and goals scored are:
• 27, 30, 31 and 32 year old Forwards.
Let's see who those Forwards are. I'm looking for the age groups of 27, 30, 31 and 32, with a conceded_home value of less than 10, as well as a goals_overall count greater than 5.
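The search described here translates into a boolean mask. The sketch below assumes columns named 'position', 'age', 'conceded_home' and 'goals_overall' plus a 'full_name' column for display; the four players are placeholders, not real results.

```python
import pandas as pd

# Toy data with placeholder players.
df = pd.DataFrame({
    "full_name": ["Player A", "Player B", "Player C", "Player D"],
    "position": ["Forward", "Forward", "Defender", "Forward"],
    "age": [27, 30, 31, 25],
    "conceded_home": [8, 12, 5, 4],
    "goals_overall": [9, 15, 6, 11],
})

# Forwards aged 27, 30, 31 or 32 with fewer than 10 conceded at home
# and more than 5 goals overall.
mask = (
    (df["position"] == "Forward")
    & df["age"].isin([27, 30, 31, 32])
    & (df["conceded_home"] < 10)
    & (df["goals_overall"] > 5)
)
selected = df.loc[mask, ["full_name", "age", "conceded_home", "goals_overall"]]
```

Each condition sits in its own parentheses because `&` binds more tightly than the comparison operators in Python.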
And here they are:
Age by average of conceded away and average of goals overall.
As the overall performance away has proven to be slightly lower than the home performance, the average 'conceded_away' values are likely to be higher than those above and I will raise the conceded_away value by 5 when searching for the players.
We see similar patterns to the above chart here even if those values are slightly higher, and the most efficient Forwards (keeping the concession rate equal to / below 20) are of a slightly lower age overall.
Those players visualised.
It seems a 'Sergio Aguero' is as good at home as he is away which, judging by this data, is no mean feat. The next most balanced players at home are 'Raheem Sterling' and a 'Harry Kane'.
For the away players, we see 'Sadio Mané', 'Pierre-Emerick Aubameyang' and 'Mohamed Salah' (a name I've actually heard of in the past) as the heaviest-hitters. It would also be worth mentioning 'Jamie Vardy' for his impressive sum of 18 away goals as well. 'Heung-Min Son' and 'Rondon' make up the remaining two players with respectable away goal sums (over ten goals).
Position by average of red & yellow cards overall and min per match.
My initial thought was that the longer the player is on the pitch, the greater the likelihood of getting a red or yellow card. That is almost true until we get to the Midfielders, who I would have to assume are either the more aggressive of the two positions (Defender & Midfielder) and therefore susceptible to more fouls, or who see a lot more action in their part of the pitch; if the majority of the possession belongs to Midfielders for the longest duration, the probability of a foul happening will increase.
The Defenders seem to have the highest min per match but a slightly lower red and yellow card average than the Midfielders, possibly due to seeing a little less action than the Midfielders. This depends on the strategy involved, but I would hazard a guess that a 4-4-2 formation has been used more often than not if the Midfielders' stats are similar to the Defenders'.
I had a quick check online and a few sports articles did say that, "The 4-4-2 has returned in 2018", but I wouldn't like to say that's 100% the case, because data.
Average min per card overall by club.
• Cardiff City, Arsenal and Leicester have the shortest min per card figures.
• Liverpool, Man. City and Newcastle have the largest.
It's quite a thing to have a large min per card figure so I'll check the positions of the players with the largest figures for this data. I'll assume that they're out of the way of the drama most of the time.
All Goalkeepers....
Minutes played home & away vs. overall goals.
• The away games saw more goals in the earlier minute range and a few good peaks between the 1300-1600 minute range.
• The most effective home minute range being somewhere between 1050-1600.
Average minutes per goal by position.
And the Forwards have the lowest min per goal figure followed by Midfielders and Defenders (pitch location will help).
Displaying the top three players for each club by average min per goal overall.
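A top-three-per-club listing can be sketched with a sort followed by a grouped `head`. The column name 'min_per_goal_overall' is an assumption, as is the choice to treat a value of 0 as "has not scored" and exclude it; the players and clubs are placeholders.

```python
import pandas as pd

# Toy data with placeholder players and clubs.
df = pd.DataFrame({
    "full_name": ["P1", "P2", "P3", "P4", "P5", "P6", "P7"],
    "current_club": ["Club A"] * 5 + ["Club B"] * 2,
    "min_per_goal_overall": [120, 300, 0, 95, 150, 210, 180],
})

# Assumption: a value of 0 marks a player with no goals, so it is filtered out.
scorers = df[df["min_per_goal_overall"] > 0]

# Sort so the fastest scorers come first, then keep three rows per club.
top3 = (
    scorers.sort_values("min_per_goal_overall")
    .groupby("current_club")
    .head(3)
    .sort_values(["current_club", "min_per_goal_overall"])
)
```

Lower minutes per goal is better here, which is why the sort is ascending.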
Modeling.
Easily discernible patterns here so this'll be no sweat for LGBM with no params. Setting the "goals_overall" feature as the target (dependent) variable, dropping the "current club" and "nationality" columns among others, and for dimensionality's sake, the "minutes_played_overall" column too.
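The feature / target split described here could be sketched as follows. Only the three dropped columns named in the text are included (the "among others" are unknown), the spelling 'current_club' is an assumption, and the rows are invented. Fitting the model itself is left out of the runnable part.

```python
import pandas as pd

# Toy data; only a handful of hypothetical columns are shown.
df = pd.DataFrame({
    "current_club": ["Club A", "Club B", "Club A"],
    "nationality": ["England", "Spain", "France"],
    "minutes_played_overall": [900, 1800, 450],
    "assists_overall": [2, 5, 0],
    "goals_home": [1, 6, 0],
    "goals_overall": [3, 10, 0],
})

TARGET = "goals_overall"
# Columns the text says were dropped; the full list is not given.
DROP_COLS = ["current_club", "nationality", "minutes_played_overall"]

y = df[TARGET]
X = df.drop(columns=DROP_COLS + [TARGET])

# X and y could then go to a default LightGBM estimator, for example
# lightgbm.LGBMRegressor().fit(X, y); whether a regressor or a
# classifier was used is not stated in the text.
```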
SHAP deep explainer for the LightGBM model.
• With 'goals_overall' as the target variable we see the resulting chart return a lot of previously uprooted info.
• Goals home featuring slightly higher than goals away.
• Low values for min per goal overall as expected, the best Forwards having the shortest attack time. Although with the hue being purple in colour, the Midfielders' min per goal values have also been deemed quite important.
• And less obviously (to me at least), the rank in league top attackers' low values bear the most importance. The EDA showed the average rank for the Forwards was around 8, with Midfielders hovering around the 11 rank average. Additionally, we saw that the young'uns made up the meat of the Forward players so these are highly likely to be up-and-comers, making their way further back in the field as they age.
• High minutes per match values are important features: the longer a player spends on the pitch per match, the more chance of a goal.
• Assists home: high values are more important at home, but assists overall aren't too important.
• High-ranked Midfielders are of importance.
• With low and high penalty goals bearing positive impact, it's a case of penalties having a higher probability of hitting home than if the goal were to be attempted in normal game time. Going to penalties is either a good thing or a bad thing depending on how you look at it ( / how accurate your players are)!
So.
Nothing out of the ordinary appeared in the model that couldn't be found in the EDA, all elementary insights that make complete sense to a football noob such as me. If you were looking for any genius recommendations then I'm afraid this isn't the time or place but I learned a bit and had fun doing it :o)