Expected goals (xG) has become the most widely used metric in football analytics over the last decade. In short, xG models, such as the one developed by American Soccer Analysis, estimate how many goals a team should have scored based on the characteristics of the shots it has taken. Penalties provide the easiest example: roughly 76% of penalties are converted, so each penalty has an xG value of 0.76. Many articles have shown that expected goals are preferable to actual goals when evaluating performance: because goals are rare events, xG is generally a much more stable metric.
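To make the idea concrete, here is a minimal sketch of how per-shot xG values add up to a team total. Only the penalty value (0.76) comes from the text above; the other shots and their xG values are invented for illustration and do not come from any real model.

```python
# Hypothetical shots for one team in one match. The per-shot xG values are
# made up, except the penalty, which uses the ~0.76 conversion rate above.
shots = [
    {"type": "penalty", "xg": 0.76},
    {"type": "header", "xg": 0.08},
    {"type": "long_range", "xg": 0.03},
    {"type": "one_on_one", "xg": 0.35},
]


def total_xg(shots):
    """Sum per-shot scoring probabilities into a team-level xG total."""
    return sum(shot["xg"] for shot in shots)


team_xg = total_xg(shots)
print(f"Team xG: {team_xg:.2f}")  # prints "Team xG: 1.22"
```

A team that scored three goals from these four shots would have outperformed its xG, which is exactly the kind of result that tends not to repeat.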
One of the limitations of xG is that the models do not take into account the player taking the shot. Not all players are created equal: a shot taken by Hany Mukhtar, the 2022 MLS MVP, is more likely to end up in the goal than a shot taken by, let’s say, Nouhou Tolo, the legendary Sounders defender who isn’t particularly known for his finishing skills. However, a model that only takes into account a player’s past xG values will predict the same number of future goals for Mukhtar and Tolo in the unlikely but not impossible case that both have produced similar shots in a given sample. And while it has been shown that finishing is not necessarily a repeatable skill, it’s safe to say that if we want to predict future goals, adding a measure of shooting ability seems like a good idea.
Enter PFF Shooting Grades. The PFF player grading system evaluates every player for every event during a game, including every shot. Each player is given a grade from -2.0 to +2.0 in 0.5 increments on a given play, with 0 generally being the average or “expected” grade. The plus-minus grades, which involve manual scoring of each event by a human, are transformed into a 0-100 scale at the game and season level to make them more intuitive. For a detailed breakdown of how PFF player grades are calculated, read this article. The 2021 and 2022 leaderboards for shooting are given below. Mukhtar is the only player in the top 10 for both years, and he currently ranks 12th for the ongoing 2023 season.
2021 season
|Rank||Player||Team||PFF Shooting Grade|
|1||Damir Kreilach||Real Salt Lake||91.9|
|2||Ricardo Pepi||FC Dallas||90.9|
|3||Felipe Mora||Portland Timbers||90.6|
|4||Ola Kamara||D.C. United||90.3|
|5||Hany Mukhtar||Nashville SC||90.1|
|6||Johnny Russell||Sporting Kansas City||89.0|
|8||Daniel Salloi||Sporting Kansas City||87.6|
|9||Diego Fagúndez||Austin FC||86.6|
|10||Daryl Dike||Orlando City SC||86.3|
2022 season
|Rank||Player||Team||PFF Shooting Grade|
|1||Hany Mukhtar||Nashville SC||99.2|
|2||Lewis Morgan||New York Red Bulls||91.5|
|3||Daniel Gazdag||Philadelphia Union||90.2|
|4||Dejan Joveljic||LA Galaxy||88.5|
|5||Jesús Ferreira||FC Dallas||88.0|
|6||Sebastián Driussi||Austin FC||87.7|
|7||Thiago Almada||Atlanta United||87.4|
|8||Gonzalo Higuaín||Inter Miami CF||87.1|
|9||Carlos Vela||Los Angeles FC||86.7|
|10||Mikael Uhre||Philadelphia Union||85.5|
So let’s try to build a model. Before we start, we are not expecting miracles here. Predicting football games is harder than predicting the stock market, so we are bound to find out that we are still missing a lot of information that influences future goalscoring.
We’ll start by using stats from the previous 5 games to predict the number of goals scored in the next 5 games. Goalkeepers are excluded from the dataset, obviously. We combine data from ASA, which is easily retrieved through their API, with PFF data and focus our attention on the 2021 and 2022 seasons in MLS. The idea is simple: future goalscoring depends both on being able to get yourself into scoring positions (measured by xG) and on being able to shoot the ball (measured by PFF Shooting Grades).
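The "previous 5 games versus next 5 games" setup can be sketched as follows. This is an illustration under assumptions, not the pipeline used for this article: the function name, the dictionary keys, and the choice of overlapping windows that slide one game at a time are all hypothetical, and it assumes a shooting grade exists for every game.

```python
def build_windows(games, window=5):
    """Pair each block of `window` past games with the following `window` games.

    `games` is a chronologically ordered list of dicts for one player, with
    hypothetical keys "minutes", "goals", "xg" and "pff_shooting".
    """
    rows = []
    for i in range(window, len(games) - window + 1):
        past = games[i - window:i]
        future = games[i:i + window]
        minutes = sum(g["minutes"] for g in past)
        if minutes == 0:
            continue  # no per-90 rate can be computed without playing time
        rows.append({
            "goals_per_90": 90 * sum(g["goals"] for g in past) / minutes,
            "xg_per_90": 90 * sum(g["xg"] for g in past) / minutes,
            "avg_pff_shooting": sum(g["pff_shooting"] for g in past) / window,
            "sum_minutes": minutes,
            "future_goals": sum(g["goals"] for g in future),
        })
    return rows


# Ten made-up games: the player scores in the first and the eighth game.
example_games = [
    {"minutes": 90, "goals": 1 if i in (0, 7) else 0, "xg": 0.2, "pff_shooting": 60.0}
    for i in range(10)
]
example_rows = build_windows(example_games)
print(example_rows[0])
```

With ten games there is exactly one usable observation (games 1-5 as features, games 6-10 as the target); every extra game adds one more row.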
First we need to make sure that xG and PFF Shooting Grades are not measuring the same thing. In the plot below, we see a weak linear relationship between the sum of xG and the average PFF Shooting Grade over the previous 5 games. However, the LOWESS curve suggests that outliers are playing a role: for xG totals north of 1.0 there seems to be no relationship between the two, while there is still plenty of variation left in PFF Shooting Grades for xG totals below 1.0. This implies that the two are largely independent measures. But how much does each contribute to goalscoring?
Now that we’ve gotten that out of the way, let’s look at some simple OLS models. The table below shows the R-squared, F-statistic and Akaike Information Criterion (AIC) for several model specifications. The results confirm that a model with xG per 90 outperforms one with goals per 90 as a single explanatory variable. They also show that xG per 90 is a better predictor of future goals than the PFF Shooting Grades alone. However, combining variables is the way to go: combining xG per 90 with the PFF Shooting Grades improves the performance of the model, while adding goals per 90 as a third explanatory variable brings little extra to the table.
|Model||R-squared||F-statistic||AIC|
|Goals per 90||0.053||1874.9||75011|
|xG per 90||0.073||2673.8||74262|
|Avg. PFF Shooting Grades||0.059||2130.2||74769|
|Goals per 90 & xG per 90||0.080||1475.7||74008|
|Goals per 90 & Avg. PFF Shooting Grades||0.075||1371.4||74200|
|xG per 90 & Avg. PFF Shooting Grades||0.102||1926.1||73189|
|All 3 variables||0.103||1287.2||73183|
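For readers who want to see where the three fit statistics come from, here is a minimal single-predictor sketch in plain Python. It is not the code behind the table: the models above were presumably fitted with a statistics package, several of them use more than one explanatory variable, and the AIC convention used here (Gaussian log-likelihood, with a penalty of 2 for the slope and 2 for the intercept) is an assumption. The toy data are invented.

```python
import math


def ols_fit_stats(x, y):
    """Fit y = a + b*x by least squares and return R-squared, F-statistic, AIC."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot

    k = 1  # number of explanatory variables, excluding the intercept
    f_stat = (r_squared / k) / ((1 - r_squared) / (n - k - 1))

    # Gaussian log-likelihood at the fitted values; the AIC adds a penalty
    # of 2 per estimated coefficient (slope and intercept).
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(ss_res / n) + 1)
    aic = -2 * log_lik + 2 * (k + 1)
    return {"slope": slope, "intercept": intercept,
            "r_squared": r_squared, "f_stat": f_stat, "aic": aic}


# Toy data, invented for illustration only.
toy_stats = ols_fit_stats([0, 1, 2, 3, 4], [0, 1, 1, 3, 3])
print(toy_stats)
```

A higher R-squared and F-statistic and a lower AIC all point to a better fit, which is how the rows of the table are compared.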
If we take the last model and multiply the coefficient of each variable by its standard deviation in our sample, we get an absolute impact value, representing the variable's contribution to the prediction. Dividing each impact value by the sum of all impacts then gives a measure of each variable's importance relative to the others. In our case, xG per 90 has the highest impact (51.9%), followed by PFF Shooting Grades (42.9%) and goals per 90 (5.2%).
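The calculation described above is short enough to write out. The sketch below is an assumption-laden illustration: the function name is hypothetical, the coefficients and samples are invented rather than taken from the fitted model, and it uses the sample standard deviation (the text does not say whether the sample or population version was used).

```python
import statistics


def relative_impact(coefficients, samples):
    """Scale each coefficient by its variable's standard deviation, then normalise."""
    impacts = {
        name: abs(coef) * statistics.stdev(samples[name])
        for name, coef in coefficients.items()
    }
    total = sum(impacts.values())
    return {name: impact / total for name, impact in impacts.items()}


# Invented numbers, NOT the coefficients of the model in this article.
demo_coefficients = {"xg_per_90": 1.0, "avg_pff_shooting": 1.0}
demo_samples = {
    "xg_per_90": [0.0, 2.0],
    "avg_pff_shooting": [0.0, 4.0],
}
demo_shares = relative_impact(demo_coefficients, demo_samples)
print(demo_shares)
```

Because both demo coefficients are equal and the second variable varies twice as much, it receives two thirds of the total impact. Scaling by the standard deviation is what makes variables on very different scales (per-90 rates versus 0-100 grades) comparable.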
As a side note, changing the number of games (using the previous X games to predict the number of goals scored in the next X games) has the expected effect: including more games leads to better predictions at first. However, the AIC increase from 5 games to 10 games indicates that including an additional 5 games does not significantly improve the model's fit to the data, and that the extra complexity outweighs any potential benefits. Below are OLS outputs for models that include all three variables:
|No. of games included||Observations||R-squared||F-statistic||AIC|
But our dataset doesn’t necessarily lend itself to OLS. Not only is our dependent variable a discrete count, our sample is also flooded with players taking a very limited number of shots. Knowing this, a Random Forest model is a better choice. We include goals per 90, xG per 90 and the average PFF Shooting Grade over the past 5 games to predict the number of goals a player will score in the next 5. This leads to an R-squared value of 0.115, which easily beats the values obtained when considering each variable individually.
Our Random Forest model also gives us an opportunity to compare feature importances. Again, xG per 90 wins the race (50.1%), although the split between PFF Shooting Grades (30.2%) and goals per 90 (19.7%) differs somewhat from what we saw earlier. The feature importances vary by position: the number of predicted future goals depends more on previous xG per 90 for defenders and midfielders, while actual goals per 90 come into play more for forwards. However, the importance of shooting ability is relatively stable across all positions.
|Goals per-90||xG per-90||PFF shooting Grades|
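As a rough illustration of the Random Forest step, the sketch below fits scikit-learn's `RandomForestRegressor` on synthetic data and reads off the feature importances. Everything about the data is an assumption: the features are random stand-ins for the three variables rather than ASA or PFF data, the hyperparameters are arbitrary, and the R-squared is computed on a held-out split, which may differ from how the figure in this article was obtained.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for the three features (NOT real ASA or PFF data).
xg_per_90 = rng.gamma(shape=2.0, scale=0.1, size=n)
pff_shooting = np.clip(rng.normal(60, 10, size=n), 0, 100)
goals_per_90 = rng.poisson(xg_per_90 * 5) / 5

# Made-up target: future goals depend on chance quality and shooting grade.
future_goals = rng.poisson(5 * xg_per_90 * pff_shooting / 60)

feature_names = ["goals_per_90", "xg_per_90", "avg_pff_shooting"]
X = np.column_stack([goals_per_90, xg_per_90, pff_shooting])
X_train, X_test, y_train, y_test = train_test_split(
    X, future_goals, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=200, min_samples_leaf=20, random_state=0)
model.fit(X_train, y_train)

r_squared = model.score(X_test, y_test)  # R-squared on the held-out split
importances = dict(zip(feature_names, model.feature_importances_))

print(f"R-squared: {r_squared:.3f}")
for name, share in importances.items():
    print(f"{name}: {share:.1%}")
```

The importances always sum to 100%, so they are read as shares, just like the percentages quoted above. Fitting one model per position group would give the per-position comparison.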
So both OLS and Random Forest confirm what we were expecting: predictions of future goalscoring are improved by adding information about shooting ability that isn’t captured in xG models. This makes a ton of sense and is in line with Ben Torvaney’s findings here, with the key difference that he did not have the PFF Shooting Grades at his disposal. Obviously, the predictions can be improved by including other relevant variables, for example team strength, or by using more sophisticated methods. But we found what we were looking for: evidence that adding a measure of shooting ability improves our predictions of future goalscoring.
Lastly, if we want to use our Random Forest model to predict who will score the most goals in the next 5 games, it is important to add some features that we know will improve our predictions. For example, if we don’t include playing time during the previous 5 games as a feature, subs who happened to be successful in limited playing time will get too much credit for their inflated per-90 stats. Furthermore, we know positions play a role, so let’s add in the primary position of a player as well. The R-squared increases to 0.272, which is nice, and the feature importances are as expected (strikers are the most important when it comes to goalscoring):
|Feature||Importance|
|xG per 90||35.6%|
|Sum of minutes||25.0%|
|Avg. PFF Shooting Grades||20.9%|
|Goals per 90||11.2%|
So without further ado, we present the players who are predicted to score the most in the next 5 games:
|Player||Team||Predicted goals (next 5 games)|
|Ryan Gauld||Vancouver Whitecaps||2.47|
|Teal Bunbury||Nashville SC||2.32|
|Carles Gil||New England Revolution||2.30|
|Julián Carranza||Philadelphia Union||2.08|
|Cucho Hernández||Columbus Crew||1.97|
|Alan Pulido||Sporting Kansas City||1.92|
|Josef Martínez||Inter Miami CF||1.82|
|Brandon Vázquez||FC Cincinnati||1.81|
|Daniel Gazdag||Philadelphia Union||1.77|
|Christian Benteke||D.C. United||1.75|
|C.J. Sapong||Toronto FC||1.67|
|Thiago Almada||Atlanta United||1.65|
|Luciano Acosta||FC Cincinnati||1.64|
|Hany Mukhtar||Nashville SC||1.62|