How good are our xG models?

Expected goals are a difficult metric. Apart from the huge amount of work it takes to create an xG-model, once you’ve got one it’s hard to tell if it’s any good. Most people check this by seeing whether their xG totals for entire seasons are similar to the actual goal totals, mostly using R². In my latest blog I tried to explain why I think this is a very poor way to evaluate your xG-model. I also hinted at a better way to evaluate an xG-model, something I will explain and apply today. First I’ll explain my methodology and afterwards I’ll apply it to evaluate different xG-models, including a few of the most prominent ones in the community like Michael Caley’s and 11tegen11’s model. Are those models really better than other models? And how close to being perfect are they? If you’re only interested in the results, please skip ahead to the results section.


Let’s start with the methodology. One of my criticisms of the full-season R² plots was that a lot of information is lost by summing the entire season together. Therefore I decided to look at single match totals. As far as I know this is the smallest sample at which we look at xG-values, and by looking at single matches rather than seasons we have a lot more data points. The method I will be using to compare the xG-scores to the actual scores is the root-mean-square error percentage (RMSEP). The standard root-mean-square error (RMSE) is a very simple statistic that measures the differences between your predictions and the actual outcomes. The root-mean-square error percentage (also known as the coefficient of variation of the RMSE) is a normalized version of this, which I’ll be using so it’s possible to compare the models with my ‘perfect’ model, a theoretical model which I explained in my last blog. The exact formula for the root-mean-square error percentage is:
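In its standard form, as the coefficient of variation of the RMSE, the metric is the RMSE divided by the mean of the observed values. The notation here is assumed for illustration: x_i is a team's xG total in a match, g_i the goals it actually scored, and n the number of such team-match totals.

```latex
\mathrm{RMSEP}
  = \frac{\sqrt{\dfrac{1}{n}\sum_{i=1}^{n}\left(x_i - g_i\right)^{2}}}{\bar{g}},
\qquad
\bar{g} = \frac{1}{n}\sum_{i=1}^{n} g_i
```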


There might be better or more suitable metrics out there but I think this is still reasonably easy and understandable/reproducible, and it’s a theoretically sound metric.

However, the exact value of this metric alone might not give much insight as it’s quite technical. So we’re going to need something to compare the results with. For this purpose, I decided to include two extra ‘models’ in my evaluation, which try to describe the upper and lower bound of performances. For the lower bound (lower is better) I’ll be using the ‘perfect’ model described in my latest blog. As an upper bound I’ll be using the model explained in an infamous Deadspin article, which assigns every shot an xG-value of exactly 0.095. The idea behind this is that if an xG-model can’t do better than that then really what’s the point of making one, so it creates a nice upper bound.
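As a rough sketch of how the metric and the upper bound could be computed: the function names and the six team-match rows below are invented for illustration, and the normalization assumes the standard coefficient-of-variation form (RMSE divided by the mean of the actual goals).

```python
import math

def rmsep(predicted, actual):
    """RMSE divided by the mean of the actual values (CV of the RMSE)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    return math.sqrt(mse) / (sum(actual) / len(actual))

def deadspin_totals(shot_counts, xg_per_shot=0.095):
    """Upper-bound 'model': every shot gets the same flat xG value."""
    return [n * xg_per_shot for n in shot_counts]

# Hypothetical team-match data, made up purely for illustration.
goals = [2, 0, 1, 3, 1, 0]                  # actual goals per team per match
shots = [14, 9, 11, 20, 7, 12]              # shots taken in those matches
model_xg = [1.6, 0.7, 1.2, 2.4, 0.5, 0.9]   # xG totals from some model

print("model   :", round(rmsep(model_xg, goals), 3))
print("deadspin:", round(rmsep(deadspin_totals(shots), goals), 3))
```

Lower is better, so a useful model should land below the flat-value baseline on the same set of matches.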

The results are in…

First a short introduction into the models that will be used in this evaluation:

  1. Nils Mackay, my own model. The start of the methodology can be found here, although I have greatly improved it since.
  2. Michael Caley (@MC_of_A). Numbers taken from the xG-plots on his Twitter account. Methodology here.
  3. 11tegen11 (@11tegen11). Methodology here.
  4. FootballStatistics (@stats4footy), no methodology available.
  5. @SteMc74. Methodology here.
  6. Willy Banjo (@bertinbertin). Methodology here.
  7. SciSports, a Dutch start-up company. (@SciSportsNL). Methodology online soon.
  8. Ben (@Torvaney). Ben uses a model that only uses x,y location and whether it was a header or not. You can create your own numbers using his model here.

Now for the results:


And the winner is…. Michael Caley!

What we see above is the RMSEP for every model (the white dot) for the set of games I used (the first 260 games in this year’s Premier League). At the top you see the ‘perfect’ model, which is a boxplot of 200 simulations I did. Basically, due to variation, a simulation can be relatively more in line with the xG-values, or less. If actual scores are ‘luckier’ or ‘less expected’ then the RMSEP becomes higher, and vice versa. What we can see is that (over a sample of 260 games) the RMSEP of a ‘perfect’ model varies about 0.08 in both directions. To kind of illustrate this ‘confidence interval’ I added the blue lines for all other models, which are basically just the observed value (white dot) plus or minus 0.08. Mind you, these are not actual ‘confidence intervals’ as I don’t have those. These intervals only make sense in comparison with the ‘perfect’ model though, as different outcomes of games would affect all models roughly equally. Therefore I think the ranking of the models above is basically what it will be in any sample of 260 games or more.
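The spread of the ‘perfect’ model can be sketched as follows. This is a minimal illustration, not the code behind the plot: the shot lists are randomly generated stand-ins for real xG data, and the function names are hypothetical.

```python
import math
import random

def rmsep(predicted, actual):
    """RMSE divided by the mean of the actual values."""
    mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    return math.sqrt(mse) / (sum(actual) / len(actual))

def simulate_goals(shot_xgs, rng):
    """One simulated goal total: each shot scores with probability equal to its xG."""
    return sum(rng.random() < xg for xg in shot_xgs)

def perfect_model_rmseps(matches, n_sims=200, seed=0):
    """RMSEP of the xG totals against n_sims simulated sets of results."""
    rng = random.Random(seed)
    totals = [sum(shot_xgs) for shot_xgs in matches]
    results = []
    for _ in range(n_sims):
        goals = [simulate_goals(shot_xgs, rng) for shot_xgs in matches]
        results.append(rmsep(totals, goals))
    return results

# Made-up input: 520 team-match shot lists (260 games, two teams each).
data_rng = random.Random(1)
matches = [[data_rng.uniform(0.02, 0.3) for _ in range(data_rng.randint(6, 18))]
           for _ in range(520)]

values = perfect_model_rmseps(matches)
print("lowest :", round(min(values), 3))
print("highest:", round(max(values), 3))
```

Because the simulated goals are drawn from the very xG values being scored, the range of these values shows how much the metric moves through luck alone.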

What’s really surprising is that Michael Caley’s model performs almost as well as the ‘perfect’ model. This indicates that his estimates are really good and don’t have much room for improvement. This is somewhat unexpected, as the general consensus is that positional data will improve xG-models by a lot. My analysis shows that, although there’s still room for improvement, it won’t really matter that much (for xG-models).

Following at a decent margin we find 11tegen11 in second place and FootballStatistics in third. In fourth and fifth we find @SteMc74 and my own model, closely trailed by Willy Banjo’s model in sixth. In seventh and eighth we find the model by SciSports and Ben’s model. All the way at the back we find the ‘upper bound’, the Deadspin model. As I somewhat expected, this model performs very poorly on single matches and its ‘confidence interval’ doesn’t even touch the worst simulation of the ‘perfect’ model.

So what can we take from here? First of all, even a simple model like Ben’s is a lot better than just counting shots. Second of all, creating a good xG-model can be really hard, but it is not impossible. Caley’s model is living proof that it’s possible to create a model that’s close to a ‘perfect’ model, even without using positional data.

I’ve done some additional analysis that looks at whether the models have certain biases. I feel like this article will get too long if I add it here, so I’ll write something about that in a week or so. Great thanks to FootballStatistics (@stats4footy) for his work in this article. Also great thanks to all the modelers who were so kind to provide data for this analysis.

(NOTE: I decided not to include Paul Riley’s (@footballfactman) model in the analysis. His model looks at xG2, while all the other models in this analysis look at xG1. The main difference between them is that xG2 assigns a value of 0 to blocked shots and shots off target, while xG1 doesn’t look at what happens to a shot. The implication of this is that there are fewer shots to be given an xG-value, which will lead to a smaller RMSEP due to lower variance. I figured comparing his model with the rest would be like comparing apples and pears. For those interested, his RMSEP was 0.81, very close to but slightly behind Caley’s model.)

(NOTE 2: If you wish to reproduce these results please note that the actual value of the RMSEP for a model varies significantly between different samples. This is due to differences in the number and quality of shots in the games used. So if you want to see how your own model is doing, you’re going to have to use the first 260 games of the Premier League 15-16, or you’ll have to get data from all modelers for a different sample.)

How NOT to evaluate your xG model

Expected goals is a complex metric. Not only because it is difficult to calculate, but mostly because the models are very hard to evaluate. This is something I realized after recently creating my own Expected Goals (xG) model. (For those who are unaware of what xG means: it’s a metric describing the probability that a certain shot will end up being a goal. The simplest example of this is a penalty, which has an xG of about 0.75. In other words, about 3 out of 4 penalties are scored.) I soon realized it is very hard to determine how good my model really was, as I had nothing to compare it with.

Therefore I first set out on making a benchmark. In this case that benchmark would be a perfect xG model, so it is possible to ask yourself: how close is my model to being perfect? *(What does a ‘perfect’ xG even mean? I discuss this in the appendix since it is quite technical.)

How can you possibly have a perfect xG model?

I don’t. If I did, I probably wouldn’t even have to write this article. It is however fairly simple to find out how a perfect model would perform. I might not know the exact xG values for all shots during this BPL season, but let’s assume I do know them for this BPL season in an alternate universe (stick with me). Let’s just for now assume that the xG values I calculated for this BPL season were actually 100% correct (which they are definitely not). The only thing missing now is the actual results in this ‘second universe’, so to get these all I had to do was simulate all matches once using the ‘perfect’ xG values. This is also known as a Monte Carlo simulation. This gives one possible outcome of all matches. Let’s look at a 4-shot example:

On the left we can see that in reality, out of these shots, only shot 1 was scored. Next to that we can see the xG value my model assigned to those shots. Simulating these xG values gave the results on the right. These simulated goals now have the xG values as their true underlying probabilities. In other words, these xG values are ‘perfect’ and the ‘simulated goals’ on the right are one of the possible outcomes.

Whatever measure we are going to use, we can always check how a ‘perfect’ model would perform to compare. In the rest of this article, I will use this method as a benchmark.
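The simulation step itself is tiny. Here is a minimal sketch, in which the four xG values are invented for illustration (they are not the ones from the table) and `simulate_shots` is a hypothetical helper name:

```python
import random

def simulate_shots(xg_values, seed=None):
    """Simulate each shot once: it is a goal with probability equal to its xG."""
    rng = random.Random(seed)
    return [1 if rng.random() < xg else 0 for xg in xg_values]

# Hypothetical xG values for a 4-shot example.
xg_values = [0.31, 0.08, 0.12, 0.05]

# One 'alternate universe': a list of 0s and 1s, one entry per shot.
print(simulate_shots(xg_values, seed=42))
```

Running it again with a different seed gives another possible outcome. In every run the xG values are, by construction, the true scoring probabilities behind the simulated goals.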

R-squared is really, really not ok

When checking around the web what methods were used, I was surprised to see that R² was the most common way of evaluating whether an xG model was any good. In most cases I would see a plot similar to this:


What we see here is the number of goals all 20 teams in the 2014/2015 Premier League scored, and the sum of the Expected Goals a model of mine assigned to all shots taken by that team. Next I applied a linear regression, which gave an R² of 0.807. This sounds great!

But it isn’t. For several reasons:

  1. Information loss
    By summing all the xG values over a season, we lost a huge amount of data. We started with around 10000 points but reduced it to 20. Furthermore, this only gives us a sample size of 20, which is way too small.
  2. Is 0.807 even good?
    How good is this R² figure really? Apart from the fact that the small sample size probably means that the value is heavily affected by random variation, it is also not as good as it sounds. If we simply count the shots a team attempts in a season and plot it against the goals scored, in this specific example you’ll get an R² of 0.712! Over large samples like the one we use in this example, the xG per shot tends to be pretty similar for all teams, meaning the xG values you calculated won’t improve your results by much. Even more shockingly, a single simulation of the ‘perfect’ model gave an R² of 0.755, which is lower than what my model achieved. Obviously over a larger sample of shots it will outperform my xG model, but the fact that it doesn’t here shows how unreliable these numbers are. The variance over such a small sample size appears to be so big that we really can’t say anything useful about this R² value.
  3. It’s theoretically wrong
    R² measures how much of the variation in the response variable (actual goals) is explained by the explanatory variable (xG). To do this it finds the linear function that fits best. The line in the above example is:

Actual goals = -6.18 + 1.11 * xG

    This is NOT what we try to model when we create xG. The idea behind xG is that 1 xG is worth exactly 1 actual goal, which is not what is assumed by the linear regression method. For example, using the above formula we would expect to score 38 goals when we score 40 xG. This is clearly not what we aim to measure when using an xG model.
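To make the third point concrete, here is a small sketch. The four season totals are invented so that the model is systematically wrong, yet perfectly linear in the actual goals. The helper names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def r_squared(y, y_hat):
    """1 - SS_res / SS_tot for the predictions y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - hi) ** 2 for yi, hi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# The fitted line quoted above, evaluated at 40 xG.
print(round(-6.18 + 1.11 * 40, 2))

# Made-up season totals: the model squeezes every team towards 57.5 goals.
goals = [40, 50, 60, 70]
xg = [50, 55, 60, 65]

a, b = fit_line(xg, goals)
fitted = [a + b * x for x in xg]
print("R² with a fitted line:", round(r_squared(goals, fitted), 3))
print("R² taking xG at face value:", round(r_squared(goals, xg), 3))
```

The fitted line forgives the bias because it is free to rescale the xG values. Comparing goals with xG directly, which is how xG is actually used, does not.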

Go on then smartass, what metric should we use?

I have to admit that although I know R² is wrong, I’m not sure what the best way is to evaluate xG models. Personally I believe a good way to evaluate an xG model is by looking at smaller samples than entire seasons. One could for instance look at single match totals of xG values and actual outcomes. That will make the influence of individual xG estimations much bigger, while single matches usually are the smallest sample in which we actively look at xG.

In an upcoming blog I will explain this method. Furthermore I will evaluate a set of xG models from, among others, Michael Caley, SciSports, @SteMc74 and myself, using this method. Then we’ll finally know how close we are to a perfect xG model and which one is closest. If you want to participate with your own xG model please contact me on Twitter (@NilsMackay).

*Appendix (What is a perfect xG value?)

Since xG attempts to predict whether a shot ends up in the goal or not, one might say that a perfect xG model takes into account all possible variables. It would take into account things like wind speed, wind direction, keeper positioning, the keeper’s reaction time, the way the ball is hit, etc. However, such a model would perfectly predict whether a shot will become a goal or not and would therefore only return values of 1 and 0. In other words, for every shot it would say either: “Yes, this shot will become a goal” or “No, this shot will not become a goal”. The xG totals of such a model would simply equal the number of actual goals scored in a match, which would be rather useless. The purpose of xG is (in my opinion) not to predict right before a shot gets taken whether it becomes a goal or not. The way in which it is used is to assess the quality of a chance and thereby the quality of a team’s performance.

Therefore I prefer to look at xG as the probability a shot becomes a goal when the given player tries to score from that exact situation. This will give answers like: “If Messi tried this shot from this exact situation 100 times, he would probably score 24 times”, which would correspond with a 0.24 xG value. In this article, I assumed this definition of an xG.


Introducing my Expected Goals model

While browsing the web in March 2014 I stumbled upon an article by Sander Ijtsma (@11tegen11) and Michiel de Hoog (@MichielDeHoog) explaining why Lex Immers, at the time a regular starter in the midfield of Feyenoord, was one of the best players in the Eredivisie. At the time, that was a very controversial statement, as Immers was known to blow huge chances regularly. Many Feyenoord fans even blamed him for Feyenoord not winning the title that season, as Feyenoord finished second behind Ajax with only a four-point difference. The general consensus was that Immers was not good enough for a team like Feyenoord.

The article however gave a different view of reality, as it introduced me to the concept of Expected Goals (xG), a measure which quantifies how big a chance is. Basically, every shot on goal is given a value between 0 and 1, representing the probability that the shot will end up in the back of the net. The article showed that Immers was actually not blowing chances at all, but was scoring as much as expected. Furthermore it showed that Immers was actually very good at creating chances for a midfielder as well. The more analytic way of looking at football resonated with me, probably also partly because I was at the time (and still am) a student in Business Analytics. After reading the article I started following the football analytics community (mostly based on Twitter), very interested in what it had to offer.

Soon after, I decided I wanted to play around with the data myself, so I wouldn’t be dependent on other people to answer my questions. My first step was to create my own xG-model. I have to admit it took longer than I expected at first, but the first version of it is now done. That is also what this article is about. I will explain my methodology, which is different from those of the models I’ve seen so far, and test to see if it is actually doing what it suggests: predicting the probability that a certain shot will end up in the back of the net.


The variables I used to predict the xG-values are the following:

  • Shot location
  • Whether it’s a header or a shot
  • Whether it’s a penalty or not
  • Whether it’s an own goal or not

Obviously there are many more factors which influence the chance of a shot going in, such as assist type, positioning of defenders and many more. Especially assist type is something I might pick up later, but for now I tried to keep it simple.

All models I know calculate the influence of shot location by dividing it into several factors such as distance to goal, and angle to goal (some use even more). Although it might be a good approximation, to me it sounded like a very complex way to compute the influence of location. The problem is that the goal posts make the distribution of values across the field very complex. For instance, a shot from 10 cm on the outside of the goalpost on the goal line will have an xG-value of practically zero, whereas the xG-value for a shot from 10 cm on the inside of the goalpost on the goal line will be about 1. This makes the exact values very hard to approximate by using angle and distance only.

To me it sounded more logical and precise to calculate the probability of a goal for a shot from a certain location by literally counting how many shots were taken from that exact position and counting how many of them ended up in the goal. Thus I divided the football pitch into squares of about a square meter, by making 100 squares along the length of the field and 50 squares along the width. Doing this for shots only gives the following field, in which white corresponds with high xG-values and red with low xG-values:


This is a mess. Even though I’ve used 10 seasons’ worth of shots (about 80,000 shots) for this, the sample size seems to be too small, as the differences between neighboring squares are too big at certain locations. Furthermore, lucky long shots screw up the values for locations far from goal, as not many shots are attempted from such range. This makes the influence of one lucky goal on the resulting xG-value very big. The plot for headers was very similar.

To fix this, I decided to calculate the xG-value of a position by looking at all surrounding squares. This increases the sample size significantly, apart from the fact that it makes sense intuitively. The probability to score from a certain location won’t change significantly if you move less than 1 meter from that position. The actual shots from the square itself were given some extra weight.

This still has some issues. Lucky goals from long shots still have a huge influence on the xG-value for that square. Furthermore, by also counting the squares around the actual square, the problem with the goalposts arises again. To solve this, I decided that squares that didn’t have a minimum number of shots taken from them and goals scored from them would be set equal to the minimum xG-value, which is about 0.017. This means that no matter where a shot is attempted from, it will always have an xG-value of at least 0.017. The idea behind this is that players will only attempt a shot if they think it’ll have at least a certain probability of ending up in the goal.

This sounds very specific, but really all it does is eliminate weird xG-values for squares with too small a sample size, and eliminate incorrect xG-values for squares from which no goal has ever been scored. I think it’s safe to say that if in a sample of 80,000 shots there isn’t a single goal scored from a certain location, the xG-value for that position is probably not that big.
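The counting, smoothing and floor steps can be sketched roughly as follows. The 100 by 50 grid and the 0.017 floor come from the text. Everything else is an assumption made for illustration: the 0-100 coordinate scale, the centre weight of 3, the thresholds of 30 shots and 1 goal, and the choice to apply those thresholds to the whole neighbourhood rather than to the single square.

```python
from collections import defaultdict

GRID_X, GRID_Y = 100, 50       # squares along the length and the width (from the text)
MIN_XG = 0.017                 # floor value quoted in the text
CENTRE_WEIGHT = 3.0            # assumption: the text only says "some extra weight"
MIN_SHOTS, MIN_GOALS = 30, 1   # assumption: the actual thresholds are not given

def to_square(x, y):
    """Map coordinates on an assumed 0-100 scale to a (column, row) grid square."""
    col = min(int(x * GRID_X / 100), GRID_X - 1)
    row = min(int(y * GRID_Y / 100), GRID_Y - 1)
    return col, row

def count_shots(shots):
    """shots: iterable of (x, y, is_goal). Returns per-square shot and goal counts."""
    n_shots, n_goals = defaultdict(int), defaultdict(int)
    for x, y, is_goal in shots:
        square = to_square(x, y)
        n_shots[square] += 1
        n_goals[square] += int(is_goal)
    return n_shots, n_goals

def square_xg(square, n_shots, n_goals):
    """Smoothed xG for one square: the square plus its 8 neighbours, with a floor."""
    col, row = square
    raw_shots = raw_goals = 0
    weighted_shots = weighted_goals = 0.0
    for d_col in (-1, 0, 1):
        for d_row in (-1, 0, 1):
            neighbour = (col + d_col, row + d_row)
            weight = CENTRE_WEIGHT if neighbour == square else 1.0
            shots_here = n_shots.get(neighbour, 0)
            goals_here = n_goals.get(neighbour, 0)
            raw_shots += shots_here
            raw_goals += goals_here
            weighted_shots += weight * shots_here
            weighted_goals += weight * goals_here
    # Too little data in the neighbourhood: fall back to the minimum xG-value.
    if raw_shots < MIN_SHOTS or raw_goals < MIN_GOALS:
        return MIN_XG
    return max(MIN_XG, weighted_goals / weighted_shots)

# Made-up data: 40 shots from one spot, 10 of them scored.
shots = [(90, 50, i < 10) for i in range(40)]
n_shots, n_goals = count_shots(shots)
print(square_xg((90, 25), n_shots, n_goals))   # a square with data
print(square_xg((10, 10), n_shots, n_goals))   # a square with no shots nearby
```

With real data the two thresholds and the centre weight would need tuning, since they trade smoothness against local detail around the goalposts.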

The updated field, for shots only, then looks like this:


That looks better. You can clearly see that if the angle becomes too sharp, the chance to score drops immensely. The lucky long shots are accounted for, and the probability a shot becomes a goal rises quickly as you approach the goal. On the goal line it is nearly 100%. The field for headers is slightly different and generally gives lower values, but it looks quite similar.

Does it work though?

Although it looks pretty, the question that arises immediately is: does it work? Or in other words:

  • Does it correlate well with the actual scoring chances within the sample data?

And more importantly:

  • Is it able to predict the probability a shot outside the sample will end up in the goal?

Let’s start with the first one. To check if the model even agrees with the sample data, I grouped shots in small bins which are determined by xG-value. For example, if a shot is given an xG-value of 0.12 it is put in the bin that contains shots with values between 0.1 and 0.2. Next I calculated the average of the xG-values in the bin, and calculated the number of those shots that actually became a goal in real life. This gave the following table:


Just to clarify, the actualxG values are the percentages of shots within the bin that were scored. The modelxG is the average xG-value that the model assigned to those shots. Thus the values within the modelxG column are by definition within the range of the bin, which is not necessarily true for the values in the actualxG column.
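A sketch of how such a table could be built, assuming ten equal-width bins as in the 0.1 to 0.2 example. The function name, the tuple layout and the three sample shots are invented for illustration.

```python
def calibration_table(xg_values, outcomes, n_bins=10):
    """Group shots by xG bin and compare the mean xG with the actual scoring rate.

    Returns one (bin_low, bin_high, shots, model_xg, actual_xg) tuple per
    non-empty bin.
    """
    bins = [{"shots": 0, "xg_sum": 0.0, "goals": 0} for _ in range(n_bins)]
    for xg, is_goal in zip(xg_values, outcomes):
        index = min(int(xg * n_bins), n_bins - 1)   # an xG of exactly 1.0 goes in the top bin
        bins[index]["shots"] += 1
        bins[index]["xg_sum"] += xg
        bins[index]["goals"] += int(is_goal)

    rows = []
    for i, b in enumerate(bins):
        if b["shots"] == 0:
            continue
        rows.append((i / n_bins, (i + 1) / n_bins, b["shots"],
                     b["xg_sum"] / b["shots"],    # modelxG: average xG in the bin
                     b["goals"] / b["shots"]))    # actualxG: share of shots scored
    return rows

# Three made-up shots: two in the 0.1-0.2 bin, one in the 0.9-1 bin.
for row in calibration_table([0.12, 0.18, 0.95], [0, 1, 1]):
    print(row)
```

Feeding it shots the model was not fitted on turns the same function into an out-of-sample check.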

Once again the effect of sample size is easily visible. The bins that contain most of the shots have the highest accuracy. I’m pretty happy with the overall results. For most of the shots the actual bin value and the model bin value are less than 0.3 percentage points apart. For rarer shots this increases slightly, but the differences for those shots are still fairly small. Notice that all 26 shots in the bin for values between 0.9 and 1 have been scored thus far. This is likely a ‘hot streak’, as a 100% chance practically doesn’t exist. The model rates the average xG-value for those shots at around 94%.

The fact that the model’s values are close to the actual probabilities was to be expected. The model itself uses the number of times a shot went into the goal from that position. The fact that the values are similar doesn’t say that much, apart from the fact that the model describes its own sample adequately. More interesting is to see if the model is able to predict the chance a shot will become a goal for shots outside of the sample. The sample I used for this model consists of the seasons 12/13 and 13/14 for all 5 major leagues (Premier League (ENG), La Liga (ESP), Bundesliga (GER), Serie A (ITA) and Ligue 1 (FRA)). To see if the model has predictive value, I will do a similar test as above, except that the shots will be from the season 14/15 for those 5 leagues. These shots were not used to make the model. The table now looks as follows:


I’m very happy with the results. The difference for the small chances increased slightly, but is still below a 1% difference for most shots. As we can see the chances within the 0.9-1 bin are not all scored this time, as we expected.

Obviously these figures aren’t perfect. For instance, let’s look at the 0.1-0.2 bin. The model predicted the shots to be scored around 14.1% of the time. If we were to simulate all 9886 shots within this bin using that probability, about 1394 of them would be scored on average. In reality, only 1315 of those shots were scored. A simple binomial test shows us that if the probability of 14.1% is correct, the probability that 1315 or fewer goals would occur is only around 1%. That’s strong evidence that the model isn’t perfect, but that’s also not the point of what I did. The model I created only uses location and some very basic things to calculate the scoring probability, while in real life the scoring probability obviously depends on more variables. It does however give a very decent estimate and is easy to understand. The addition of more variables should improve the results even more.
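The binomial tail probability for this bin can be checked directly. The sketch below uses only the standard library and the rounded 14.1% from the text, so the exact figure will shift a little with the unrounded average. `binom_cdf` is a hypothetical helper name.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    log_p, log_q = math.log(p), math.log(1 - p)
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return total

n_shots, p_model, goals = 9886, 0.141, 1315

print("expected goals:", round(n_shots * p_model))
print("P(1315 or fewer goals):", binom_cdf(goals, n_shots, p_model))
```

A small tail probability means the gap between 1394 expected and 1315 actual goals is unlikely to be luck alone if 14.1% were the true scoring rate for these shots.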

Hope you enjoyed! If you did, please share. My next blog will be expanding on this subject. It’s my first blog so any comments/advice/feedback would be appreciated. If you find a flaw in my reasoning or calculations please let me know!