How to ruin tennis (a modelling sequel)
It’s that time of year again. The weather starts to get less miserable and (some of us) get excited about… no, I’m not talking about the next Life Conference. Of course, it’s the greatest tennis competition in the world. Wimbledon: a paradise in South London fraught with Pimm’s and overpriced strawberries and cream.
After Craig Lynch and Deven Rickaby’s groundbreaking APR article entitled ‘How to ruin football’ in 2020, you may have wondered to yourself: surely there aren’t any other sports that APR could ruin? Unfortunately for you, after sitting the first couple of actuarial exams recently, we were both spurred on to use our new-found statistical powers to ruin everyone’s favourite tennis competition.
Perhaps this year more than ever, attention is focused on predicting the emerging contenders for the Wimbledon title, mainly because a lot of effort is needed even to predict which players will be entered into the initial draw!
Of course, the controversy that has been dominating the headlines is the ban on Russian players, which rules out not only the highly-ranked Daria Kasatkina and Andrey Rublev but, perhaps most importantly, the newly-promoted Men’s world number one, Daniil Medvedev. As a result, Wimbledon has been stripped of its ranking points for 2022, forcing many big names to seriously consider giving up their place at the All England Club.
But for those still eligible and willing, the drama definitely does not end there. Djokovic will be hungry for another Grand Slam victory after losing against Nadal in the French Open quarter-finals. Nadal went on to win a historic 14th Roland Garros title and, for the first time in his career, has the chance to win all four Grand Slams in a calendar year. He therefore has to be included in any discussion of potential champions (even if he has to get numbing injections for umpteen injuries while he takes the title).
In the Women’s tournament all eyes will be on Iga Świątek who, at the time of writing, is on an unbelievable 35-match win streak that includes 6 consecutive tournament victories. There is no question that she is right in the midst of the contenders, but the more exciting question is: who will be able to stop her? From a neutral perspective, one might like to see Coco Gauff pick herself up from her unfortunate French Open final defeat. After all, Wimbledon is the tournament where she made a name for herself, defeating Venus Williams back in 2019 as a then-15-year-old.
Who is going to win the Men’s and the Women’s Singles – are the favourites in the eyes of the people really expected to win? Who are the top 10 favourites for each? How many rounds of the tournament do we expect, say, the great Richard Gasquet to progress through? These are the important questions that we wanted to answer.
To find out some answers, we created two different prediction models – one for the Men’s Singles, one for the Women’s Singles. In this article, we summarise how our two models worked and assess their results. The models themselves adopt very different approaches, but both are useful examples of how one might go about modelling a prediction problem like this.
For any readers keen to see the mathematical detail underlying the models, please do get in touch and we will be happy to share a supplementary document. And for any avid tennis fans with no interest in modelling whatsoever, feel free to skip to our predictions at the end.
For now, read on to see us serve up some Wimbledon 2022 predictions, and ace a few bad tennis puns along the way!
Men’s Singles – Model Overview
For the Men’s model, we make a few assumptions to begin with:
- Every player has their own probability of winning a match, regardless of whom they are playing. We’d expect the best players to be more likely to win their matches.
- This probability for each player can vary a bit from match to match. Maybe sometimes they wake up on the wrong side of the bed, maybe they prefer playing in Miami to, say, Eastbourne (or vice-versa).
- In fact, prior to any calculations, we assume that they are likely to vary in a particular way. We expect most players will congregate around having a 50% win percentage, with a few of the great players hovering around the higher percentages.
Below is a plot of our initial guess of how the players’ win probabilities vary. The peak of this graph is slightly less than 0.5, which ties in with our prior belief that the Men’s game is dominated by a few players who win a lot more than everyone else.
Now, we look at each player’s wins and losses from ATP tournaments in the last two years. Using these results, and our prior belief about how we think their win probability varies in each match, we pick our best guess for what their average win probability is. What’s more, once we have this, we can use our model to predict how many matches the player will win in a given tournament – cue the Richard Gasquet prediction you’re waiting for!
To summarise, essentially we’re starting with an initial guess, and then using actual results to improve our guesswork by adjusting it. The expected number of rounds we obtain won’t necessarily be a whole number, as it represents an average of all the rounds we expect a given player might reach, accounting for their possible variation.
In statistical language, the belief we have about the player win probabilities prior to any calculation is called the prior distribution. Post-observation of their match results, we get what’s called a posterior distribution for our win probabilities. What our model calculates is a posterior mean for the win probabilities of each player – this is our best estimate for how likely we think they are to win a match. This is an example of a Bayesian method – we use our prior belief about a number to influence our prediction for that number in the future. The supplementary document gives more detail of the background here.
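For readers who prefer code to prose, here is a minimal sketch of the prior-to-posterior update described above. It assumes a Beta prior on a player's win probability, which is our assumption for illustration only, since the article does not name the prior. Under that assumption the posterior is also a Beta distribution and its mean has a simple closed form. The prior parameters (chosen only so that the prior peaks a little below 0.5) and the win/loss record are made-up values, not the authors' inputs.

```python
def posterior_mean_win_prob(wins, losses, prior_a=4.0, prior_b=5.0):
    """Posterior mean of a win probability under an ASSUMED Beta(prior_a, prior_b) prior.

    With a Beta prior and win/loss match results, the posterior is
    Beta(prior_a + wins, prior_b + losses), whose mean is returned here.
    """
    if wins < 0 or losses < 0:
        raise ValueError("wins and losses must be non-negative")
    return (prior_a + wins) / (prior_a + prior_b + wins + losses)


# Hypothetical record: 30 wins and 20 losses over the two-year window.
prior_only = posterior_mean_win_prob(0, 0)   # 4/9: the prior mean, before any data
updated = posterior_mean_win_prob(30, 20)    # 34/59: pulled from the prior towards 30/50
print(round(prior_only, 3), round(updated, 3))
```

The more matches a player has on record, the less the prior matters: the estimate moves from the prior mean towards the raw win rate as wins and losses accumulate.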
Men’s Singles – Model Limitations and Possible Improvements
As George E.P. Box famously said, ‘all models are wrong but some are useful’. Wrong seems a bit harsh, but this model is definitely not perfect. It is left to the reader to decide whether this model is useful.
Key limitations and points for improvement are:
- Independence of the win probabilities. Ideally, the win probability would depend on the difference in ability between the two players; currently there is no allowance for this.
- No allowance is made for recent form or recent injury. If a player had a great win streak at the start of 2021, this would be considered as valuable as having a win streak just before Wimbledon. We could improve this by assigning more of a weighting to recent matches.
- No allowance is made for relative performance on different surfaces. This model assumes Nadal is just as good on clay as he is on grass. To improve this, we could put in separate prior variables to represent each of the different surfaces.
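As one possible way to act on the recent-form point above, the sketch below down-weights older matches before they are counted as wins and losses. The exponential decay and the 180-day half-life are arbitrary assumptions for illustration, and `weighted_record` is a hypothetical helper rather than part of the authors' model.

```python
def weighted_record(results, half_life_days=180.0):
    """Turn dated results into recency-weighted win and loss totals.

    results: iterable of (days_before_tournament, won) pairs.
    A match played `half_life_days` ago counts half as much as one played today.
    """
    wins = losses = 0.0
    for days_ago, won in results:
        weight = 0.5 ** (days_ago / half_life_days)
        if won:
            wins += weight
        else:
            losses += weight
    return wins, losses


# Hypothetical results: (days before the tournament, whether the match was won).
results = [(0, True), (180, False), (360, True)]
wins, losses = weighted_record(results)
print(wins, losses)  # 1.25 0.5
```

The weighted totals could then replace the raw win and loss counts when updating the prior, so that a streak just before Wimbledon counts for more than one from early 2021.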
Women’s Singles – Model Overview
Some of the assumptions that we make for the Women’s model are:
- The number of rounds reached by a player is a variable that follows a specific chosen distribution (in our case, we assume a Poisson distribution – details of this can be found in the supplementary document mentioned earlier).
- Historical observations of how far players make it in the draw are independent of each other. This assumption is actually partially violated, since our data contains multiple observations for any player that played in more than one Wimbledon tournament.
We use what is called a generalised linear model (GLM) to directly model the expected number of rounds that a player will make it to at Wimbledon, which again may not necessarily be a whole number.
By examining certain characteristics of Wimbledon players from the last 15 years, we can pick out potential factors (or covariates) that may have impacted their performance at the tournament, while trying to find a relationship between these and how far the player made it in the draw. These could be anything from their height to their win rate on a given surface, or simply a measure of just how much they love strawberries and cream (spoiler alert: that will not be included in the final model). We fit various models that use different combinations of these covariates, allowing us to narrow them down to those with a good ability to make accurate predictions.
A major difference from the Men’s model is that rather than trying to separate out players individually, we are instead trying to find global patterns, i.e. we assume that a certain factor impacts the Wimbledon performance of all players in the same way. This means that when we come to investigate the effect of, say, a player’s height on their number of rounds reached, we can use data from all participants from previous years to try and quantify this in our model.
The factors that we find to be most valuable in the final model are:
- World ranking immediately prior to Wimbledon.
- Number of matches on clay/hard/grass courts in the preceding year.
- Number of wins on clay/hard courts in the preceding year.
- Number of main draw wins at the previous Wimbledon.
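To make the GLM idea concrete, here is a minimal sketch of a Poisson regression fitted by Newton-Raphson in plain Python. It is not the authors' fitted model. It uses a single covariate (the logarithm of world ranking), assumes a log link, and runs on a handful of made-up data points, all of which are assumptions for illustration.

```python
import math


def fit_poisson_glm(x, y, iterations=25):
    """Fit log(E[y]) = b0 + b1 * x by Newton-Raphson (Poisson GLM, assumed log link)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: first derivatives of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Entries of the 2 x 2 information matrix [[a, b], [b, c]].
        a = sum(mu)
        b = sum(mi * xi for xi, mi in zip(x, mu))
        c = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = a * c - b * b
        # Newton step: add (information matrix)^-1 times the score.
        b0 += (c * g0 - b * g1) / det
        b1 += (a * g1 - b * g0) / det
    return b0, b1


# Made-up illustrative data: pre-tournament ranking and rounds reached.
rankings = [1, 5, 10, 20, 50, 100]
rounds_reached = [6, 5, 4, 3, 2, 1]
b0, b1 = fit_poisson_glm([math.log(r) for r in rankings], rounds_reached)


def predict_rounds(rank):
    """Expected rounds reached for a given ranking under the fitted toy model."""
    return math.exp(b0 + b1 * math.log(rank))


print(round(predict_rounds(1), 2), round(predict_rounds(100), 2))
```

Because the same two coefficients are used for every player, this is the "global pattern" idea in miniature: each data point informs the estimated effect of ranking for all players at once.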
The method detailed above is an example of a frequentist method, which is basically a method that makes estimations based only on our data set. This is in direct contrast to the Bayesian method used to model the Men’s draw, in which a prior distribution had to be assumed in addition to the data set. The choice of using a frequentist or Bayesian framework for modelling is often a topic of heated debate among statisticians, but it can safely be agreed that there are pros and cons to each.
Women’s Singles – Model Limitations and Possible Improvements
George E.P. Box’s quote rings just as true for our Women’s Singles model, with several areas in which improvements could be made.
One somewhat unfortunate oversight was constructing our model in such a way as to make predictions using the last 12 months of match results, which seemed sensible… until it was announced that 7-time Wimbledon champion Serena Williams was expected to return from a 12-month injury absence. Although it was great news from a tennis perspective, it wasn’t such great news for us!
Other limitations of the model include:
- The violation of the independence assumption among our observations. To deal with this, we could have built a prediction model that accounts for the dependence in the data. However, it is unlikely that this issue has a major impact on our final results.
- The global estimation of model parameters (assuming they are identical across all players) when in fact it may be more realistic to allow impacts of some factors to vary across specific players.
Finally, the moment you’ve all been waiting for. Using our models above, here are our predictions for Wimbledon 2022. We show the round that we expect each player to make it to (which could also be interpreted as the average round they would make it to if the tournament were played many times). Here, a value of 1.00 translates to definitely going out in the 1st Round and a value of 8.00 translates to definitely winning the title.
In the Men’s, it can be no real surprise that Djokovic comes out on top. We expect him on average to at least reach the 4th round of Wimbledon. In fact, using Djokovic’s calculated win probability, we can estimate that he has a 17% chance of winning the whole thing. The 6-time champion has been so dominant over the last few years and he is still the man to beat. Nadal follows closely in second place, indicating that we are still in the era of the tournaments being dominated by the greats.
With both of the authors of this article being based in the Edinburgh APR office, you may be surprised to see that Andy Murray misses out on the top 10. In fact, we actually have him ranked 31st. Statistics aside, we just use this as proof that we have not been swayed by any confirmation bias when creating these models!
After her long winning streak, it seems sensible that Iga Świątek has come out as our favourite to win the title in the Women’s draw. We would expect her to make it to at least the quarter-finals (further than Djokovic, perhaps a bold prediction given her past form on grass). This prediction translates to a 42% chance of her winning the tournament, which emphasises the extent of her domination just now.
The caveat that comes with the Women’s results is an absence of a prediction for Serena Williams due to her lengthy injury period. However, using our “expert judgement”, as actuaries are so often asked to do, it seems unlikely she would make our top 10 with a lack of recent match practice behind her.
Lastly, in case you thought we’d forgotten, the Men’s model predicts that Richard Gasquet will win any given match with probability of 48.5% – which almost translates as an expected trip to the 2nd round for the decorated Frenchman. In fact, his estimated number of rounds is 1.94 and he has a probability of winning the tournament of 0.64% – well, you never know.
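As a rough check on the arithmetic, the sketch below treats a player's win probability as a single fixed number for all seven rounds. This is a simplification we are assuming here, since the model itself allows the probability to vary. The function names are our own, and the scale is the one used for the predictions (1.00 means out in the 1st Round, 8.00 means champion).

```python
def expected_round(p, rounds=7):
    """Expected value on the 1.00-to-8.00 scale for a fixed per-match win probability p."""
    # The player adds one to their round for each consecutive win, and they win
    # k matches in a row with probability p**k, so the expectation is
    # 1 + p + p**2 + ... + p**rounds.
    return sum(p ** k for k in range(rounds + 1))


def title_probability(p, rounds=7):
    """Probability of winning all `rounds` matches with fixed win probability p."""
    return p ** rounds


p = 0.485  # Gasquet's per-match win probability quoted above
print(round(expected_round(p), 2))           # 1.94
print(round(100 * title_probability(p), 2))  # 0.63 (per cent)
```

The expected round matches the 1.94 quoted above. The title probability comes out marginally below the quoted 0.64%, presumably because the model averages over the variation in the win probability rather than fixing it.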
Now it’s time for us all to sit back, enjoy the Championships and sincerely hope that Djokovic and Świątek don’t make shock first-round exits – our credibility depends on it!
Addendum – written on 19th July, after Wimbledon 2022 had finished
So, how did we do? They say hindsight is a wonderful thing, but they probably haven’t tried predicting a tennis tournament. There were upsets, there were disappointments, there were shocks, and of course there were truckloads of strawberries and cream.
In the Men’s Singles, the drama began to unfold when our Number 5 pick, Berrettini, had to withdraw due to COVID-19 in the first round. At this point, we were starting to get worried. Casper Ruud (Number 4) and Stefanos Tsitsipas (Number 3) bowed out in the second and third rounds respectively, leaving us with a depleted stock from our top 10.
Fortunately, three of them – Cameron Norrie, Nadal and Djokovic – all made it to the semi-finals. However, our model failed to predict the injury-forced withdrawal of Nadal at this point, which left us with an unexpected Djokovic-Kyrgios final. Kyrgios, whom our model ranked as the 20th most likely to win, had a characteristically explosive tournament, but he was no match for Djokovic. The Serb’s 7th Wimbledon title and 21st Grand Slam never really looked in doubt, and we can say we knew that all along.
It’s good that we can say that, because we certainly can’t say that we had full confidence (or much at all really) that Elena Rybakina would win the Women’s tournament, as she did. Our model had her as 18th most likely to take the title, but bear in mind that Wimbledon’s own seedings placed her in 17th before the tournament, so we weren’t far off the official list. We consoled ourselves with this fact after the heartbreak of seeing Iga Świątek – our predicted favourite by some distance – end her 37-match winning streak in style with a huge loss to Alizé Cornet in the third round. We’re yet to face up to the “I told you so” that will be coming our way from the reviewer of this article and APR colleague, John Nicholls, so that’s something to look forward to…
Before you decide to chuck our model into whatever the equivalent of a dustbin is for mathematical models, you should know that our number 2 pick, Ons Jabeur, did actually make it to the final. She became the runner-up to Rybakina after a three-set battle that swung both ways. Our reputation just about remains intact.
Modelling aside, it was yet another glorious instalment in the Wimbledon saga as far as the tennis was concerned. As obsessed as we were with outcomes matching our predictions, we still managed to find some moments of solace to sit back and appreciate the quality of tennis going on. Richard Gasquet even made it to the third round!
But the question on everybody’s lips is: were our models actually any good? It’s difficult to say with certainty – we only ever said who was more likely to win any given match. So we won some and we lost some. Djokovic was victorious and Świątek bombed out. What is certain is that we will need to ruin tennis once again next year…
Ross Witney-Hunter and Josh Payne
Numbers of matches and wins relate to tournaments at WTA 250 level or above (i.e. any wins at lower-level tournaments don’t contribute).