What predicts team score in an AFL match?
Introduction
I am back with another AFL post! My last one proved to be quite popular, so I decided to cash in on the excitement and go even deeper. This time, we are going to again leverage the amazing fitzRoy R package to investigate the relationship between various quantities measured in an AFL match (such as marks, contested possessions, etc.) and a team's score at the end of the match. I previously did this analysis years ago for a unit in the coursework component of my PhD, and figured it was time to revisit it with updated data and methods. I also have a secondary goal (pun intended) for this article: to demonstrate the importance and value of heteroscedastic-robust estimators and marginal effects for interpreting statistical models.
Setting up R
Before we get started, here are the packages we’ll need:
library(dplyr)
library(tidyr)
library(ggplot2)
library(janitor)
library(fitzRoy)
library(MASS)
library(marginaleffects)
library(patchwork)
Downloading and cleaning match-level statistics
fitzRoy is an amazing package – it allows us to pull player-level data for every match, down to the number of handballs made, disposal efficiency, and so on. Based on my prior modelling of scores with this data, we are going to replicate that analysis and pull out a few interesting quantities to use as predictor variables:
- Tackles
- Clangers (unforced errors)
- Inside 50s
- Marks inside 50
- Contested marks
- Contested possessions
- Free kicks for
- Handballs
- Hit outs
- Marks
- Rebounds
# All season x round combinations (2020 excluded due to the COVID-affected fixture)
combns <- crossing(season = 2013:2024, round_number = 1:26) %>%
  filter(season != 2020)

match_stats <- vector(mode = "list", length = nrow(combns))

for(i in 1:nrow(combns)){
  tmp <- combns[i, ]

  # Not every season has 26 rounds, so skip combinations that fail to fetch
  match_stats[[i]] <- tryCatch({
    fetch_player_stats(season = tmp$season, round_number = tmp$round_number) %>%
      clean_names() %>%
      reframe(score = (sum(goals) * 6) + sum(behinds),
              marks = sum(marks),
              handballs = sum(handballs),
              hit_outs = sum(hitouts),
              tackles = sum(tackles),
              rebound50s = sum(rebound50s),
              inside_50s = sum(inside50s),
              clangers = sum(clangers),
              frees_for = sum(frees_for),
              contested_possessions = sum(contested_possessions),
              contested_marks = sum(contested_marks),
              marks_inside_50 = sum(marks_inside50),
              .by = c("home_team_name", "away_team_name", "team_name"))
  }, error = function(e) NULL)
}

# Drop rounds that returned nothing and bind into one data frame
match_stats <- match_stats[!sapply(match_stats, is.null)]
match_stats <- do.call("rbind", match_stats)
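Before going further, a cheap structural check can catch download problems early. Since we summarised at the team level, each match should contribute exactly two rows (one per team). Here is a minimal sanity check along those lines (an optional extra, not part of the original analysis):

# Each match contributes one row per team, so the row count should be even
# and every score should be non-negative
stopifnot(nrow(match_stats) %% 2 == 0, all(match_stats$score >= 0))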
Visualising the relationships
With our data in hand, we can start the analysis by visualising key relationships (as we always should!). Here are the bivariate relationships between team total score and each of the various match statistics we pulled:
match_stats %>%
  pivot_longer(marks:marks_inside_50, names_to = "names", values_to = "values") %>%
  ggplot(aes(x = values, y = score)) +
  geom_point(alpha = 0.2) +
  geom_smooth(formula = y ~ x, method = "rlm", colour = "steelblue2", fill = "steelblue2", alpha = 0.3) +
  labs(title = "Bivariate relationships between various AFL metrics and total score",
       x = "Value",
       y = "Team final score") +
  theme_bw() +
  theme(strip.background = element_blank()) +
  facet_wrap(~names, scales = "free_x")
Keen R users might note that I used MASS::rlm (robust linear model) for the regression lines on the plots. I did this because we have pretty evident heteroscedasticity (i.e., the variance changes as a function of the predictor variable) and some outlying points whose influence I wanted the regression lines to be robust to. However, this phenomenon might be a bit hard to determine visually from such a busy matrix of plots, so let's zoom in on one example:
match_stats %>%
  ggplot(aes(x = contested_possessions, y = score)) +
  geom_point(alpha = 0.2) +
  geom_smooth(formula = y ~ x, method = "rlm", colour = "steelblue2", fill = "steelblue2", alpha = 0.3) +
  labs(x = "Contested possessions",
       y = "Team final score") +
  theme_bw()
Here we can see a sort of 'fanning-out' shape as we move from left-to-right along the \(x\)-axis (number of contested possessions). In other words, the variance in final score increases as the number of contested possessions increases. This violates a core assumption of linear regression: that the errors are homoscedastic (i.e., have constant variance). As such, we need to think about adopting one of three solutions in our statistical modelling approach:
- Use a robust linear model (as I did in the basic illustrative plot above).
- Model the heteroscedasticity explicitly under a Bayesian inference framework by specifying a prediction equation for the variance term \(\sigma\) in addition to our response variable equation.
- Use heteroscedastic-robust standard errors when computing confidence intervals, or use bootstrapping to obtain empirical, potentially non-symmetrical confidence intervals. Both of these options are made extremely accessible by the incredible marginaleffects R package.
In my initial analysis of this data for my PhD coursework, I used approach 3 – heteroscedastic-robust standard errors. I am going to adopt the same approach here for comparability.
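As an aside, if you want a formal test to back up the visual impression of heteroscedasticity, a Breusch–Pagan test is one option. Here is a minimal sketch using the lmtest package (an extra dependency not loaded above):

# Breusch-Pagan test on a simple bivariate fit: a small p-value is
# consistent with the fanning-out pattern we saw in the plot
bp_fit <- lm(score ~ contested_possessions, data = match_stats)
lmtest::bptest(bp_fit)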
Heteroscedastic-robust standard errors
Heteroscedastic-robust estimators produce robust estimates of standard errors, test statistics, and p-values. Robust estimators are implemented in R by the sandwich package (which marginaleffects calls under the hood). They work by introducing a term, \(\Omega\), into the variance-covariance calculation (first equation below). Under the classical homoscedasticity assumption, \(\Omega = \sigma^{2}I_{n}\) – every observation shares the same variance. Robust estimators relax this by allowing the entries along the diagonal of \(\Omega\) to differ (second equation below). When heteroscedasticity is present, this typically increases standard errors, shrinks test statistics, and pushes p-values away from zero, so that our inferences reflect the true variance structure of the data.
\[ \widehat{\mathrm{Var}}(\hat{\beta}) = (X^{'}X)^{-1}X^{'}\Omega X (X^{'}X)^{-1} \]
\[ \Omega_{\text{homoscedastic}} = \sigma^{2}I_{n} \qquad \text{vs.} \qquad \Omega_{\text{robust}} = \mathrm{diag}(\omega_{1}, \ldots, \omega_{n}) \]
Numerous \(\Omega\) options are available in the sandwich package. Most of the options returned negligibly different values for this analysis, so the default HC3 estimator recommended by the package authors was retained. Its diagonal weights are defined as:
\[ \omega_{i}^{HC3} = \frac{\hat{u}_{i}^{2}}{(1-h_{i})^{2}} \]
where \(\hat{u}_{i}\) is the \(i\)-th residual and \(h_{i}\) is the leverage (hat value) of observation \(i\).
It’s important to keep in mind that these estimators are applied after fitting a statistical model – they are adjustments to the calculation of standard errors (and subsequently, confidence intervals).
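To make that concrete, here is a minimal sketch that computes the HC3 sandwich estimator 'by hand' for a toy lm fit (on mtcars, purely illustrative – not our AFL model) and checks it against sandwich::vcovHC:

# Illustrative only: manual HC3 sandwich estimator versus sandwich::vcovHC()
library(sandwich)

fit <- lm(mpg ~ wt + hp, data = mtcars)

X <- model.matrix(fit)
u <- residuals(fit)
h <- hatvalues(fit)

omega <- u^2 / (1 - h)^2             # HC3 diagonal weights
bread <- solve(crossprod(X))         # (X'X)^{-1}
meat <- t(X) %*% (omega * X)         # X' diag(omega) X
vcov_hc3 <- bread %*% meat %*% bread # the sandwich formula

all.equal(vcov_hc3, vcovHC(fit, type = "HC3"), check.attributes = FALSE)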
Fitting a statistical model
It's modelling time! Note that I am going to use a Poisson generalised linear model because our response variable – total team score – is a non-negative integer, so assuming a Gaussian continuous response is inappropriate. This can be done using straightforward glm syntax:
mod <- glm(score ~ marks + handballs + hit_outs + tackles + rebound50s + inside_50s +
             clangers + frees_for + contested_possessions + contested_marks + marks_inside_50,
           data = match_stats, family = poisson(link = "log"))
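Before interpreting anything, it is worth a quick check that the Poisson assumption of equal mean and variance is not wildly off. A rough Pearson dispersion statistic (a sketch, not part of the original analysis) should sit near 1; values far above 1 would point towards a quasi-Poisson or negative binomial model instead:

# Rough overdispersion check: Pearson chi-square divided by residual df
sum(residuals(mod, type = "pearson")^2) / df.residual(mod)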
We are going to ignore model summaries and coefficient values because they are not directly interpretable on the link scale, and go straight to the informative quantities we want: marginal effects and predictions, which are reported on the scale of interest (e.g., team score) rather than as link-function-transformed values of the coefficients.
Let's start with predictions. Intuitively, this quantity represents the average team score conditioned on values of our predictor variable(s). The code below defines a function to do this for all our predictors, taking care to apply the HC3 robust variance-covariance estimator when computing the standard errors and confidence intervals:
#' Function to plot predicted team score conditioned on a predictor variable
#'
#' @param model \code{glm} object containing the model
#' @param condition \code{character} denoting the variable to plot
#' @return \code{ggplot} object containing the plot
#' @author Trent Henderson
#'

predict_scores <- function(model, condition){
  p <- plot_predictions(model, condition = condition, vcov = "HC3") +
    labs(subtitle = condition,
         x = "Value",
         y = "Total team score") +
    theme_bw()

  return(p)
}
plots <- list()

vars <- c("marks", "handballs", "hit_outs", "tackles", "rebound50s", "inside_50s",
          "clangers", "frees_for", "contested_possessions", "contested_marks", "marks_inside_50")

for(i in vars){
  plots[[match(i, vars)]] <- predict_scores(model = mod, condition = i)
}
wrap_plots(plots) +
  plot_layout(axis_titles = "collect") +
  plot_annotation(title = "Conditional predictions of team score",
                  caption = "95% confidence intervals computed using heteroscedastic-robust estimator.")
We can immediately see the result of our heteroscedastic-robust handiwork in the confidence intervals, whose widths vary as a function of our predictors. Stepping away from the statistical nuances for a moment, in terms of AFL performance, we see two key takeaways from this matrix of plots:
- Higher numbers of marks, handballs, hit outs, rebound 50s, inside 50s, free kicks for, contested possessions, and marks inside 50 are associated with higher predicted scores
- Higher numbers of tackles, clangers (unforced errors), and contested marks are associated with lower predicted scores
These overarching relationships are intuitive, but it's nice to see a quantification of magnitude in any case. Interestingly, though, contested possessions have a positive relationship with score, yet contested marks have a negative one. This suggests that the two metrics – despite both being contested in nature – capture different game dynamics.
Going further: marginal effects
So far, we have just looked at predictions. More specifically, conditional predictions – that is, predicted values computed over a grid of values for the focal predictor while the other covariates are held fixed. However, we can go a step further than this. By using marginal effects (i.e., slopes), we can make statements about the rate of change in team score associated with each of our predictors. Basically, a marginal effect is the partial derivative of the regression equation with respect to a predictor of interest¹. In a simple linear model like
\[ Y = \beta_0 + \beta_1 X +\varepsilon , \] the partial derivative of \(Y\) with respect to the predictor \(X\) is:
\[ \frac{\partial Y}{\partial X} = \beta_1 \]
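In our Poisson model, though, the log link means the slope is not a single constant. Writing \(\mathbb{E}[Y] = \mu = \exp(X\beta)\), the partial derivative with respect to a predictor \(X_j\) is

\[ \frac{\partial \mu}{\partial X_j} = \beta_j \exp(X\beta) , \]

which depends on the values of all the covariates. This is why the quantities reported below are average slopes: the slope is evaluated for every row of the observed data and the results are averaged.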
Put simply, a slope \(\frac{\partial Y}{\partial X}\) is the effect of a one-unit change in \(X\) on the predicted value of \(Y\). Computing and averaging these slopes is made very easy for us by the wonderful marginaleffects package again²:
avg_slopes(mod, vcov = "HC3")
## 
##                   Term Estimate Std. Error      z Pr(>|z|)     S    2.5 %   97.5 %
##               clangers  -0.4560     0.0306 -14.89   <0.001 164.2 -0.51606 -0.39602
##        contested_marks  -0.1559     0.0760  -2.05   0.0401   4.6 -0.30484 -0.00703
##  contested_possessions   0.2342     0.0260   9.02   <0.001  62.3  0.18336  0.28510
##              frees_for   0.3504     0.0582   6.02   <0.001  29.1  0.23625  0.46458
##              handballs   0.0226     0.0107   2.10   0.0356   4.8  0.00152  0.04358
##               hit_outs   0.0954     0.0260   3.67   <0.001  12.0  0.04451  0.14631
##             inside_50s   0.8942     0.0426  21.01   <0.001 323.0  0.81072  0.97758
##                  marks   0.0819     0.0168   4.87   <0.001  19.8  0.04893  0.11489
##        marks_inside_50   2.5588     0.0733  34.91   <0.001 884.6  2.41517  2.70249
##             rebound50s   0.1680     0.0446   3.77   <0.001  12.6  0.08058  0.25544
##                tackles  -0.0466     0.0223  -2.09   0.0368   4.8 -0.09026 -0.00286
## 
## Type: response
## Comparison: dY/dX
Now we're talking! Let's interpret some of these results to make sense of them (noting that these quantities are automatically on the scale of interest – i.e., total team score). In the first row, we see that each additional clanger committed by a team is associated with an average change in total score of \(-0.456\) (95% CI: \(-0.516 \text{ to } -0.396\)). However, if we look further down, we see a vastly more important result: each one-unit increase in marks inside 50 is associated with an average change in total score of \(2.559\) (95% CI: \(2.415 \text{ to } 2.702\))! That's huge, and really highlights the importance of controlled possession (i.e., a mark that leads to a set shot on goal) within scoring distance. The importance of a set shot on goal is made even more stark when considering the marginal effect of a standard inside 50, where each one-unit increase is associated with an average change in total score of just \(0.894\) (95% CI: \(0.811 \text{ to } 0.976\)).
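marginaleffects also makes it easy to ask more tailored questions than the average one-unit slope. For example, here is a small sketch (an illustrative extension, not part of the analysis above) estimating the average effect of five additional marks inside 50, again using HC3 standard errors:

# Average effect on predicted team score of +5 marks inside 50,
# with HC3-robust standard errors
avg_comparisons(mod, variables = list(marks_inside_50 = 5), vcov = "HC3")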
Parting thoughts
Thanks for reading another AFL post! Hopefully this one shed some light on some of the lesser quantified relationships in the sport, or, at the very least, inspired you to consider using robust estimators and/or marginaleffects
in your own work.
¹ Technically, a partial derivative should only be applied to a continuous predictor, whereas we have integer values.
² I seriously cannot overstate how amazing this package is, nor its usefulness and importance in science and applied statistical analysis. Everyone should be using it to interpret their statistical models.