What predicts the probability of winning an AFL match?

35 minute read

Published:

Introduction

I love AFL. As a long-time supporter of Richmond through both horrendous and amazing years, I have always been curious about the data side of things. A few years ago, I analysed AFL data in a statistical modelling unit for my PhD, but that focused more on relationships between granular match statistics – like the relationship between clearances and goals conceded and the relationship between the number of contested possessions and total points scored. In this post, I wanted to explore whether historical information could be used to predict the probability of either team winning a given fixture before it starts1. To do so, we are going to make use of the incredible fitzRoy R package which provides streamlined scrapers for AFL data.

Setting up R

Before we get started, here are the packages we’ll need:

library(dplyr)
library(tidyr)
library(ggplot2)
library(janitor)
library(lubridate)
library(fitzRoy)
library(brms)
library(marginaleffects)
library(patchwork)

'%ni%' <- Negate('%in%')

Downloading, wrangling, and preprocessing the data

This section is going to be pretty long as there is a large amount of data cleaning and preparation to do to get all the variables I wanted. Please bear with me!

Fixture results

To start with, we’ll download the fixture results for every match in the 2012 to 2025 seasons. Even though I am going to extract data for the 2020 season, we’ll ignore it for modelling as, due to COVID-19, it was an anomalous season without the regular home-away structure and obviously the broader socioeconomic factors which could lead to it being considered a discontinuous part of the series. The reason I am pulling data for it is so later on I can compute a win rate for the season for each team to use as information for modelling of the subsequent 2021 season. I am also going to remove post-season games (i.e., finals) because the stakes for such games are different and do not necessarily reflect regular season performance or expectations.

# Pull AFL data but ignore 2020 as it was an anomalous season

years <- c(seq(from = 2012, to = 2025, by = 1))
all_seasons <- fetch_results_afltables(years)

# Removes finals as these matches are likely very different to season games and would bias analysis

all_seasons <- all_seasons %>%
  clean_names() %>%
  filter(round %ni% c("EF", "SF", "QF", "PF", "GF"))
  
# Other preprocessing

all_seasons <- all_seasons %>%
  dplyr::select(c(season, round_number, home_team, home_points, away_team, away_points, venue))

all_seasons <- all_seasons %>%
  mutate(venue = as.factor(venue)) %>%
  mutate(home_win = case_when(
          home_points > away_points  ~ "Home win",
          home_points == away_points ~ "Tie",
          home_points < away_points  ~ "Away win"),
         home_win = factor(home_win, levels = c("Home win", "Tie", "Away win"))) %>%
  mutate(home_team = ifelse(home_team == "Footscray", "Western Bulldogs", home_team),
         away_team = ifelse(away_team == "Footscray", "Western Bulldogs", away_team))

Player statistics

With match data in hand, let’s turn our attention to player data. fitzRoy provides easy access to this information through the fetch_player_details function. In some years, certain teams (such as Gold Coast) have different team names in the data (e.g., Gold Coast vs Gold Coast SUNS), so we’ll clean this as well. After pulling the data, I’m going to calculate some team–season averages for numerical variables. As a final step, we’ll rejoin this data to our main dataset, taking care to add it for both home and away teams so that we have each team’s stats for every fixture.

# Pull data for each team and season and clean up numeric columns of interest

players <- fetch_player_details(team = NULL, season = years) %>%
  mutate(team = case_when(
          team == "Gold Coast SUNS"   ~ "Gold Coast",
          team == "West Coast Eagles" ~ "West Coast",
          team == "GWS GIANTS"        ~ "GWS",
          team == "Sydney Swans"      ~ "Sydney",
          team == "Geelong Cats"      ~ "Geelong",
          team == "Adelaide Crows"    ~ "Adelaide",
          TRUE                        ~ team)) %>%
  clean_names() %>%
  mutate(debut_year = as.numeric(debut_year),
         height_in_cm = as.numeric(height_in_cm),
         weight_in_kg = as.numeric(weight_in_kg)) %>%
  mutate(weight_in_kg = ifelse(weight_in_kg == 0, NA, weight_in_kg),
         height_in_cm = ifelse(height_in_cm == 0, NA, height_in_cm))

# Calculate team-season level summaries to use as covariates

players_summary <- players %>%
  reframe(avg_tenure = median(season - debut_year, na.rm = TRUE),
          avg_age = median(as.integer(season - year(date_of_birth)), na.rm = TRUE),
          avg_height = median(height_in_cm, na.rm = TRUE),
          avg_draft_pos = median(as.integer(draft_position), na.rm = TRUE),
          .by = c("team", "season"))

# Make home and away variants

home_players <- players_summary %>%
  rename(home_team = team,
         home_avg_tenure = avg_tenure,
         home_avg_height = avg_height,
         home_avg_age = avg_age,
         home_avg_draft_pos = avg_draft_pos)

away_players <- players_summary %>%
  rename(away_team = team,
         away_avg_tenure = avg_tenure,
         away_avg_height = avg_height,
         away_avg_age = avg_age,
         away_avg_draft_pos = avg_draft_pos)

# Join player data to results

df <- all_seasons %>%
  inner_join(home_players, by = c("season" = "season", "home_team" = "home_team")) %>%
  inner_join(away_players, by = c("season" = "season", "away_team" = "away_team")) %>%
  mutate(home_team = as.factor(home_team),
         away_team = as.factor(away_team),
         season = as.integer(season),
         round_number = as.integer(round_number))

df2 <- df %>%
  filter(home_win != "Tie") %>%
  mutate(home_win = as.factor(as.character(home_win))) # Not enough samples to warrant this level

Cumulative team performance throughout each season

Now that we have fixture and player data, we can calculate some performance statistics. We’ll start with cumulative win rates for each team throughout each season. For the first match of each season, we’ll set the historical win rate to chance (i.e., \(50\%\))2.

combns <- crossing(team = unique(df2$home_team), 
                   season = unique(df2$season)[!unique(df2$season) %in% c(2020)])

win_rate <- vector(mode = "list", length = nrow(combns))

for(i in 1:nrow(combns)){
  
  tmp <- combns[i, ]
  
  tmp_home <- df2 %>%
    inner_join(tmp, by = c("home_team" = "team", "season" = "season")) %>%
    mutate(did_win = ifelse(home_win == "Home win", 1, 0)) %>%
    dplyr::select(c(round_number, did_win))
  
  tmp_away <- df2 %>%
    inner_join(tmp, by = c("away_team" = "team", "season" = "season")) %>%
    mutate(did_win = ifelse(home_win == "Away win", 1, 0)) %>%
    dplyr::select(c(round_number, did_win))
  
  tmp_both <- bind_rows(tmp_home, tmp_away) %>%
    arrange(round_number) %>%
    mutate(
      cum_wins = lag(cumsum(did_win), default = 0),
      cum_games = lag(row_number(), default = 0),
      prop_won = ifelse(round_number == 1 | (round_number == 2 & min(round_number) == 2) |
                        (round_number == 3 & min(round_number) == 3), 
                        0.5, cum_wins / cum_games)
    ) %>%
    dplyr::select(round_number, prop_won) %>%
    mutate(team = unique(tmp$team),
           season = unique(tmp$season))

  win_rate[[i]] <- tmp_both
}

win_rate <- do.call("rbind", win_rate)
rm(i, tmp, tmp_home, tmp_away, tmp_both)

# Join back to main data

df2 <- df2 %>%
  inner_join(win_rate, by = c("home_team" = "team", "season" = "season", "round_number" = "round_number")) %>%
  rename(home_team_win_rate = prop_won) %>%
  inner_join(win_rate, by = c("away_team" = "team", "season" = "season", "round_number" = "round_number")) %>%
  rename(away_team_win_rate = prop_won)

Previous season performance per team

In addition to the cumulative within-season win rate, the win rate for the previous season is also likely a decent predictor of a given season’s performance. For each team and season, we’ll compute these summaries as well:

win_rate_season <- vector(mode = "list", length = length(unique(df2$home_team)))

for(i in unique(df2$home_team)){
  tmp_home <- df2 %>%
    filter(home_team == i) %>%
    mutate(did_win = ifelse(home_win == "Home win", 1, 0)) %>%
    dplyr::select(c(season, round_number, did_win))
  
  tmp_away <- df2 %>%
    filter(away_team == i) %>%
    mutate(did_win = ifelse(home_win == "Away win", 1, 0)) %>%
    dplyr::select(c(season, round_number, did_win))
  
  tmp_both <- bind_rows(tmp_home, tmp_away) %>%
    reframe(total_wins = sum(did_win),
            total_games = length(unique(round_number)),
            .by = "season") %>%
    arrange(season) %>%
    group_by(season) %>%
    mutate(prop_won = total_wins / total_games) %>%
    ungroup() %>%
    dplyr::select(c(season, prop_won)) %>%
    mutate(team = i) %>%
    mutate(prop_won_prior_season = lag(prop_won)) %>%
    dplyr::select(-c(prop_won))
  
  win_rate_season[[match(i, unique(df2$home_team))]] <- tmp_both
}
  
win_rate_season <- do.call("rbind", win_rate_season)
rm(i, tmp_home, tmp_away, tmp_both)

# Join back to main data

df2 <- df2 %>%
  inner_join(win_rate_season, by = c("home_team" = "team", "season" = "season")) %>%
  rename(home_team_win_rate_prior_season = prop_won_prior_season) %>%
  inner_join(win_rate_season, by = c("away_team" = "team", "season" = "season")) %>%
  rename(away_team_win_rate_season_prior_season = prop_won_prior_season) %>%
  filter(season != "2012") # Remove 2012 as no preceding year of data

Match-level statistics

fitzRoy is such a good package that it also allows us to pull player-level data for every match, down to the number of handballs made or disposal efficiency, etc.. Based on my prior modelling of predicting scores with this data, we are going to pull out a few key statistics to use as indicators of team ‘form’:

  • Tackles
  • Clangers (unforced errors)
  • Inside 50s
  • Marks inside 50

My previous analysis demonstrated that tackles and clangers were strongly negatively associated with final score for a team, while inside 50s and marks inside 50 were very strongly and positively associated with final team score. This makes sense – if teams are tackling more, they are likely not controlling possession and steering the game, and a larger number of unforced errors of course leads to turnovers and generally indicates poor performance. Meanwhile, lots of time inside the opposition 50 means a team is spending more time in scoring opportunities. My previous analysis did not investigate disposal efficiency, but I am also going to include it here because I think it might be useful3. We are going to operationalise these features such that the zeroes are set for the first round of each season, but subsequent matches use information from the preceding match to indicate form beyond just a win rate.

# Pull and aggregate data

combns2 <- crossing(season = unique(df2$season), round_number = unique(df2$round_number))
match_stats <- vector(mode = "list", length = nrow(combns2))

for(i in 1:nrow(combns2)){
  
  # Pull data
  
  tmp <- combns2[i, ]
  
  match_data <- fetch_player_stats(season = unique(tmp$season), round_number = unique(tmp$round_number)) %>%
    clean_names()
  
  # Summarise
  
  match_summary <- match_data %>%
    reframe(total_tackles = sum(tackles),
            total_clangers = sum(clangers),
            total_inside50s = sum(inside50s),
            total_marksinside50 = sum(marks_inside50),
            avg_disposal_efficiency = mean(disposal_efficiency, na.rm = TRUE),
            .by = c("home_team_name", "away_team_name", "team_name")) %>%
    mutate(season = unique(tmp$season),
           round_number = unique(tmp$round_number))
  
  match_stats[[i]] <- match_summary
}

match_stats <- match_stats[!sapply(match_stats, is.null)]
match_stats <- do.call("rbind", match_stats)
# Clean team names

match_stats <- match_stats %>%
  mutate(team_name = case_when(
    team_name == "Gold Coast SUNS"   ~ "Gold Coast",
    team_name == "West Coast Eagles" ~ "West Coast",
    team_name == "GWS GIANTS"        ~ "GWS",
    team_name == "Sydney Swans"      ~ "Sydney",
    team_name == "Geelong Cats"      ~ "Geelong",
    team_name == "Adelaide Crows"    ~ "Adelaide",
    TRUE                             ~ team_name)) %>%
  mutate(home_team_name = case_when(
    home_team_name == "Gold Coast SUNS"   ~ "Gold Coast",
    home_team_name == "West Coast Eagles" ~ "West Coast",
    home_team_name == "GWS GIANTS"        ~ "GWS",
    home_team_name == "Sydney Swans"      ~ "Sydney",
    home_team_name == "Geelong Cats"      ~ "Geelong",
    home_team_name == "Adelaide Crows"    ~ "Adelaide",
    TRUE                                  ~ home_team_name)) %>%
  mutate(away_team_name = case_when(
    away_team_name == "Gold Coast SUNS"   ~ "Gold Coast",
    away_team_name == "West Coast Eagles" ~ "West Coast",
    away_team_name == "GWS GIANTS"        ~ "GWS",
    away_team_name == "Sydney Swans"      ~ "Sydney",
    away_team_name == "Geelong Cats"      ~ "Geelong",
    away_team_name == "Adelaide Crows"    ~ "Adelaide",
    TRUE                                  ~ away_team_name)) %>%
  dplyr::select(-c(home_team_name, away_team_name))

# Calculate lagged versions for historical prediction

combns3 <- crossing(team = unique(df2$home_team), season = unique(df2$season))
lagged_matched_stats <- vector(mode = "list", length = nrow(combns3))

for(i in 1:nrow(combns3)){
  
  tmp <- combns3[i, ]
  
  tmp_stats <- match_stats %>%
    inner_join(tmp, by = c("team_name" = "team", "season" = "season")) %>%
    arrange(round_number) %>%
    mutate(
      total_tackles = lag(total_tackles, default = 0),
      total_clangers = lag(total_clangers, default = 0),
      total_inside50s = lag(total_inside50s, default = 0),
      total_marksinside50 = lag(total_marksinside50, default = 0),
      avg_disposal_efficiency = lag(avg_disposal_efficiency, default = 0))
  
  lagged_matched_stats[[i]] <- tmp_stats
}

lagged_matched_stats <- do.call("rbind", lagged_matched_stats)
rm(i)

# Assign home and away labels to the data

df2 <- df2 %>%
  inner_join(lagged_matched_stats, by = c("home_team" = "team_name", "season" = "season", 
                                          "round_number" = "round_number")) %>%
  rename(home_team_total_tackles = total_tackles,
         home_team_total_clangers = total_clangers,
         home_team_total_inside50s = total_inside50s,
         home_team_total_marksinside50 = total_marksinside50,
         home_team_avg_disposal_efficiency = avg_disposal_efficiency) %>%
  inner_join(lagged_matched_stats, by = c("away_team" = "team_name", "season" = "season", 
                                          "round_number" = "round_number")) %>%
  rename(away_team_total_tackles = total_tackles,
         away_team_total_clangers = total_clangers,
         away_team_total_inside50s = total_inside50s,
         away_team_total_marksinside50 = total_marksinside50,
         away_team_avg_disposal_efficiency = avg_disposal_efficiency)

Home and away team travel

We are on to the final lot of data preparation! While coding all of this, I had the idea that maybe travel plays a role in the probability of a team winning. Intuitively, you might assume that only the away team travels, but in fact, there are many matches where both teams have to travel (such as the games played in New Zealand and China) – thus making the ‘home’ and ‘away’ labels a little less implicit in terms of the well-known sporting phenomemon that is the ‘home team advantage’. To account for this, I’m just creating a lookup of home states for each team and the location for each venue. From this information we can then ascertain if each of the home and away teams travelled for any given fixture.

Note that I’m not considering within-state games as representing travel, even though this is technically correct. For example, the Brisbane Lions would have to travel to the Gold Coast to play the Suns, but this is just a one-hour drive and unlikely to impede performance. In contrast, if the Brisbane Lions were playing Richmond in an away game, their travel would instead involve flights and other logistical considerations which could arguably affect performance.

home_state <- df2 %>%
  dplyr::select(c(home_team)) %>%
  distinct() %>%
  mutate(
    home_state = case_when(
          home_team == "Fremantle"        ~ "WA",
          home_team == "Carlton"          ~ "VIC",
          home_team == "Western Bulldogs" ~ "VIC",
          home_team == "GWS"              ~ "NSW",
          home_team == "Gold Coast"       ~ "QLD",
          home_team == "Melbourne"        ~ "VIC",
          home_team == "North Melbourne"  ~ "VIC",
          home_team == "Hawthorn"         ~ "VIC",
          home_team == "St Kilda"         ~ "VIC",
          home_team == "Sydney"           ~ "NSW",
          home_team == "Brisbane Lions"   ~ "QLD",
          home_team == "Port Adelaide"    ~ "SA",
          home_team == "Geelong"          ~ "VIC",
          home_team == "Collingwood"      ~ "VIC",
          home_team == "West Coast"       ~ "WA",
          home_team == "Richmond"         ~ "VIC",
          home_team == "Adelaide"         ~ "SA",
          home_team == "Essendon"         ~ "VIC"
    )
  )

venue_state <- df2 %>%
  dplyr::select(c(venue)) %>%
  distinct() %>%
  mutate(
    venue_state = case_when(
      venue == "Subiaco"            ~ "WA",
      venue == "M.C.G."             ~ "VIC",
      venue == "Docklands"          ~ "VIC",
      venue == "Stadium Australia"  ~ "NSW",
      venue == "Carrara"            ~ "QLD",
      venue == "S.C.G."             ~ "NSW",
      venue == "Gabba"              ~ "QLD",
      venue == "Football Park"      ~ "SA",
      venue == "Bellerive Oval"     ~ "TAS",
      venue == "Manuka Oval"        ~ "ACT",
      venue == "York Park"          ~ "TAS",
      venue == "Wellington"         ~ "NZ",
      venue == "Sydney Showground"  ~ "NSW",
      venue == "Marrara Oval"       ~ "NT",
      venue == "Kardinia Park"      ~ "VIC",
      venue == "Cazaly's Stadium"   ~ "QLD",
      venue == "Adelaide Oval"      ~ "SA",
      venue == "Traeger Park"       ~ "NT",
      venue == "Jiangwan Stadium"   ~ "China",
      venue == "Eureka Stadium"     ~ "VIC",
      venue == "Perth Stadium"      ~ "WA",
      venue == "Riverway Stadium"   ~ "QLD",
      venue == "Norwood Oval"       ~ "SA",
      venue == "Summit Sports Park" ~ "SA",
      venue == "Barossa Oval"       ~ "SA",
      venue == "Hands Oval"         ~ "WA",
      )
  )

# Join to main data

df2 <- df2 %>%
  inner_join(home_state, by = c("home_team" = "home_team")) %>%
  rename(home_team_state = home_state) %>%
  inner_join(home_state, by = c("away_team" = "home_team")) %>%
  rename(away_team_state = home_state) %>%
  inner_join(venue_state, by = c("venue" = "venue")) %>%
  mutate(home_team_travelled = ifelse(home_team_state == venue_state, 0, 1),
         away_team_travelled = ifelse(away_team_state == venue_state, 0, 1)) %>%
  mutate(round_number = as.factor(round_number)) # Decided to treat it this way late!

Fitting a statistical model

Phew! That was a lot. Now we can turn our attention to the cool stuff. For this article, I am going to treat the 2025 season as our test set, so we can easily create a train-test split and \(z\)-score the average age variable to make the model algorithm converge faster as large numbers are a bit of a pain in Stan. Note that we are taking care to rescale both sets on the training set’s sample mean and standard deviation to avoid any information from the test set from bleeding into the training process. Note that there are alternative ways to do this using existing R packages – I just thought this was explicit for the purpose of a blog post.

train <- df2 %>%
  filter(season != "2025")

#' Rescale the set of necessary variables for an AFL model
#' 
#' @param data \code{data.frame} to operate on
#' @return \code{data.frame} of rescaled data
#' @author Trent Henderson
#' 

rescale_afl <- function(data){
  tmp <- data %>%
    dplyr::select(-c(home_avg_tenure, away_avg_tenure, home_avg_draft_pos, away_avg_draft_pos)) %>%
    mutate(home_avg_age_z = (home_avg_age - mean(train$home_avg_age)) / sd(train$home_avg_age),
           away_avg_age_z = (away_avg_age - mean(train$away_avg_age)) / sd(train$away_avg_age),
           home_team_total_tackles_z = (home_team_total_tackles - mean(train$home_team_total_tackles)) /
             sd(train$home_team_total_tackles),
           away_team_total_tackles_z = (away_team_total_tackles - mean(train$away_team_total_tackles)) /
             sd(train$away_team_total_tackles),
           home_team_total_clangers_z = (home_team_total_clangers - mean(train$home_team_total_clangers)) /
             sd(train$home_team_total_clangers),
           away_team_total_clangers_z = (away_team_total_clangers - mean(train$away_team_total_clangers)) /
             sd(train$away_team_total_clangers),
           home_team_total_inside50s_z = (home_team_total_inside50s - mean(train$home_team_total_inside50s)) /
             sd(train$home_team_total_inside50s),
           away_team_total_inside50s_z = (away_team_total_inside50s - mean(train$away_team_total_inside50s)) /
             sd(train$away_team_total_inside50s),
           home_team_total_marksinside50_z = (home_team_total_marksinside50 - mean(train$home_team_total_marksinside50)) /
             sd(train$home_team_total_marksinside50),
           away_team_total_marksinside50_z = (away_team_total_marksinside50 - mean(train$away_team_total_marksinside50)) /
             sd(train$away_team_total_marksinside50))
  
  return(tmp)
}

test <- df2 %>%
  filter(season == "2025") %>%
  rescale_afl()

train <- train %>%
  rescale_afl()

Moving right on to specifying and training our statistical model. We are going to fit a Bayesian logistic regression model using the amazing brms package which calls the C++ probabilistic programming language Stan under the hood. I was tempted to write this model from scratch in Stan, but this post is already long enough so the brms interface will make things easier for most readers to understand. Adopting a Bayesian approach to statistical inference has many, many benefits, but in our case there are two standout reasons:

  1. We can incorporate prior information – All of that previous analysis I did comes in handy! I have pretty decent beliefs about the types of relationships I am expecting to see here, and so a Bayesian framework enables me to encode that prior information into the statistical model.
  2. We can update our model as new data (i.e., matches in a future season) becomes available – this is a crucial selling point for Bayesian methods applied to real-world dynamic data analysis. In so many applications data is continuously added, and frequentist models have no way to properly incorporate this. In a Bayesian framework, our model acts as our priors for the modelling of the new data, and we can just continuously update our beliefs about the relationships between variables as we accumulate more and more data. You can imagine this being exceptionally useful mid season compared to, say, the first round of a season. In brms, this process is made easy by way of the update() function

As denoted by my comments in the code below, the model has a few key parts:

  1. Formula – model formula denoting the response variable and predictors. Note that for all numerical predictors we are specifying smooth functions to ensure we can model non-linearities between them and win probability. This is effectively like specifying a generalised additive model (GAM). Note that I also tried specifying a Gaussian process term over the season variable, which would have had the benefit of enabling correlated win probabilities across seasons, where more recent seasons are weighted more heavily than older ones, which makes a lot of intuitive sense. However, I ran into a lot of diagnostic issues with this complexity, unfortunately4.
  2. Priors – our expectations and beliefs before seeing any data. The ability to specify priors is essentially the whole point of adopting a Bayesian approach to statistical inference instead of a frequentist one. In this case, we are specifying prior probability distributions for the intercept (i.e., probability of home team winning when all covariates are zero) and the coefficients. I’m expecting small values for both but with some degree of variability given my past experience modelling logistic regression problems, and so I’ve reflected that in my priors.
  3. Data and family – just the data frame containing the training data and the likelihood family.
  4. MCMC controls – to help the Markov chain Monte Carlo algorithm used by Stan to sample from the posterior effectively. I’m taking care to specify a large number of iterations (\(10000\)), four chains running in parallel across separate computing cores, a fixed seed for reproducibility, an increase in the tree depth searched by the algorithm, and an increase in the target average acceptance probability (adapt_delta). These last two options help reduce the presence of problematic divergences in the sampling process which can greatly affect statistical inference.
mod <- brm(
  
  # Formula
  
  bf(home_win ~ home_team + away_team + s(home_avg_age_z) + s(away_avg_age_z) + 
       s(home_team_win_rate) + s(away_team_win_rate) +
       s(home_team_win_rate_prior_season) + s(away_team_win_rate_season_prior_season) + 
       s(home_team_total_tackles_z) + s(away_team_total_tackles_z) +
       s(home_team_total_clangers_z) + s(away_team_total_clangers_z) +
       s(home_team_total_inside50s_z) + s(away_team_total_inside50s_z) +
       s(home_team_total_marksinside50_z) + s(away_team_total_marksinside50_z) +
       s(home_team_avg_disposal_efficiency) + s(away_team_avg_disposal_efficiency) +
       round_number + season + home_team_travelled + away_team_travelled,
     decomp = "QR"),
  
  # Priors
  
  prior = c(set_prior("normal(-10, 5)", class = "Intercept"), # Expecting small intercept once exponentiated
            set_prior("normal(0, 2)", class = "b")), # Expecting likely small coefficients but with some larger ones
  
  # Data and family
  
  data = train, family = bernoulli,
  
  # MCMC controls
  
  iter = 1e4, chains = 4, cores = 4, seed = 123,
  control = list(max_treedepth = 15, adapt_delta = 0.99)
)

Checking model diagnostics

We can now perform some basic checks to feel confident that there is no pathological behaviour impacting our model. Here is a posterior predictive check, which plots the probability density of the actual data’s response variable (solid blue line) against a bunch of draws from our model’s posterior (light blue lines). We ideally want to see the model draws bounded around the actual data, which indicates that we have captured the broad shape but obviously with some uncertainty around it. We see that this is indeed the case for our model.

pp_check(mod, ndraws = 2e3) + theme_bw()

We can also inspect the posterior distributions and trace plots for each of the model parameters. We want the trace plots to look like white noise – any observable patterns indicates potential pathology with the model. Here are the coefficients for ther home team factor levels, as an example:

plot(mod, variable = "^b_home_team", regex = TRUE)

We can also inspect diagnostics for each parameter, such as the \(\hat{R}\) value, which, if no pathology is present, should be \(\hat{R}=1.0\) for each parameter. This information is returned automatically when calling summary() on a brmsfit object:

summary(mod)
##  Family: bernoulli 
##   Links: mu = logit 
## Formula: home_win ~ home_team + away_team + s(home_avg_age_z) + s(away_avg_age_z) + s(home_team_win_rate) + s(away_team_win_rate) + s(home_team_win_rate_prior_season) + s(away_team_win_rate_season_prior_season) + s(home_team_total_tackles_z) + s(away_team_total_tackles_z) + s(home_team_total_clangers_z) + s(away_team_total_clangers_z) + s(home_team_total_inside50s_z) + s(away_team_total_inside50s_z) + s(home_team_total_marksinside50_z) + s(away_team_total_marksinside50_z) + s(home_team_avg_disposal_efficiency) + s(away_team_avg_disposal_efficiency) + round_number + season + home_team_travelled + away_team_travelled 
##    Data: train (Number of observations: 2146) 
##   Draws: 4 chains, each with iter = 10000; warmup = 5000; thin = 1;
##          total post-warmup draws = 20000
## 
## Smoothing Spline Hyperparameters:
##                                                Estimate Est.Error l-95% CI
## sds(shome_avg_age_z_1)                             0.41      0.42     0.01
## sds(saway_avg_age_z_1)                             0.53      0.46     0.03
## sds(shome_team_win_rate_1)                         1.44      0.78     0.37
## sds(saway_team_win_rate_1)                         1.14      0.79     0.11
## sds(shome_team_win_rate_prior_season_1)            1.15      0.91     0.04
## sds(saway_team_win_rate_season_prior_season_1)     0.99      0.75     0.08
## sds(shome_team_total_tackles_z_1)                  0.77      0.74     0.03
## sds(saway_team_total_tackles_z_1)                  0.58      0.58     0.02
## sds(shome_team_total_clangers_z_1)                 0.59      0.58     0.02
## sds(saway_team_total_clangers_z_1)                 0.70      0.67     0.02
## sds(shome_team_total_inside50s_z_1)                0.63      0.59     0.02
## sds(saway_team_total_inside50s_z_1)                1.97      1.35     0.10
## sds(shome_team_total_marksinside50_z_1)            0.61      0.59     0.02
## sds(saway_team_total_marksinside50_z_1)            0.56      0.55     0.02
## sds(shome_team_avg_disposal_efficiency_1)          0.93      0.90     0.03
## sds(saway_team_avg_disposal_efficiency_1)          0.83      0.81     0.03
##                                                u-95% CI Rhat Bulk_ESS Tail_ESS
## sds(shome_avg_age_z_1)                             1.54 1.00     7381    12154
## sds(saway_avg_age_z_1)                             1.75 1.00     8260     8471
## sds(shome_team_win_rate_1)                         3.39 1.00    12737     9743
## sds(saway_team_win_rate_1)                         3.16 1.00     9450     8276
## sds(shome_team_win_rate_prior_season_1)            3.27 1.00     5830    11453
## sds(saway_team_win_rate_season_prior_season_1)     2.90 1.00     8182     8275
## sds(shome_team_total_tackles_z_1)                  2.77 1.00    10959    11044
## sds(saway_team_total_tackles_z_1)                  2.16 1.00    12034    10369
## sds(shome_team_total_clangers_z_1)                 2.14 1.00    12971    12273
## sds(saway_team_total_clangers_z_1)                 2.50 1.00    13126    11757
## sds(shome_team_total_inside50s_z_1)                2.19 1.00    12669    11870
## sds(saway_team_total_inside50s_z_1)                5.18 1.00     6102     8428
## sds(shome_team_total_marksinside50_z_1)            2.21 1.00    11682    12464
## sds(saway_team_total_marksinside50_z_1)            2.04 1.00    13605    11563
## sds(shome_team_avg_disposal_efficiency_1)          3.32 1.00    10690    11845
## sds(saway_team_avg_disposal_efficiency_1)          3.03 1.00    13480    11975
## 
## Regression Coefficients:
##                                           Estimate Est.Error l-95% CI u-95% CI
## Intercept                                   -13.84     41.70   -94.85    68.32
## home_teamBrisbaneLions                        0.21      0.32    -0.40     0.83
## home_teamCarlton                             -0.32      0.31    -0.94     0.28
## home_teamCollingwood                         -0.43      0.30    -1.02     0.17
## home_teamEssendon                             0.07      0.31    -0.55     0.68
## home_teamFremantle                            0.13      0.30    -0.46     0.73
## home_teamGeelong                              0.76      0.35     0.08     1.47
## home_teamGoldCoast                           -0.05      0.31    -0.67     0.56
## home_teamGWS                                  0.09      0.32    -0.55     0.72
## home_teamHawthorn                             0.55      0.33    -0.10     1.20
## home_teamMelbourne                           -0.27      0.32    -0.89     0.35
## home_teamNorthMelbourne                      -0.07      0.32    -0.71     0.56
## home_teamPortAdelaide                         0.14      0.31    -0.47     0.75
## home_teamRichmond                             0.24      0.32    -0.40     0.87
## home_teamStKilda                             -0.30      0.32    -0.92     0.31
## home_teamSydney                               0.00      0.32    -0.63     0.63
## home_teamWestCoast                           -0.11      0.31    -0.71     0.50
## home_teamWesternBulldogs                     -0.05      0.31    -0.65     0.56
## away_teamBrisbaneLions                       -0.24      0.32    -0.87     0.40
## away_teamCarlton                              0.11      0.32    -0.52     0.75
## away_teamCollingwood                         -0.64      0.30    -1.24    -0.06
## away_teamEssendon                             0.05      0.31    -0.56     0.65
## away_teamFremantle                           -0.14      0.31    -0.74     0.46
## away_teamGeelong                             -0.42      0.31    -1.04     0.19
## away_teamGoldCoast                            0.79      0.35     0.11     1.48
## away_teamGWS                                 -0.32      0.31    -0.93     0.30
## away_teamHawthorn                            -0.25      0.31    -0.87     0.36
## away_teamMelbourne                           -0.44      0.31    -1.06     0.17
## away_teamNorthMelbourne                       0.23      0.32    -0.40     0.86
## away_teamPortAdelaide                        -0.30      0.30    -0.88     0.29
## away_teamRichmond                            -0.12      0.31    -0.74     0.49
## away_teamStKilda                              0.02      0.31    -0.60     0.63
## away_teamSydney                              -0.99      0.31    -1.60    -0.38
## away_teamWestCoast                           -0.15      0.31    -0.75     0.45
## away_teamWesternBulldogs                     -0.28      0.31    -0.89     0.34
## round_number2                                 1.18      3.23    -4.99     7.94
## round_number3                                 1.51      3.23    -4.65     8.25
## round_number4                                 0.92      3.22    -5.26     7.65
## round_number5                                 0.71      3.22    -5.47     7.43
## round_number6                                 0.67      3.22    -5.51     7.37
## round_number7                                 0.80      3.23    -5.39     7.55
## round_number8                                 1.12      3.22    -5.04     7.79
## round_number9                                 1.13      3.22    -5.05     7.85
## round_number10                                1.26      3.22    -4.90     7.98
## round_number11                                1.35      3.22    -4.81     8.04
## round_number12                                1.19      3.22    -5.02     7.88
## round_number13                                0.58      3.22    -5.61     7.27
## round_number14                                1.39      3.22    -4.76     8.11
## round_number15                                1.30      3.21    -4.85     7.96
## round_number16                                0.74      3.22    -5.44     7.46
## round_number17                                0.57      3.22    -5.59     7.23
## round_number18                                1.20      3.22    -4.97     7.90
## round_number19                                1.21      3.21    -4.95     7.89
## round_number20                                1.32      3.22    -4.82     8.02
## round_number21                                0.69      3.22    -5.50     7.39
## round_number22                                1.33      3.22    -4.84     7.98
## round_number23                                0.64      3.22    -5.50     7.33
## round_number24                                0.75      3.26    -5.48     7.55
## round_number25                               75.32     55.85     3.20   205.14
## season                                        0.01      0.02    -0.03     0.05
## home_team_travelled                          -0.06      0.21    -0.47     0.36
## away_team_travelled                           0.24      0.15    -0.04     0.52
## shome_avg_age_z_1                             0.38      0.81    -1.12     2.36
## saway_avg_age_z_1                            -0.40      0.93    -2.37     1.50
## shome_team_win_rate_1                         0.01      1.62    -3.23     3.05
## saway_team_win_rate_1                        -0.45      1.44    -2.94     2.62
## shome_team_win_rate_prior_season_1            1.87      1.45    -1.33     4.70
## saway_team_win_rate_season_prior_season_1    -0.63      1.43    -3.18     2.50
## shome_team_total_tackles_z_1                 -0.30      1.30    -3.03     2.36
## saway_team_total_tackles_z_1                  0.31      1.19    -2.20     2.77
## shome_team_total_clangers_z_1                 0.50      1.28    -2.26     3.00
## saway_team_total_clangers_z_1                 0.04      1.35    -2.79     2.69
## shome_team_total_inside50s_z_1                1.22      1.38    -1.89     3.63
## saway_team_total_inside50s_z_1               -1.03      1.88    -4.46     2.90
## shome_team_total_marksinside50_z_1            0.60      1.22    -2.14     2.90
## saway_team_total_marksinside50_z_1           -0.03      1.18    -2.59     2.27
## shome_team_avg_disposal_efficiency_1          0.01      1.68    -3.35     3.27
## saway_team_avg_disposal_efficiency_1         -0.45      1.68    -3.67     2.96
##                                           Rhat Bulk_ESS Tail_ESS
## Intercept                                 1.00    31157    14059
## home_teamBrisbaneLions                    1.00    37037    16771
## home_teamCarlton                          1.00    34928    17581
## home_teamCollingwood                      1.00    37046    16750
## home_teamEssendon                         1.00    32119    16365
## home_teamFremantle                        1.00    35226    16406
## home_teamGeelong                          1.00    35186    14932
## home_teamGoldCoast                        1.00    35851    17074
## home_teamGWS                              1.00    36726    17435
## home_teamHawthorn                         1.00    34568    16152
## home_teamMelbourne                        1.00    36402    16253
## home_teamNorthMelbourne                   1.00    35046    16862
## home_teamPortAdelaide                     1.00    37031    15391
## home_teamRichmond                         1.00    36638    16519
## home_teamStKilda                          1.00    35400    16276
## home_teamSydney                           1.00    35723    15711
## home_teamWestCoast                        1.00    33724    15532
## home_teamWesternBulldogs                  1.00    23626    17185
## away_teamBrisbaneLions                    1.00    41635    14987
## away_teamCarlton                          1.00    44426    14140
## away_teamCollingwood                      1.00    41872    14156
## away_teamEssendon                         1.00    44263    14782
## away_teamFremantle                        1.00    44465    15509
## away_teamGeelong                          1.00    41447    14690
## away_teamGoldCoast                        1.00    40659    15321
## away_teamGWS                              1.00    26287    17301
## away_teamHawthorn                         1.00    39213    14898
## away_teamMelbourne                        1.00    38396    15294
## away_teamNorthMelbourne                   1.00    42493    14716
## away_teamPortAdelaide                     1.00    42949    15094
## away_teamRichmond                         1.00    43180    14276
## away_teamStKilda                          1.00    42869    14946
## away_teamSydney                           1.00    38827    14631
## away_teamWestCoast                        1.00    41659    13621
## away_teamWesternBulldogs                  1.00    41740    14546
## round_number2                             1.00     2180     3862
## round_number3                             1.00     2180     3741
## round_number4                             1.00     2181     3825
## round_number5                             1.00     2185     3820
## round_number6                             1.00     2188     3803
## round_number7                             1.00     2187     3796
## round_number8                             1.00     2192     3820
## round_number9                             1.00     2191     3806
## round_number10                            1.00     2193     3818
## round_number11                            1.00     2192     3841
## round_number12                            1.00     2198     3808
## round_number13                            1.00     2195     3826
## round_number14                            1.00     2201     3905
## round_number15                            1.00     2194     3864
## round_number16                            1.00     2196     3832
## round_number17                            1.00     2197     3866
## round_number18                            1.00     2199     3927
## round_number19                            1.00     2198     3917
## round_number20                            1.00     2211     3861
## round_number21                            1.00     2205     3855
## round_number22                            1.00     2210     3875
## round_number23                            1.00     2204     3923
## round_number24                            1.00     2253     3930
## round_number25                            1.00     5588     6885
## season                                    1.00    37428    14729
## home_team_travelled                       1.00    43000    13943
## away_team_travelled                       1.00    42000    14160
## shome_avg_age_z_1                         1.00    14663    11499
## saway_avg_age_z_1                         1.00    17442    13445
## shome_team_win_rate_1                     1.00    20138    11197
## saway_team_win_rate_1                     1.00    17432    15317
## shome_team_win_rate_prior_season_1        1.00    19157    13742
## saway_team_win_rate_season_prior_season_1 1.00    17096    15823
## shome_team_total_tackles_z_1              1.00    18217    14462
## saway_team_total_tackles_z_1              1.00    19290    14713
## shome_team_total_clangers_z_1             1.00    18468    14377
## saway_team_total_clangers_z_1             1.00    20496    15243
## shome_team_total_inside50s_z_1            1.00    16867    13568
## saway_team_total_inside50s_z_1            1.00    15835    15675
## shome_team_total_marksinside50_z_1        1.00    18935    13838
## saway_team_total_marksinside50_z_1        1.00    20030    14888
## shome_team_avg_disposal_efficiency_1      1.00    16265    14468
## saway_team_avg_disposal_efficiency_1      1.00    14126    15121
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

All the \(\hat{R}\) values look good! Other important information returned by summary() include the effective sample sizes. We want these to be high – especially in the tail – to indicate appropriate sampling across the distribution.

Note that there are many other checks one could and should do of a Bayesian model. For the sake of brevity, I’ll move along here, though. See this paper and this paper for more information.

Interpreting covariates

Interpreting statistical models is hard! Interpretation is made even more challenging when we have non-Gaussian response variables, numerous predictors, and smooth functions. Basically, people should stop reporting and interpreting model coefficients and tests and start using modern tools that are far, far more fit for this purpose, such as marginal effects and Parametric g-Computation. Not only do these tools account for the various factors at play in a model that a simple parameter cannot, but they are also interpreted on the scale of interest rather than the scale of whatever link function was used in the model. For our purposes, we are going to plot the predicted values (with the associated \(95\%\) credible interval) for each value of our covariates of interest to ascertain the nature of the relationship between them and win probability. All of this is made possible by the amazing marginaleffects R package which is basically my web browser homepage at this point.

Below I have written a function that does this calculation for a covariate of interest and draws and formats a nice plot. The matrix of plots for all covariates is then pieced together so we can inspect everything in one place.

#' Function to plot predicted home team win rate conditioned on a predictor variable
#' 
#' @param model \code{brmsfit} object containing the model
#' @param condition \code{character} denoting the variable to plot
#' @return \code{ggplot} object containing the plot
#' @author Trent Henderson
#'

plot_afl_smooth <- function(data, condition){
  p <- plot_predictions(mod, condition = condition) + 
    geom_hline(aes(yintercept = 0.5), linetype = "dashed", linewidth = 0.6) +
    labs(subtitle = gsub("_", " ", condition),
         x = gsub("_", " ", condition),
         y = "Home team win probability") + 
    theme_bw()
  
  if(condition %in% c("home_team", "away_team")){
    p <- p +
      theme(axis.text.x = element_text(angle = 90))
  }
  
  return(p)
}

plots <- list()

vars <- c("home_avg_age_z", "away_avg_age_z", 
          "home_team_win_rate", "away_team_win_rate", 
          "home_team_win_rate_prior_season", "away_team_win_rate_season_prior_season",
          "home_team_total_tackles_z", "away_team_total_tackles_z",
          "home_team_total_clangers_z", "away_team_total_clangers_z",
          "home_team_total_inside50s_z", "away_team_total_inside50s_z",
          "home_team_total_marksinside50_z", "away_team_total_marksinside50_z",
          "home_team_avg_disposal_efficiency", "away_team_avg_disposal_efficiency",
          "home_team_travelled", "away_team_travelled",
          "home_team", "away_team")

for(i in vars){
  p <- plot_afl_smooth(data = mod, condition = i)
  plots[[match(i, vars)]] <- p
}

wrap_plots(plots)

We see that the decision to model continuous covariates as smooth functions was a good decision, given the non-linear shape of the relationships. Understandably given the data complexity and the stochastic nature of professional sports matches, the credible intervals are also quite wide for all the predicted values.

However, we do see some interesting things, such as the poor home team win probability (i.e., well below chance probability of \(50\%\)) for when the home team’s cumulative season win rate is below \(\approx 30\%\). Interestingly, travel requirements do not seem to differentially impact win probability. More interestingly, we also see that the home team win probability for Carlton, Collingwood, Melbourne, and St Kilda is lower when these teams play at home. Conversely, we also see a lower home team win probability when Sydney is the away team (meaning Sydney is more likely to win on the road). Now that is fascinating!

I hope this brief analysis helps you see why interpretable statistical models are incredibly powerful tools! Too often do researchers and analysts default to black box machine learning or deep learning models because of the incentive to maximise predictive accuracy. Those tools, however, do not afford the level of insight and understand we just obtained, and so it can be difficult to explain why they generated the results that they did.

Evaluating predictive accuracy on the 2025 season

Finally, with our interpretation and inference completed, let’s see how well our model does on out-of-sample data from the 2025 season. All we need to do is calculate predictions for the test data, and compute the average proportion of predicted class labels (i.e., "Home win and "Away win") that match the actual class labels. We’ll assign any predicted probabilities \(<0.5\) to class 0 in our binary setting, which corresponds to "Away win", and predicted probabilities \(>0.5\) to "Home win.

preds <- as.data.frame(predictions(mod, newdata = test, type = "response"))
preds$estimate_label <- ifelse(preds$estimate < 0.5, "Away win", "Home win")
mean(preds$estimate_label == preds$home_win)
## [1] 0.72

\(72\%\)! That is far above chance (i.e., \(50\%\) for a two-class classification problem) and overall not too bad for a first iteration model built for inference and not maximisation of predictive performance.

Parting thoughts

Thanks for reading! Obviously, there is a lot more we could do on the feature engineering side to improve predictive performance, such as including details about injured players (though I do not know where to source such systematic data for this…), team captains, coaches, and a range of other likely useful variables (such as whether a team is coming off the back of a surprising or emphatic win). However, I will say that in cases like modelling sporting win probabilities, there will likely always be a relatively low ceiling for predictive capability compared to other domains, because, at the end of the day, any team can realistically beat any other, despite what the numbers on paper might say. Sport wouldn’t be as exciting as it is without that possibility!


  1. No, I will not be gambling money on the basis of this work.↩︎

  2. There are alternative ways to handle this, but I’ll stick to this simple approach for the purpose of this post.↩︎

  3. Though the same thing could be said for the numerous other variables available to us. We do need to be cognizant of multicollinearity, however.↩︎

  4. I might try tinkering with this more and seeing if more explicit specification of priors and parameters can get it to work because it’s a cool idea.↩︎