Extending fable.gam (part one): Incorporating non-Gaussian families
Published:
Introduction
In my last blog post I introduced my new fable.gam R package which integrates the ability to fit generalised additive models into the incredible fable tidy forecasting framework in R. Since then, I have continued to work on fable.gam, and this post represents the first follow-up post which documents my ongoing progress.
One of the first major areas of improvement that I (and others) wanted to have available was the ability to fit models for non-Gaussian families. Not only is this ability currently mostly absent from fable, but it is also extremely common in my work as a statistician. For example, in a lot of policy evaluation work that I do, I will often be modelling count outcomes, such as the number of people entering residential aged care or the number of crashes that occur on Australian roads. As another example, it is also common for me to have to model positive, continuous, non-zero quantities, such as income. In both of these cases – especially if there are values close to zero in the latter case – a Gaussian model family is inappropriate. Instead, it is far more advisable to (if you are in a parametric setting, which we are here) think about a probability distribution which better captures your beliefs about the data generating process. This might represent a Poisson distribution in the case of counts or a gamma distribution in the case of positive, non-zero values.
Leveraging the distributional R package for vectorised distribution objects
One of the benefits (and challenges!) of adding support for parametric model families is the ability to represent them as objects in R which fable can then use for highly efficient computations, given their vectorised nature. The distributional R package handles this, whilst also providing methods which are wrappers to the standard d, p, q, and r distribution functions, as well as additional distributional statistics, including the mean(), median(), variance(), and intervals with hilo().
Basically, my key task these past few days has been to figure out how to enable fable.gam users to specify a model family in the model() stage of the fable workflow, and then have all the subsequent calculations from this object be performed correctly, including transformations if there is a link function (such as the log-link) used.
Another thing I have been thinking about is this excellent post by Gavin Simpson which discusses the various ways to compute uncertainty for generalised linear models and the underlying assumption of symmetry if this calculation is performed on the response scale rather than the link scale.
Enough theory, what did you do?
Okay, now this is still a work in progress, but basically, I have created a new family() special function to add to the GAM() formula where you can specify one of a few model families available in mgcv for fitting GAMs as well as the link function (for the Gaussian family only, for now). The set of currently supported options include:
- Gaussian family with any link function (e.g., identity)
- Gamma family with log link
- Poisson family with log link
- Negative binomial family with log link
Under the hood, I have implemented code which computes predictions as part of the forecast() function and then tries to correctly map and store those predictions as the appropriate distributional object. For example, in the Gaussian case, I have aimed to calculate the predictions and then pass them into distributional::dist_normal via:
distributional::dist_normal(preds$fit, sqrt(preds$se.fit^2 + model$sig2))where preds$fit represents the mean (location parameter; \(\mu\)) of the distribution and sqrt(preds$se.fit^2 + model$sig2) represents the standard deviation (scale parameter; \(\sigma^{2}\)) of the distribution. In the Poisson case (which currently only supports the log-link), we instead have:
distributional::dist_poisson(preds)where preds represents the rate parameter (mean and variance) of the distribution, also known as \(\lambda\).
With all of that out of the way, let’s dive into some case studies to see how this works in practice.
Example 1: Forecasting counts
Count quantities are incredibly common in applied statistics. From estimating the number of homeless people in a city to the number of workers displaced by AI initiatives at technology companies to the number of Nodo donuts that Trent eats every time he is back in Brisbane, count quantities are everywhere. This makes having the ability to model them properly essential to any modern statistical forecasting tool.
We will first load some packages to get started:
library(dplyr)
library(tsibble)
library(tsibbledata)
library(ggplot2)
library(fable)
library(fable.gam)Now we can get some count data. At work, I have been recently analyzing crash data from Queensland roads. We can easily pull this from the Transport and Main Roads data portal, aggregate to a monthly level, and draw a plot:
crash_locations <- read.csv("https://www.data.qld.gov.au/datastore/dump/e88943c0-5968-4972-a15f-38e120d72ec0?bom=True")
crash_locations <- crash_locations |>
filter(Crash_Severity != "Property damage only") |> # Only goes to 31 December 2010 so skews overall down
mutate(yearmon = yearmonth(as.Date(paste0("01-", Crash_Month, "-", Crash_Year),
format = "%d-%B-%Y"))) |>
reframe(counter = n(), .by = "yearmon") |>
as_tsibble(index = "yearmon")
crash_locations |>
autoplot()We can verify that we are dealing with count data:
head(crash_locations)## # A tsibble: 6 x 2 [1M]
## yearmon counter
## <mth> <int>
## 1 2001 Jan 982
## 2 2001 Feb 1034
## 3 2001 Mar 1240
## 4 2001 Apr 1004
## 5 2001 May 1195
## 6 2001 Jun 1221Okey dokey, we can now fit a GAM model and specify the Poisson model family using the family() special function that is designed to work in fable.gam. We will also compute one year of forecasts and draw a plot to see how they look.
fc <- crash_locations |>
model(gam = GAM(counter ~ trend2(k = 5, bs = "gp") + season(12) + family(poisson))) |>
forecast(h = "52 weeks")
fc |>
autoplot(crash_locations)Not bad! We can verify that the forecasts are all valid integer counts:
head(fc)## # A fable: 6 x 4 [1M]
## # Key: .model [1]
## .model yearmon counter .mean
## <chr> <mth> <dist> <dbl>
## 1 gam 2025 Jul Pois(1287) 1287.
## 2 gam 2025 Aug Pois(1299) 1299.
## 3 gam 2025 Sep Pois(1221) 1221.
## 4 gam 2025 Oct Pois(1276) 1276.
## 5 gam 2025 Nov Pois(1252) 1252.
## 6 gam 2025 Dec Pois(1132) 1132.Clearly we have the appropriate Poisson distributions and so everything should be in order, but we can compute a \(95\%\) interval from the first one to further verify:
hilo(fc$counter[1])## <hilo[1]>
## [1] [1217, 1357]95Example 2: Forecasting positive, continuous quantities
In the first example we modelled counts. For this last one, we will examine the case where a different distribution might be best. For this one, we are going to take an applied lens and model the Cash Rate Target set by the Reserve Bank of Australia. The cash rate is the interest rate on unsecured overnight loans between banks. It is the (near) risk-free benchmark rate (RFR) for the Australian dollar and is also know by the acronym AONIA in financial markets. While (in my non-economist’s understanding) the cash rate could theoretically be zero or negative to stimulate certain parts of the economy, there is no historical precedent for this in Australia, and so we will assume here that it must be positive.
We can download the data I have presaved to the GitHub repository of my blog and wrangle it into a tsibble via the following:
cr <- read.csv("https://raw.githubusercontent.com/hendersontrent/hendersontrent.github.io/refs/heads/master/files/cash_rate.csv")
cr <- cr |>
dplyr::select(c(date, cash_rate)) |>
mutate(date = as.Date(date, format = "%d-%b-%y")) |>
filter(date >= as.Date("02-08-1990", format = "%d-%m-%Y")) |>
mutate(cash_rate = as.numeric(cash_rate)) |>
as_tsibble(index = "date")
cr |>
autoplot()You might have the inclination to fit a gamma family here given the positive, continuous nature of the data. However, if we take a look at the empirical distribution, we have a pretty interesting shape:
plot(density(cr$cash_rate))Now contrast that to the canonical gamma probability density function shapes you can see on resources like Wikipedia or a statistics textbook:
Pretty different! The gamma family also assumes a specific mean-variance relationship that I do not believe is appropriate for this data. We can test the gamma GAM anyway and see how it looks:
cr |>
model(gam = GAM(cash_rate ~ trend2(k = 8, bs = "gp") + season(12) + family(Gamma, link = "log"))) |>
forecast(h = "52 weeks") |>
autoplot(cr)Not good… Instead, let’s try a Gaussian family but with a log-link function (i.e., a ‘lognormal’ model) instead of the default identity link:
cr |>
model(gam = GAM(cash_rate ~ trend2(k = 8, bs = "gp") + season(12) + family(gaussian, link = "log"))) |>
forecast(h = "52 weeks") |>
autoplot(cr)Much more sensible! As an additional note, it is worth pointing out that there are alternatives in fable such as specifying an interval which forecasts must be bounded in – which is particularly helpful if, for example, government policy mandates that a quantity be constrained between a lower and upper bound. This is discussed in the excellent fpp3 forecasting textbook.
Parting thoughts
Thank you for reading! I’m not sure what the future cadence of fable.gam updates will be like given my balancing of both full-time work and finishing my PhD, but I will try and post here as often as I can when an exciting new feature is added. See you next time!
